TileDB-SOMA Python API Reference

SOMA — for stack of matrices, annotated — is a unified data model and API for single-cell data.

If you know about obs, var, and X, you’ll recognize what you’re seeing.

The data model and API — here as implemented using the TileDB storage engine — allow you to persist, investigate, and share annotated 2D matrices, commonly used in single-cell biology.

Features:

  • Flexible, extensible, and open-source API

  • Supports access to persistent, cloud-resident annotated 2D matrix datasets

  • Enables use within popular data science environments (e.g., R, Python), using the tools of that environment (e.g., Python Pandas integration), with the same storage regardless of language

  • Allows interoperability with multiple tools including AnnData, Scanpy, Seurat, and Bioconductor

  • Cloud-native TileDB arrays allow you to slice straight from remote storage

  • Reduces costs and processing time by utilizing cost-efficient object storage services like S3

  • Enables out-of-core access to data aggregations much larger than single-host main memory

  • Enables distributed computation over datasets

Modules

Typical usage of the Python interface to TileDB-SOMA will use the top-level module tiledbsoma, e.g.

import tiledbsoma

There is also a submodule io which contains logic for importing data from AnnData to SOMA structure, and exporting back to AnnData.

import tiledbsoma.io

SOMA

class tiledbsoma.SOMA(uri: str, *, name: Optional[str] = None, soma_options: Optional[tiledbsoma.soma_options.SOMAOptions] = None, config: Optional[tiledb.ctx.Config] = None, ctx: Optional[tiledb.ctx.Ctx] = None, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)

Class for representing a group of TileDB groups/arrays that constitute an SOMA (‘stack of matrices, annotated’) which includes:

  • X (group of AssayMatrixGroup): a group of one or more labeled 2D sparse arrays that share the same dimensions.

  • obs (AnnotationDataframe): 1D labeled array with column labels for X

  • var (AnnotationDataframe): 1D labeled array with row labels for X

  • obsm (group of AnnotationMatrix): multi-attribute arrays keyed by IDs of obs

  • varm (group of AnnotationMatrix): multi-attribute arrays keyed by IDs of var

  • obsp (group of AnnotationMatrix): 2D arrays keyed by IDs of obs

  • varp (group of AnnotationMatrix): 2D arrays keyed by IDs of var

  • raw: contains raw versions of X and varm

  • uns: nested, unstructured data

Convenience accessors include:

  • soma.obs_keys() for soma.obs_names for soma.obs.ids()

  • soma.var_keys() for soma.var_names for soma.var.ids()

  • soma.n_obs for soma.obs.shape()[0]

  • soma.n_var for soma.var.shape()[0]

add_X_layer(matrix: Union[numpy.ndarray, scipy.sparse._csr.csr_matrix, scipy.sparse._csc.csc_matrix], layer_name: str = 'data') None

Populates the X or raw.X subgroup for a SOMA object.

dim_slice(obs_ids: Optional[Union[List[str], List[bytes]]], var_ids: Optional[Union[List[str], List[bytes]]], *, return_arrow: bool = False) Optional[tiledbsoma.soma_slice.SOMASlice]

Subselects the SOMA’s obs, var, and X/data using the specified obs_ids and var_ids. Using a value of None for obs_ids means use all obs_ids, and likewise for var_ids. Returns None for empty slice.

classmethod from_soma_slice(soma_slice: tiledbsoma.soma_slice.SOMASlice, uri: str, name: Optional[str] = None, soma_options: Optional[tiledbsoma.soma_options.SOMAOptions] = None, config: Optional[tiledb.ctx.Config] = None, ctx: Optional[tiledb.ctx.Ctx] = None, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None) tiledbsoma.soma.SOMA

Constructs SOMA storage from a given in-memory SOMASlice object.

get_obs_value_counts(obs_label: str) pandas.core.frame.DataFrame

Given an obs label, e.g. cell_type, returns a dataframe count the number of different values for that label in the SOMA.

get_var_value_counts(var_label: str) pandas.core.frame.DataFrame

Given a var label, e.g. feature_name, returns a dataframe count the number of different values for that label in the SOMA.

obs_keys() Sequence[str]

An alias for soma.obs.ids().

classmethod queries(somas: Sequence[tiledbsoma.soma.SOMA], *, obs_attrs: Optional[Sequence[str]] = None, obs_query_string: Optional[str] = None, obs_ids: Optional[Union[List[str], List[bytes]]] = None, var_attrs: Optional[Sequence[str]] = None, var_query_string: Optional[str] = None, var_ids: Optional[Union[List[str], List[bytes]]] = None, X_layer_names: Optional[Sequence[str]] = None, return_arrow: bool = False, max_thread_pool_workers: Optional[int] = None) List[tiledbsoma.soma_slice.SOMASlice]

Subselects the obs, var, and X/data using the specified queries on obs and var, concatenating across SOMAs in the list. Queries use the TileDB-Py QueryCondition API.

If obs_query_string is None, the obs dimension is not filtered and all of obs is used; similiarly for var. Return value of None indicates an empty slice. If obs_ids or var_ids are not None, they are effectively ANDed into the query. For example, you can pass in a known list of obs_ids, then use obs_query_string to further restrict the query.

If obs_attrs or var_attrs are unspecified, slices will take all obs/var attributes from their source SOMAs; if they are specified, slices will take the specified obs/var attributes. If all SOMAs in the collection have the same obs/var attributes, then you needn’t specify these; if they don’t, you must.

If X_layer_names is None, they are all returned; otherwise you can specify which layer(s) you want to be operated on.

query(*, obs_attrs: Optional[Sequence[str]] = None, obs_query_string: Optional[str] = None, obs_ids: Optional[Union[List[str], List[bytes]]] = None, var_attrs: Optional[Sequence[str]] = None, var_query_string: Optional[str] = None, var_ids: Optional[Union[List[str], List[bytes]]] = None, X_layer_names: Optional[Sequence[str]] = None, return_arrow: bool = False) Optional[tiledbsoma.soma_slice.SOMASlice]

Subselects the SOMA’s obs, var, and X/data using the specified queries on obs and var. Queries use the TileDB-Py QueryCondition API.

If obs_query_string is None, the obs dimension is not filtered and all of obs is used; similiarly for var.

If obs_attrs or var_attrs are unspecified, the slice will take all obs/var attributes from the source SOMAs; if they are specified, the slice will take the specified obs/var

If X_layer_names is None, they are all returned; otherwise you can specify which layer(s) you want to be operated on.

var_keys() Sequence[str]

An alias for soma.var.ids().

SOMACollection

class tiledbsoma.SOMACollection(uri: str, *, name: str = 'soco', soma_options: Optional[tiledbsoma.soma_options.SOMAOptions] = None, config: Optional[tiledb.ctx.Config] = None, ctx: Optional[tiledb.ctx.Ctx] = None, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)

Implements a collection of SOMA objects.

add(soma: tiledbsoma.soma.SOMA, relative: Optional[bool] = None) None

Adds a SOMA to the SOMACollection.

  • If relative is not supplied, it’s taken from the soma_options the collection was instantiated with.

  • If relative is False, either via the relative argument or via soma_options.member_uris_are_relative, then the collection will have the absolute path of the SOMA. For populating SOMA elements within a SOMACollection on local disk, this can be useful if you want to be able to move the SOMACollection storage around and have it remember the (unmoved) locations of SOMA objects elsewhere, i.e. if the SOMACollection is in one place while its members are in other places. If the SOMAs in the collection are contained within the SOMACollection directory, you probably want relative=True.

  • If relative is True, either via the relative argument or via soma_options.member_uris_are_relative, then the group will have the relative path of the member. For TileDB Cloud, this is never the right thing to do. For local-disk or S3 storage, this is essential if you want to move a SOMA to another directory and have it remember the locations of the members within it. In this case the SOMA storage must be located as a direct component of the collection storage. Example: s3://mybucket/soco and s3://mybucket/soco/soma1.

  • If relative is None, either via the relative argument or via soma_options.member_uris_are_relative, then we select relative=False if the URI starts with tiledb://, else we select relative=True. This is the default.

find_unique_obs_values(obs_label: str) Set[str]

Given an obs label such as cell_type or tissue, returns a set of unique values for that label among all SOMAs in the collection.

find_unique_var_values(var_label: str) Set[str]

Given an var label such as feature_name, returns a set of unique values for that label among all SOMAs in the collection.

keys() Sequence[str]

Returns the names of the SOMAs in the collection.

query(*, obs_attrs: Optional[Sequence[str]] = None, obs_query_string: Optional[str] = None, obs_ids: Optional[Union[List[str], List[bytes]]] = None, var_attrs: Optional[Sequence[str]] = None, var_query_string: Optional[str] = None, var_ids: Optional[Union[List[str], List[bytes]]] = None, X_layer_names: Optional[Sequence[str]] = None, return_arrow: bool = False) List[tiledbsoma.soma_slice.SOMASlice]

Subselects the obs, var, and X/data using the specified queries on obs and var, concatenating across SOMAs in the collection. Queries use the TileDB-Py QueryCondition API.

If obs_query_string is None, the obs dimension is not filtered and all of obs is used; similiarly for var. Return value of None indicates an empty slice. If obs_ids or var_ids are not None, they are effectively ANDed into the query. For example, you can pass in a known list of obs_ids, then use obs_query_string to further restrict the query.

If X_layer_names is None, they are all returned; otherwise you can specify which layer(s) you want to be operated on.

If obs_attrs or var_attrs are unspecified, slices will take all obs/var attributes from their source SOMAs; if they are specified, slices will take the specified obs/var attributes. If all SOMAs in the collection have the same obs/var attributes, then you needn’t specify these; if they don’t, you must.

remove(soma: Union[tiledbsoma.soma.SOMA, str]) None

Removes a SOMA from the SOMACollection, when invoked as soco.remove("namegoeshere").

SOMASlice

class tiledbsoma.SOMASlice(X: Dict[str, Union[pandas.core.frame.DataFrame, pyarrow.lib.Table, numpy.ndarray, scipy.sparse._csr.csr_matrix, scipy.sparse._csc.csc_matrix]], obs: Union[pandas.core.frame.DataFrame, pyarrow.lib.Table], var: Union[pandas.core.frame.DataFrame, pyarrow.lib.Table])

In-memory-only object for ephemeral extracting out of a SOMA. Can be used to construct a SOMA but is not a SOMA (which would entail out-of-memory storage). This is simply a collection of either pandas.DataFrame or pyarrow.Table objects.

classmethod concat(soma_slices: Sequence[tiledbsoma.soma_slice.SOMASlice]) Optional[tiledbsoma.soma_slice.SOMASlice]

Concatenates multiple SOMASlice objects into a single one. Implemented using AnnData’s concat. Requires that all slices share the same obs and var keys. Please see the SOMA class method find_common_obs_and_var_keys.

to_anndata() anndata._core.anndata.AnnData

Constructs an AnnData object from the current SOMASlice object.

I/O functions

tiledbsoma.io.from_h5ad(soma: tiledbsoma.soma.SOMA, input_path: Union[str, pathlib.Path], X_layer_name: str = 'data', schema_only: bool = False) None

Reads an .h5ad local-disk file and writes to a TileDB SOMA structure.

tiledbsoma.io.from_anndata(soma: tiledbsoma.soma.SOMA, anndata: anndata._core.anndata.AnnData, X_layer_name: str = 'data', schema_only: bool = False) None

Given an in-memory AnnData object, writes to a TileDB SOMA structure.

tiledbsoma.io.to_h5ad(soma: tiledbsoma.soma.SOMA, h5ad_path: Union[str, pathlib.Path], X_layer_name: str = 'data') None

Converts the soma group to anndata format and writes it to the specified .h5ad file. As of 2022-05-05 this is an incomplete prototype.

tiledbsoma.io.to_anndata(soma: tiledbsoma.soma.SOMA, X_layer_name: str = 'data') anndata._core.anndata.AnnData

Converts the soma group to anndata. Choice of matrix formats is following what we often see in input .h5ad files:

  • X as scipy.sparse.csr_matrix

  • obs, var as pandas.dataframe

  • obsm, varm arrays as numpy.ndarray

  • obsp, varp arrays as scipy.sparse.csr_matrix

Options

class tiledbsoma.SOMAOptions(obs_extent: int = 256, var_extent: int = 2048, X_data_row_filters: typing.List[tiledb.filter.Filter] = <factory>, X_data_col_filters: typing.List[tiledb.filter.Filter] = <factory>, X_data_offset_filters: typing.List[tiledb.filter.Filter] = <factory>, X_data_attr_filters: typing.List[tiledb.filter.Filter] = <factory>, X_capacity: int = 1000, X_tile_order: str = 'row-major', X_cell_order: str = 'row-major', string_dim_zstd_level: int = 3, write_X_chunked: bool = True, goal_chunk_nnz: int = 20000000, member_uris_are_relative: typing.Optional[bool] = None, max_thread_pool_workers: int = 8)

A place to put configuration options various users may wish to change. These are mainly TileDB array-schema parameters.

tiledbsoma.logging.debug() None

Sets tiledbsoma logging to a DEBUG level. Use tiledbsoma.logging.debug() in notebooks to see more detailed progress indicators for data ingestion.

tiledbsoma.logging.info() None

Sets tiledbsoma logging to an INFO level. Use tiledbsoma.logging.info() in notebooks to see progress indicators for data ingestion.

tiledbsoma.logging.log_io(info_message: Optional[str], debug_message: str) None

Data-ingestion timeframes range widely. Some folks won’t want details in the former; some will want details in the latter. For I/O and for I/O only, it’s helpful to print a short message at INFO level, or a different, longer message at/beyond DEBUG level.

tiledbsoma.logging.warning() None

Sets tiledbsoma logging to a WARNING level. Use tiledbsoma.logging.info() in notebooks to suppress progress indicators for data ingestion.

SOMA-element classes

class tiledbsoma.AssayMatrixGroup(uri: str, name: str, row_dim_name: str, col_dim_name: str, row_dataframe: tiledbsoma.annotation_dataframe.AnnotationDataFrame, col_dataframe: tiledbsoma.annotation_dataframe.AnnotationDataFrame, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)

Nominally for X and raw/X elements. You can find element names using soma.X.keys(); you access elements using soma.X['data'] etc., or soma.X.data if you prefer. (The latter syntax is possible when the element name doesn’t have dashes, dots, etc. in it.)

add_layer_from_matrix_and_dim_values(matrix: Union[numpy.ndarray, scipy.sparse._csr.csr_matrix, scipy.sparse._csc.csc_matrix], row_names: Union[Sequence[str], pandas.core.indexes.base.Index], col_names: Union[Sequence[str], pandas.core.indexes.base.Index], layer_name: str = 'data', *, schema_only: bool = False) None

Populates the X or raw.X subgroup for a SOMA object. For X and raw.X, nominally row_names will be anndata.obs_names and col_names will be anndata.var_names or anndata.raw.var_names. For obsp elements, both will be anndata.obs_names; for varp elements, both will be ``anndata.var_names.

keys() Sequence[str]

For obsm and varm, .keys() is a keystroke-saver for the more general group-member accessor .get_member_names().

class tiledbsoma.AssayMatrix(uri: str, name: str, row_dim_name: str, col_dim_name: str, row_dataframe: tiledbsoma.annotation_dataframe.AnnotationDataFrame, col_dataframe: tiledbsoma.annotation_dataframe.AnnotationDataFrame, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)

Wraps a TileDB sparse array with two string dimensions. Used for X, raw.X, obsp elements, and varp elements.

csc(obs_ids: Optional[Union[List[str], List[bytes]]] = None, var_ids: Optional[Union[List[str], List[bytes]]] = None) scipy.sparse._csc.csc_matrix

Like .df() but returns results in scipy.sparse.csc_matrix format.

csr(obs_ids: Optional[Union[List[str], List[bytes]]] = None, var_ids: Optional[Union[List[str], List[bytes]]] = None) scipy.sparse._csr.csr_matrix

Like .df() but returns results in scipy.sparse.csr_matrix format.

df(obs_ids: Optional[Union[List[str], List[bytes]]] = None, var_ids: Optional[Union[List[str], List[bytes]]] = None, *, return_arrow: bool = False) Union[pandas.core.frame.DataFrame, pyarrow.lib.Table]

Keystroke-saving alias for .dim_select(). If either of obs_ids or var_ids are provided, they’re used to subselect; if not, the entire dataframe is returned.

dim_select(obs_ids: Optional[Union[List[str], List[bytes]]], var_ids: Optional[Union[List[str], List[bytes]]], *, return_arrow: bool = False) Union[pandas.core.frame.DataFrame, pyarrow.lib.Table]

Selects a slice out of the matrix with specified obs_ids and/or var_ids. Either or both of the ID lists may be None, meaning, do not subselect along that dimension. If both ID lists are None, the entire matrix is returned.

from_matrix_and_dim_values(matrix: Union[numpy.ndarray, scipy.sparse._csr.csr_matrix, scipy.sparse._csc.csc_matrix], row_names: Union[Sequence[str], pandas.core.indexes.base.Index], col_names: Union[Sequence[str], pandas.core.indexes.base.Index], schema_only: bool = False) None

Imports a matrix — nominally scipy.sparse.csr_matrix or numpy.ndarray — into a TileDB array which is used for X, raw.X, obsp members, and varp members.

The row_names and col_names are row and column labels for the matrix; the matrix may be scipy.sparse.csr_matrix, scipy.sparse.csc_matrix, numpy.ndarray, etc. For ingest from AnnData, these should be ann.obs_names and ann.var_names.

ingest_data_cols_chunked(matrix: scipy.sparse._csc.csc_matrix, row_names: Union[numpy.ndarray, pandas.core.indexes.base.Index], col_names: Union[numpy.ndarray, pandas.core.indexes.base.Index]) None

Convert csc_matrix to coo_matrix chunkwise and ingest into TileDB.

Parameters
  • uri – TileDB URI of the array to be written.

  • matrixcsc_matrix.

  • row_names – List of row names.

  • col_names – List of column names.

ingest_data_dense_rows_chunked(matrix: numpy.ndarray, row_names: Union[numpy.ndarray, pandas.core.indexes.base.Index], col_names: Union[numpy.ndarray, pandas.core.indexes.base.Index]) None

Convert dense matrix to coo_matrix chunkwise and ingest into TileDB.

Parameters
  • uri – TileDB URI of the array to be written.

  • matrix – dense matrix.

  • row_names – List of row names.

  • col_names – List of column names.

ingest_data_rows_chunked(matrix: scipy.sparse._csr.csr_matrix, row_names: Union[numpy.ndarray, pandas.core.indexes.base.Index], col_names: Union[numpy.ndarray, pandas.core.indexes.base.Index]) None

Convert csr_matrix to coo_matrix chunkwise and ingest into TileDB.

Parameters
  • uri – TileDB URI of the array to be written.

  • matrixcsr_matrix.

  • row_names – List of row names.

  • col_names – List of column names.

ingest_data_whole(matrix: Union[numpy.ndarray, scipy.sparse._csr.csr_matrix, scipy.sparse._csc.csc_matrix], row_names: Union[numpy.ndarray, pandas.core.indexes.base.Index], col_names: Union[numpy.ndarray, pandas.core.indexes.base.Index]) None

Convert numpy.ndarray, scipy.sparse.csr_matrix, or scipy.sparse.csc_matrix to COO matrix and ingest into TileDB.

Parameters
  • matrix – Matrix-like object coercible to a scipy COO matrix.

  • row_names – List of row names.

  • col_names – List of column names.

shape() Tuple[int, int]

Returns a tuple with the number of rows and number of columns of the AssayMatrix. In TileDB storage, these are string-indexed sparse arrays for which no .shape() exists, but, we draw from the appropriate obs, var, raw/var, etc. as appropriate for a given matrix.

Note: currently implemented via data scan — will be optimized for TileDB core 2.10.

to_csr_matrix(row_labels: Union[Sequence[str], pandas.core.indexes.base.Index], col_labels: Union[Sequence[str], pandas.core.indexes.base.Index]) scipy.sparse._csr.csr_matrix

Reads the TileDB array storage for the storage and returns a sparse CSR matrix. The row/columns labels should be obs,var labels if the AssayMatrix is X, or obs,obs labels if the AssayMatrix is obsp, or var,var labels if the AssayMatrix is varp. Note in all cases that TileDB will have sorted the row and column labels; they won’t be in the same order as they were in any anndata object which was used to create the TileDB storage.

class tiledbsoma.AnnotationDataFrame(uri: str, name: str, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)

Nominally for obs and var data within a soma. These have one string dimension, and multiple attributes.

df(ids: Optional[Union[List[str], List[bytes]]] = None, attrs: Optional[Sequence[str]] = None, *, return_arrow: bool = False) Union[pandas.core.frame.DataFrame, pyarrow.lib.Table]

Keystroke-saving alias for .dim_select(). If ids are provided, they’re used to subselect; if not, the entire dataframe is returned. If attrs are provided, they’re used for the query; else, all attributes are returned.

dim_select(ids: Optional[Union[List[str], List[bytes]]], attrs: Optional[Sequence[str]] = None, *, return_arrow: bool = False) Union[pandas.core.frame.DataFrame, pyarrow.lib.Table]

Selects a slice out of the dataframe with specified obs_ids (for obs) or var_ids (for var). If ids is None, the entire dataframe is returned. Similarly, if attrs are provided, they’re used for the query; else, all attributes are returned.

from_dataframe(dataframe: pandas.core.frame.DataFrame, *, extent: int = 2048, schema_only: bool = False) None

Populates the obs or var subgroup for a SOMA object.

Parameters
  • dataframeanndata.obs, anndata.var, anndata.raw.var.

  • extent – TileDB extent parameter for the array schema.

ids() Sequence[str]

Returns the obs_ids in the matrix (for obs) or the var_ids (for var).

keys() Sequence[str]

Returns the column names for the obs or var dataframe. For obs and varp, .keys() is a keystroke-saver for the more general array-schema accessor attr_names.

keyset() Set[str]

Same as .keys but returns as set.

query(query_string: Optional[str], ids: Optional[Union[List[str], List[bytes]]] = None, attrs: Optional[Sequence[str]] = None, *, return_arrow: bool = False) Union[pandas.core.frame.DataFrame, pyarrow.lib.Table]

Selects from obs/var using a TileDB-Py QueryCondition string such as cell_type == "blood". If attrs is None, returns all column names in the dataframe; use [] for attrs to select none of them. Any column names specified in the query_string must be included in attrs if attrs is not None. Returns None if the slice is empty.

shape() Tuple[int, int]

Returns a tuple with the number of rows and number of columns of the AnnotationDataFrame. The row-count is the number of obs_ids (for obs) or the number of var_ids (for var). The column-count is the number of columns/attributes in the dataframe.

class tiledbsoma.AnnotationMatrixGroup(uri: str, name: str, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)

Nominally for soma obsm and varm. You can find element names using soma.obsm.keys(); you access elements using soma.obsm['X_pca'] etc., or soma.obsm.X_pca if you prefer. (The latter syntax is possible when the element name doesn’t have dashes, dots, etc. in it.)

add_matrix_from_matrix_and_dim_values(matrix: Union[pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._csr.csr_matrix, scipy.sparse._csc.csc_matrix], dim_values: Union[Sequence[str], pandas.core.indexes.base.Index], matrix_name: str, *, schema_only: bool = False) None

Populates a component of the obsm or varm subgroup for a SOMA object.

Parameters
  • matrix – element of anndata.obsm, anndata.varm, or anndata.raw.varm.

  • dim_values – anndata.obs_names, anndata.var_names, or anndata.raw.var_names.

  • matrix_name – name of the matrix, like "X_tsne" or "PCs".

keys() Sequence[str]

For obsm and varm, .keys() is a keystroke-saver for the more general group-member accessor .get_member_names().

remove(matrix_name: str) None

Removes a component of the obsm or varm subgroup for a SOMA object, when invoked as soma.obsm.remove("namegoeshere").

to_dict_of_csr() Dict[str, numpy.ndarray]

Reads the obsm/varm group-member arrays into a dict from name to member array. Member arrays are returned in sparse CSR format.

class tiledbsoma.AnnotationMatrix(uri: str, name: str, dim_name: str, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)

Nominally for obsm and varm group elements within a SOMA.

df(ids: Optional[Union[List[str], List[bytes]]] = None, *, return_arrow: bool = False) Union[pandas.core.frame.DataFrame, pyarrow.lib.Table]

Keystroke-saving alias for .dim_select(). If ids are provided, they’re used to subselect; if not, the entire dataframe is returned.

dim_select(ids: Optional[Union[List[str], List[bytes]]] = None, *, return_arrow: bool = False) Union[pandas.core.frame.DataFrame, pyarrow.lib.Table]

Selects a slice out of the array with specified obs_ids (for obsm elements) or var_ids (for varm elements). If ids is None, the entire array is returned.

from_matrix_and_dim_values(matrix: Union[pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._csr.csr_matrix, scipy.sparse._csc.csc_matrix], dim_values: Union[Sequence[str], pandas.core.indexes.base.Index], schema_only: bool = False) None

Populates an array in the obsm/ or varm/ subgroup for a SOMA object.

Parameters
  • matrixanndata.obsm['foo'], anndata.varm['foo'], or anndata.raw.varm['foo'].

  • dim_valuesanndata.obs_names, anndata.var_names, or anndata.raw.var_names.

shape() Tuple[int, int]

Returns a tuple with the number of rows and number of columns of the AnnotationMatrix. The row-count is the number of obs_ids (for obsm elements) or the number of var_ids (for varm elements). The column-count is the number of columns/attributes in the dataframe.

Note: currently implemented via data scan — will be optimized in an upcoming TileDB Core release.

class tiledbsoma.AnnotationPairwiseMatrixGroup(uri: str, name: str, row_dataframe: tiledbsoma.annotation_dataframe.AnnotationDataFrame, col_dataframe: tiledbsoma.annotation_dataframe.AnnotationDataFrame, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)

Nominally for SOMA obsp and varp. You can find element names using soma.obsp.keys(); you access elements using soma.obsp['distances'] etc., or soma.obsp.distances if you prefer. (The latter syntax is possible when the element name doesn’t have dashes, dots, etc. in it.)

add_matrix_from_matrix_and_dim_values(matrix: Union[numpy.ndarray, scipy.sparse._csr.csr_matrix, scipy.sparse._csc.csc_matrix], dim_values: Union[Sequence[str], pandas.core.indexes.base.Index], matrix_name: str, *, schema_only: bool = False) None

Populates a component of the obsp or varp subgroup for a SOMA object.

Parameters
  • matrix – element of anndata.obsp or anndata.varp.

  • dim_values – anndata.obs_names or anndata.var_names.

  • matrix_name_name – name of the matrix, like "distances".

keys() Sequence[str]

For obsp and varp, .keys() is a keystroke-saver for the more general group-member accessor .get_member_names().

remove(matrix_name: str) None

Removes a component of the obsp or varp subgroup for a SOMA object. Implements del soma.obsp['distances'] etc.

to_dict_of_csr(obs_df_index: Union[Sequence[str], pandas.core.indexes.base.Index], var_df_index: Union[Sequence[str], pandas.core.indexes.base.Index]) Dict[str, scipy.sparse._csr.csr_matrix]

Reads the obsp or varp group-member arrays into a dict from name to member array. Member arrays are returned in sparse CSR format.

class tiledbsoma.RawGroup(uri: str, name: str, obs: tiledbsoma.annotation_dataframe.AnnotationDataFrame, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)

Nominally for soma raw.

from_anndata(anndata: anndata._core.anndata.AnnData, X_layer_name: str = 'data', *, schema_only: bool = False) None

Writes anndata.raw to a TileDB group structure.

to_anndata_raw(obs_labels: Union[Sequence[str], pandas.core.indexes.base.Index], X_layer_name: str = 'data') Tuple[scipy.sparse._csr.csr_matrix, pandas.core.frame.DataFrame, Dict[str, numpy.ndarray]]

Reads TileDB storage and returns the material for an anndata.Raw object. The obs_labels must be from the parent object.

class tiledbsoma.UnsGroup(uri: str, name: str, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)

Nominally for soma uns.

from_anndata_uns(uns: Mapping[str, Any]) None

Populates the uns group for the soma object.

Parameters

uns – anndata.uns.

keys() Sequence[str]

For uns, .keys() is a keystroke-saver for the more general group-member accessor .get_member_names().

to_dict_of_matrices() Mapping[str, Any]

Reads the recursive group/array uns data from TileDB storage and returns them as a recursive dict of matrices.

class tiledbsoma.UnsArray(uri: str, name: str, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)

Holds TileDB storage for an array obtained from the nested anndata.uns field.

create_empty_array_for_csr(attr_name: str, matrix_dtype: numpy.dtype, nrows: int, ncols: int) None

Create a TileDB 2D sparse array with int dimensions and a single attribute. Nominally used for uns data.

Parameters
  • matrix_dtype – datatype of the matrix

  • nrows – number of rows in the matrix

  • ncols – number of columns in the matrix

from_numpy_ndarray(arr: numpy.ndarray) None

Writes a numpy.ndarray to a TileDB array, nominally for ingest of uns nested data from anndata objects. Mostly tiledb.from_numpy, but with some necessary handling for data with UTF-8 values.

from_pandas_dataframe(df: pandas.core.frame.DataFrame) None

Ingests an UnsArray into TileDB storage, given a pandas.DataFrame.

from_scipy_csr(csr: scipy.sparse._csr.csr_matrix) None

Convert ndarray/(csr|csc)matrix to coo_matrix and ingest into TileDB.

Parameters

csr – Matrix-like object coercible to a scipy coo_matrix.

ingest_data_from_csr(csr: scipy.sparse._csr.csr_matrix) None

Convert ndarray/(csr|csc)matrix to coo_matrix and ingest into TileDB.

Parameters

csr – Matrix-like object coercible to a scipy coo_matrix.

to_matrix() numpy.ndarray

Reads an uns array from TileDB storage and returns a matrix – currently, always as numpy.ndarray.

Implementation-level classes

class tiledbsoma.TileDBArray(uri: str, name: str, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)

Wraps arrays from TileDB-Py by retaining a URI, options, etc. Also serves as an abstraction layer to hide TileDB-specific details from the API, unless requested.

attr_names() Sequence[str]

Reads the attribute names from the schema: for example, the list of column names in a dataframe.

attr_names_to_types() Dict[str, str]

Returns a dict mapping from attribute name to attribute type.

dim_names() Sequence[str]

Reads the dimension names from the schema: for example, [‘obs_id’, ‘var_id’].

dim_names_to_types() Dict[str, str]

Returns a dict mapping from dimension name to dimension type.

exists() bool

Tells whether or not there is storage for the array. This might be in case a SOMA object has not yet been populated, e.g. before calling from_anndata — or, if the SOMA has been populated but doesn’t have this member (e.g. not all SOMAs have a varp).

has_attr_name(attr_name: str) bool

Returns true if the array has the specified attribute name, false otherwise.

has_attr_names(attr_names: Sequence[str]) bool

Returns true if the array has all of the specified attribute names, false otherwise.

show_metadata(recursively: bool = True, indent: str = '') None

Shows metadata for the array.

tiledb_array_schema() tiledb.libtiledb.ArraySchema

Returns the TileDB array schema.

class tiledbsoma.TileDBGroup(uri: str, name: str, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None, soma_options: Optional[tiledbsoma.soma_options.SOMAOptions] = None, ctx: Optional[tiledb.ctx.Ctx] = None)

Wraps groups from TileDB-Py by retaining a URI, options, etc.

create_unless_exists() None

Creates the TileDB group data structure on disk/S3/cloud, unless it already exists.

exists() bool

Tells whether or not there is storage for the group. This might be in case a SOMA object has not yet been populated, e.g. before calling from_anndata – or, if the SOMA has been populated but doesn’t have this member (e.g. not all SOMAs have a varp).

get_member_names() Sequence[str]

Returns the names of the group elements. For a SOMACollection, these will SOMA names; for a SOMA, these will be matrix/group names; etc.

get_member_names_to_uris() Dict[str, str]

Like get_member_names() and get_member_uris, but returns a dict mapping from member name to member URI.

get_member_uris() Sequence[str]

Returns the URIs of the group elements. For a SOMACollection, these will SOMA URIs; for a SOMA, these will be matrix/group URIs; etc.

show_metadata(recursively: bool = True, indent: str = '') None

Shows metadata for the group, recursively by default.

class tiledbsoma.TileDBObject(uri: str, name: str, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None, soma_options: Optional[tiledbsoma.soma_options.SOMAOptions] = None, ctx: Optional[tiledb.ctx.Ctx] = None)

Base class for TileDBArray and TileDBGroup.

Manages soma_options, ctx, etc. which are common to both.

get_metadata(key: str) Any

Returns metadata associated with the group/array. Raises KeyError if there is no such key in the metadata.

get_object_type() str

Returns the class name associated with the group/array.

has_metadata(key: str) bool

Returns whether metadata is associated with the group/array.

metadata() Mapping[str, Any]

Returns metadata from the group/array as a dict.

metadata_keys() Sequence[str]

Returns metadata keys associated with the group/array.

set_metadata(key: str, value: Any) None

Returns metadata associated with the group/array.

class tiledbsoma.util.ETATracker

Computes estimated time to completion for chunked writes.

ingest_and_predict(chunk_percent: float, chunk_seconds: float) str

Updates from most recent chunk percent-done and chunk completion-seconds, then does a linear regression on all chunks done so far and estimates time to completion. :param chunk_percent: a percent done like 6.1 or 10.3. :param chunk_seconds: number of seconds it took to do the current chunk operation.

tiledbsoma.util.X_and_ids_to_sparse_matrix(Xdf: pandas.core.frame.DataFrame, row_dim_name: str, col_dim_name: str, attr_name: str, row_labels: Union[Sequence[str], pandas.core.indexes.base.Index], col_labels: Union[Sequence[str], pandas.core.indexes.base.Index], return_as: str = 'csr') Union[scipy.sparse._csr.csr_matrix, scipy.sparse._csc.csc_matrix]

This is needed when we read a TileDB X.df[:]. Since TileDB X is sparse 2D string-dimensioned, the return value of which is a dict with three columns – obs_id, var_id, and value. For conversion to anndata, we need make a sparse COO/IJV-format array where the indices are not strings but ints, matching the obs and var labels. The return_as parameter must be one of "csr" or "csc".

tiledbsoma.util.format_elapsed(start_stamp: float, message: str) str

Returns the message along with an elapsed-time indicator, with end time relative to start start from get_start_stamp. Used for annotating elapsed time of a task.

tiledbsoma.util.get_start_stamp() float

Returns information about start time of an event. Nominally float seconds since the epoch, but articulated here as being compatible with the format_elapsed function.

tiledbsoma.util.is_local_path(path: str) bool

Returns information about start time of an event. Nominally float seconds since the epoch, but articulated here as being compatible with the format_elapsed function.

tiledbsoma.util.is_soma(uri: str, ctx: Optional[tiledb.ctx.Ctx] = None) bool

Tells whether the URI points to a SOMA or not.

tiledbsoma.util.is_soma_collection(uri: str, ctx: Optional[tiledb.ctx.Ctx] = None) bool

Tells whether the URI points to a SOMACollection or not.

tiledbsoma.util.triples_to_dense_df(sparse_df: pandas.core.frame.DataFrame, fillna: float = 0.0) pandas.core.frame.DataFrame

Output from X dataframe reads is in “triples” format, e.g. two index columns obs_id and var_id, and data column value. This is the default format, and is appropriate for large, possibly sparse matrices. However, sometimes we want a dense matrix with obs_id row labels, var_id column labels, and value data. This function produces that.

tiledbsoma.util_ann.describe_ann_file(input_path: Union[str, pathlib.Path], show_summary: bool = True, show_types: bool = False, show_data: bool = False) None

This is an anndata-describer that goes a bit beyond what h5ls does for us. In particular, it shows us that for one HDF5 file we have anndata.X being of type numpy.ndarray while for another HDF5 file we have anndata.X being of type scipy.sparse.csr.csr_matrix. This is crucial information for building I/O logic that accepts a diversity of anndata HDF5 files.

tiledbsoma.util_tiledb.show_soma_schemas(soma_uri: str, ctx: Optional[tiledb.ctx.Ctx] = None) None

Show some summary information about an ingested TileDB Single-Cell Group. This tool goes a bit beyond print(tiledb.group.Group(soma_uri)) by also revealing array schema. Additionally, by employing encoded domain-specific knowleldge, it traverses items in the familiar order X, obs, var, etc. rather than using the general-purpose tiledb-group-display function.

tiledbsoma.util_tiledb.show_tiledb_group_array_schemas(uri: str, ctx: Optional[tiledb.ctx.Ctx] = None) None

Recursively show array schemas within a TileDB Group. This function is not specific to single-cell matrix-API data, and won’t necessarily traverse items in a familiar application-specific order.