TileDB-SOMA Python API Reference
SOMA (for "stack of matrices, annotated") is a unified data model and API for single-cell data.
If you know about obs, var, and X, you’ll recognize what you’re seeing.
The data model and API, here implemented using the TileDB storage engine, allow you to persist, investigate, and share the annotated 2D matrices commonly used in single-cell biology.
Features:
Flexible, extensible, and open-source API
Supports access to persistent, cloud-resident annotated 2D matrix datasets
Enables use within popular data science environments (e.g., R, Python), using the tools of that environment (e.g., Python Pandas integration), with the same storage regardless of language
Allows interoperability with multiple tools including AnnData, Scanpy, Seurat, and Bioconductor
Cloud-native TileDB arrays allow you to slice straight from remote storage
Reduces costs and processing time by utilizing cost-efficient object storage services like S3
Enables out-of-core access to data aggregations much larger than single-host main memory
Enables distributed computation over datasets
Modules
Typical usage of the Python interface to TileDB-SOMA will use the top-level module tiledbsoma, e.g.
import tiledbsoma
There is also a submodule io which contains logic for importing data from AnnData to SOMA structure, and exporting back to AnnData.
import tiledbsoma.io
SOMA
- class tiledbsoma.SOMA(uri: str, *, name: Optional[str] = None, soma_options: Optional[tiledbsoma.soma_options.SOMAOptions] = None, config: Optional[tiledb.ctx.Config] = None, ctx: Optional[tiledb.ctx.Ctx] = None, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)¶
Class for representing a group of TileDB groups/arrays that constitute a SOMA (‘stack of matrices, annotated’), which includes:
X (group of AssayMatrixGroup): a group of one or more labeled 2D sparse arrays that share the same dimensions.
obs (AnnotationDataFrame): 1D labeled array with row labels for X
var (AnnotationDataFrame): 1D labeled array with column labels for X
obsm (group of AnnotationMatrix): multi-attribute arrays keyed by IDs of obs
varm (group of AnnotationMatrix): multi-attribute arrays keyed by IDs of var
obsp (group of AnnotationMatrix): 2D arrays keyed by IDs of obs
varp (group of AnnotationMatrix): 2D arrays keyed by IDs of var
raw: contains raw versions of X and varm
uns: nested, unstructured data
Convenience accessors include:
soma.obs_keys() for soma.obs_names for soma.obs.ids()
soma.var_keys() for soma.var_names for soma.var.ids()
soma.n_obs for soma.obs.shape()[0]
soma.n_var for soma.var.shape()[0]
- add_X_layer(matrix: Union[numpy.ndarray, scipy.sparse._csr.csr_matrix, scipy.sparse._csc.csc_matrix], layer_name: str = 'data') None ¶
Populates the X or raw.X subgroup for a SOMA object.
- dim_slice(obs_ids: Optional[Union[List[str], List[bytes]]], var_ids: Optional[Union[List[str], List[bytes]]], *, return_arrow: bool = False) Optional[tiledbsoma.soma_slice.SOMASlice] ¶
Subselects the SOMA’s obs, var, and X/data using the specified obs_ids and var_ids. Using a value of None for obs_ids means use all obs_ids, and likewise for var_ids. Returns None for an empty slice.
- classmethod from_soma_slice(soma_slice: tiledbsoma.soma_slice.SOMASlice, uri: str, name: Optional[str] = None, soma_options: Optional[tiledbsoma.soma_options.SOMAOptions] = None, config: Optional[tiledb.ctx.Config] = None, ctx: Optional[tiledb.ctx.Ctx] = None, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None) tiledbsoma.soma.SOMA ¶
Constructs SOMA storage from a given in-memory SOMASlice object.
- get_obs_value_counts(obs_label: str) pandas.core.frame.DataFrame ¶
Given an obs label, e.g. cell_type, returns a dataframe counting the occurrences of each distinct value of that label in the SOMA.
- get_var_value_counts(var_label: str) pandas.core.frame.DataFrame ¶
Given a var label, e.g. feature_name, returns a dataframe counting the occurrences of each distinct value of that label in the SOMA.
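The value-count idea behind get_obs_value_counts and get_var_value_counts can be illustrated with the standard library alone. This is only a sketch: the real methods return a pandas DataFrame whose exact column layout is not described here, and the label values below are made up.

```python
from collections import Counter

# Made-up cell_type values, one per obs row (illustrative data only).
cell_types = ["blood", "lung", "blood", "blood", "lung", "liver"]

# Count how often each distinct value occurs, most frequent first.
value_counts = Counter(cell_types).most_common()
```

Here value_counts is a list of (value, count) pairs, which is roughly the information the documented dataframe carries.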
- classmethod queries(somas: Sequence[tiledbsoma.soma.SOMA], *, obs_attrs: Optional[Sequence[str]] = None, obs_query_string: Optional[str] = None, obs_ids: Optional[Union[List[str], List[bytes]]] = None, var_attrs: Optional[Sequence[str]] = None, var_query_string: Optional[str] = None, var_ids: Optional[Union[List[str], List[bytes]]] = None, X_layer_names: Optional[Sequence[str]] = None, return_arrow: bool = False, max_thread_pool_workers: Optional[int] = None) List[tiledbsoma.soma_slice.SOMASlice] ¶
Subselects the obs, var, and X/data using the specified queries on obs and var, concatenating across SOMAs in the list. Queries use the TileDB-Py QueryCondition API.
If obs_query_string is None, the obs dimension is not filtered and all of obs is used; similarly for var. Return value of None indicates an empty slice. If obs_ids or var_ids are not None, they are effectively ANDed into the query. For example, you can pass in a known list of obs_ids, then use obs_query_string to further restrict the query.
If obs_attrs or var_attrs are unspecified, slices will take all obs/var attributes from their source SOMAs; if they are specified, slices will take the specified obs/var attributes. If all SOMAs in the collection have the same obs/var attributes, then you needn’t specify these; if they don’t, you must.
If X_layer_names is None, all layers are returned; otherwise you can specify which layer(s) you want to be operated on.
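To make the "ANDed into the query" behaviour concrete, here is a minimal pure-Python sketch. It is not the library's implementation: select_obs is a hypothetical helper, the rows are made-up data, and a Python predicate stands in for a QueryCondition string such as cell_type == "blood".

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]


def select_obs(
    rows: List[Row],
    obs_ids: Optional[List[str]] = None,
    predicate: Optional[Callable[[Row], bool]] = None,
) -> List[Row]:
    """Keep rows that pass BOTH the ID filter and the predicate (hypothetical helper)."""
    wanted = None if obs_ids is None else set(obs_ids)
    selected = []
    for row in rows:
        # None means "do not filter on IDs".
        if wanted is not None and row["obs_id"] not in wanted:
            continue
        # None means "do not filter on attribute values".
        if predicate is not None and not predicate(row):
            continue
        selected.append(row)
    return selected


# Made-up obs rows for illustration.
obs = [
    {"obs_id": "AAAT", "cell_type": "blood"},
    {"obs_id": "CATG", "cell_type": "lung"},
    {"obs_id": "GGAC", "cell_type": "blood"},
]

hits = select_obs(
    obs,
    obs_ids=["AAAT", "CATG"],
    predicate=lambda row: row["cell_type"] == "blood",
)
```

Only the row that is both in the ID list and of cell type blood survives; passing neither filter returns every row.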
- query(*, obs_attrs: Optional[Sequence[str]] = None, obs_query_string: Optional[str] = None, obs_ids: Optional[Union[List[str], List[bytes]]] = None, var_attrs: Optional[Sequence[str]] = None, var_query_string: Optional[str] = None, var_ids: Optional[Union[List[str], List[bytes]]] = None, X_layer_names: Optional[Sequence[str]] = None, return_arrow: bool = False) Optional[tiledbsoma.soma_slice.SOMASlice] ¶
Subselects the SOMA’s obs, var, and X/data using the specified queries on obs and var. Queries use the TileDB-Py QueryCondition API.
If obs_query_string is None, the obs dimension is not filtered and all of obs is used; similarly for var.
If obs_attrs or var_attrs are unspecified, the slice will take all obs/var attributes from the source SOMA; if they are specified, the slice will take the specified obs/var attributes.
If X_layer_names is None, all layers are returned; otherwise you can specify which layer(s) you want to be operated on.
SOMACollection
- class tiledbsoma.SOMACollection(uri: str, *, name: str = 'soco', soma_options: Optional[tiledbsoma.soma_options.SOMAOptions] = None, config: Optional[tiledb.ctx.Config] = None, ctx: Optional[tiledb.ctx.Ctx] = None, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)¶
Implements a collection of SOMA objects.
- add(soma: tiledbsoma.soma.SOMA, relative: Optional[bool] = None) None
Adds a SOMA to the SOMACollection.
If relative is not supplied, it’s taken from the soma_options the collection was instantiated with.
If relative is False, either via the relative argument or via soma_options.member_uris_are_relative, then the collection will have the absolute path of the SOMA. For populating SOMA elements within a SOMACollection on local disk, this can be useful if you want to be able to move the SOMACollection storage around and have it remember the (unmoved) locations of SOMA objects elsewhere, i.e. if the SOMACollection is in one place while its members are in other places. If the SOMAs in the collection are contained within the SOMACollection directory, you probably want relative=True.
If relative is True, either via the relative argument or via soma_options.member_uris_are_relative, then the group will have the relative path of the member. For TileDB Cloud, this is never the right thing to do. For local-disk or S3 storage, this is essential if you want to move a SOMACollection to another directory and have it remember the locations of the members within it. In this case the SOMA storage must be located as a direct component of the collection storage. Example: s3://mybucket/soco and s3://mybucket/soco/soma1.
If relative is None, either via the relative argument or via soma_options.member_uris_are_relative, then we select relative=False if the URI starts with tiledb://, else we select relative=True. This is the default.
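The fallback order described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the library's code: resolve_relative and its parameter names are hypothetical, and the URIs are examples.

```python
from typing import Optional


def resolve_relative(
    relative: Optional[bool],
    options_default: Optional[bool],
    collection_uri: str,
) -> bool:
    """Hypothetical helper mirroring the documented fallback order."""
    # 1. An explicit argument wins; otherwise use the value from soma_options.
    if relative is None:
        relative = options_default
    # 2. If still undecided, TileDB Cloud URIs get absolute member URIs
    #    and everything else gets relative ones.
    if relative is None:
        relative = not collection_uri.startswith("tiledb://")
    return relative
```

For example, resolve_relative(None, None, "s3://mybucket/soco") yields True, while the same call with a tiledb:// URI yields False.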
- find_unique_obs_values(obs_label: str) Set[str] ¶
Given an obs label such as cell_type or tissue, returns a set of unique values for that label among all SOMAs in the collection.
- find_unique_var_values(var_label: str) Set[str] ¶
Given a var label such as feature_name, returns a set of unique values for that label among all SOMAs in the collection.
- query(*, obs_attrs: Optional[Sequence[str]] = None, obs_query_string: Optional[str] = None, obs_ids: Optional[Union[List[str], List[bytes]]] = None, var_attrs: Optional[Sequence[str]] = None, var_query_string: Optional[str] = None, var_ids: Optional[Union[List[str], List[bytes]]] = None, X_layer_names: Optional[Sequence[str]] = None, return_arrow: bool = False) List[tiledbsoma.soma_slice.SOMASlice] ¶
Subselects the obs, var, and X/data using the specified queries on obs and var, concatenating across SOMAs in the collection. Queries use the TileDB-Py QueryCondition API.
If obs_query_string is None, the obs dimension is not filtered and all of obs is used; similarly for var. Return value of None indicates an empty slice. If obs_ids or var_ids are not None, they are effectively ANDed into the query. For example, you can pass in a known list of obs_ids, then use obs_query_string to further restrict the query.
If X_layer_names is None, all layers are returned; otherwise you can specify which layer(s) you want to be operated on.
If obs_attrs or var_attrs are unspecified, slices will take all obs/var attributes from their source SOMAs; if they are specified, slices will take the specified obs/var attributes. If all SOMAs in the collection have the same obs/var attributes, then you needn’t specify these; if they don’t, you must.
- remove(soma: Union[tiledbsoma.soma.SOMA, str]) None ¶
Removes a SOMA from the SOMACollection, when invoked as soco.remove("namegoeshere").
SOMASlice
- class tiledbsoma.SOMASlice(X: Dict[str, Union[pandas.core.frame.DataFrame, pyarrow.lib.Table, numpy.ndarray, scipy.sparse._csr.csr_matrix, scipy.sparse._csc.csc_matrix]], obs: Union[pandas.core.frame.DataFrame, pyarrow.lib.Table], var: Union[pandas.core.frame.DataFrame, pyarrow.lib.Table])¶
In-memory-only object for ephemeral extraction of data out of a SOMA. It can be used to construct a SOMA but is not itself a SOMA (which would entail out-of-memory storage). It is simply a collection of either pandas.DataFrame or pyarrow.Table objects.
- classmethod concat(soma_slices: Sequence[tiledbsoma.soma_slice.SOMASlice]) Optional[tiledbsoma.soma_slice.SOMASlice]
Concatenates multiple SOMASlice objects into a single one. Implemented using AnnData’s concat. Requires that all slices share the same obs and var keys. Please see the SOMA class method find_common_obs_and_var_keys.
- to_anndata() anndata._core.anndata.AnnData ¶
Constructs an AnnData object from the current SOMASlice object.
I/O functions
- tiledbsoma.io.from_h5ad(soma: tiledbsoma.soma.SOMA, input_path: Union[str, pathlib.Path], X_layer_name: str = 'data', schema_only: bool = False) None ¶
Reads a local-disk .h5ad file and writes to a TileDB SOMA structure.
- tiledbsoma.io.from_anndata(soma: tiledbsoma.soma.SOMA, anndata: anndata._core.anndata.AnnData, X_layer_name: str = 'data', schema_only: bool = False) None ¶
Given an in-memory AnnData object, writes to a TileDB SOMA structure.
- tiledbsoma.io.to_h5ad(soma: tiledbsoma.soma.SOMA, h5ad_path: Union[str, pathlib.Path], X_layer_name: str = 'data') None ¶
Converts the soma group to anndata format and writes it to the specified .h5ad file. As of 2022-05-05 this is an incomplete prototype.
- tiledbsoma.io.to_anndata(soma: tiledbsoma.soma.SOMA, X_layer_name: str = 'data') anndata._core.anndata.AnnData ¶
Converts the soma group to anndata. The choice of matrix formats follows what we often see in input .h5ad files:
X as scipy.sparse.csr_matrix
obs, var as pandas.DataFrame
obsm, varm arrays as numpy.ndarray
obsp, varp arrays as scipy.sparse.csr_matrix
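A round trip through the I/O functions might look like the sketch below. It uses only the constructor and function signatures listed in this reference; the wrapper name h5ad_round_trip and both path arguments are hypothetical, and the imports are done lazily so that defining the function does not itself require tiledbsoma to be installed.

```python
def h5ad_round_trip(h5ad_path: str, soma_uri: str):
    """Ingest an .h5ad file into SOMA storage, then read it back as AnnData.

    Hypothetical convenience wrapper; both arguments are caller-supplied paths.
    """
    # Lazy imports: only needed when the function is actually called.
    import tiledbsoma
    import tiledbsoma.io

    # Create a handle on the (possibly not yet populated) SOMA storage.
    soma = tiledbsoma.SOMA(soma_uri)

    # Write the contents of the .h5ad file into the SOMA.
    tiledbsoma.io.from_h5ad(soma, h5ad_path)

    # Read the SOMA back into an in-memory AnnData object.
    return tiledbsoma.io.to_anndata(soma)
```

Calling h5ad_round_trip with a local .h5ad file and a fresh directory for the SOMA would be one way to check that an ingest preserved the data you care about.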
Options
- class tiledbsoma.SOMAOptions(obs_extent: int = 256, var_extent: int = 2048, X_data_row_filters: typing.List[tiledb.filter.Filter] = <factory>, X_data_col_filters: typing.List[tiledb.filter.Filter] = <factory>, X_data_offset_filters: typing.List[tiledb.filter.Filter] = <factory>, X_data_attr_filters: typing.List[tiledb.filter.Filter] = <factory>, X_capacity: int = 1000, X_tile_order: str = 'row-major', X_cell_order: str = 'row-major', string_dim_zstd_level: int = 3, write_X_chunked: bool = True, goal_chunk_nnz: int = 20000000, member_uris_are_relative: typing.Optional[bool] = None, max_thread_pool_workers: int = 8)¶
A place to put configuration options various users may wish to change. These are mainly TileDB array-schema parameters.
- tiledbsoma.logging.debug() None ¶
Sets tiledbsoma logging to a DEBUG level. Use tiledbsoma.logging.debug() in notebooks to see more detailed progress indicators for data ingestion.
- tiledbsoma.logging.info() None ¶
Sets tiledbsoma logging to an INFO level. Use tiledbsoma.logging.info() in notebooks to see progress indicators for data ingestion.
- tiledbsoma.logging.log_io(info_message: Optional[str], debug_message: str) None ¶
Data-ingestion timeframes range widely. Some users won’t want details for short ingestions; others will want details for long ones. For I/O, and for I/O only, it’s helpful to print a short message at INFO level, or a different, longer message at/beyond DEBUG level.
SOMA-element classes
- class tiledbsoma.AssayMatrixGroup(uri: str, name: str, row_dim_name: str, col_dim_name: str, row_dataframe: tiledbsoma.annotation_dataframe.AnnotationDataFrame, col_dataframe: tiledbsoma.annotation_dataframe.AnnotationDataFrame, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)¶
Nominally for X and raw/X elements. You can find element names using soma.X.keys(); you access elements using soma.X['data'] etc., or soma.X.data if you prefer. (The latter syntax is possible when the element name doesn’t have dashes, dots, etc. in it.)
- add_layer_from_matrix_and_dim_values(matrix: Union[numpy.ndarray, scipy.sparse._csr.csr_matrix, scipy.sparse._csc.csc_matrix], row_names: Union[Sequence[str], pandas.core.indexes.base.Index], col_names: Union[Sequence[str], pandas.core.indexes.base.Index], layer_name: str = 'data', *, schema_only: bool = False) None
Populates the X or raw.X subgroup for a SOMA object. For X and raw.X, nominally row_names will be anndata.obs_names and col_names will be anndata.var_names or anndata.raw.var_names. For obsp elements, both will be anndata.obs_names; for varp elements, both will be anndata.var_names.
- class tiledbsoma.AssayMatrix(uri: str, name: str, row_dim_name: str, col_dim_name: str, row_dataframe: tiledbsoma.annotation_dataframe.AnnotationDataFrame, col_dataframe: tiledbsoma.annotation_dataframe.AnnotationDataFrame, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)¶
Wraps a TileDB sparse array with two string dimensions. Used for X, raw.X, obsp elements, and varp elements.
- csc(obs_ids: Optional[Union[List[str], List[bytes]]] = None, var_ids: Optional[Union[List[str], List[bytes]]] = None) scipy.sparse._csc.csc_matrix
Like .df() but returns results in scipy.sparse.csc_matrix format.
- csr(obs_ids: Optional[Union[List[str], List[bytes]]] = None, var_ids: Optional[Union[List[str], List[bytes]]] = None) scipy.sparse._csr.csr_matrix ¶
Like .df() but returns results in scipy.sparse.csr_matrix format.
- df(obs_ids: Optional[Union[List[str], List[bytes]]] = None, var_ids: Optional[Union[List[str], List[bytes]]] = None, *, return_arrow: bool = False) Union[pandas.core.frame.DataFrame, pyarrow.lib.Table] ¶
Keystroke-saving alias for .dim_select(). If either obs_ids or var_ids is provided, it is used to subselect; if neither is provided, the entire dataframe is returned.
- dim_select(obs_ids: Optional[Union[List[str], List[bytes]]], var_ids: Optional[Union[List[str], List[bytes]]], *, return_arrow: bool = False) Union[pandas.core.frame.DataFrame, pyarrow.lib.Table] ¶
Selects a slice out of the matrix with specified obs_ids and/or var_ids. Either or both of the ID lists may be None, meaning do not subselect along that dimension. If both ID lists are None, the entire matrix is returned.
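The None semantics of dim_select can be sketched on a list of (obs_id, var_id, value) triples. This is a pure-Python illustration only: dim_select_sketch is a hypothetical stand-in, and the real method reads from TileDB storage and returns a dataframe or Arrow table.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, float]


def dim_select_sketch(
    triples: List[Triple],
    obs_ids: Optional[List[str]] = None,
    var_ids: Optional[List[str]] = None,
) -> List[Triple]:
    """Filter triples along each dimension; None means 'keep everything'."""
    obs_set = None if obs_ids is None else set(obs_ids)
    var_set = None if var_ids is None else set(var_ids)
    return [
        (obs_id, var_id, value)
        for obs_id, var_id, value in triples
        if (obs_set is None or obs_id in obs_set)
        and (var_set is None or var_id in var_set)
    ]


# Made-up matrix entries for illustration.
x_triples = [
    ("cell0", "geneA", 1.0),
    ("cell0", "geneB", 4.0),
    ("cell1", "geneA", 2.0),
]
```

With both arguments left as None the whole list comes back; supplying one list restricts only that dimension.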
- from_matrix_and_dim_values(matrix: Union[numpy.ndarray, scipy.sparse._csr.csr_matrix, scipy.sparse._csc.csc_matrix], row_names: Union[Sequence[str], pandas.core.indexes.base.Index], col_names: Union[Sequence[str], pandas.core.indexes.base.Index], schema_only: bool = False) None ¶
Imports a matrix (nominally scipy.sparse.csr_matrix or numpy.ndarray) into a TileDB array which is used for X, raw.X, obsp members, and varp members.
The row_names and col_names are row and column labels for the matrix; the matrix may be scipy.sparse.csr_matrix, scipy.sparse.csc_matrix, numpy.ndarray, etc. For ingest from AnnData, these should be ann.obs_names and ann.var_names.
- ingest_data_cols_chunked(matrix: scipy.sparse._csc.csc_matrix, row_names: Union[numpy.ndarray, pandas.core.indexes.base.Index], col_names: Union[numpy.ndarray, pandas.core.indexes.base.Index]) None ¶
Convert csc_matrix to coo_matrix chunkwise and ingest into TileDB.
- Parameters
uri – TileDB URI of the array to be written.
matrix – csc_matrix.
row_names – List of row names.
col_names – List of column names.
- ingest_data_dense_rows_chunked(matrix: numpy.ndarray, row_names: Union[numpy.ndarray, pandas.core.indexes.base.Index], col_names: Union[numpy.ndarray, pandas.core.indexes.base.Index]) None ¶
Convert dense matrix to coo_matrix chunkwise and ingest into TileDB.
- Parameters
uri – TileDB URI of the array to be written.
matrix – dense matrix.
row_names – List of row names.
col_names – List of column names.
- ingest_data_rows_chunked(matrix: scipy.sparse._csr.csr_matrix, row_names: Union[numpy.ndarray, pandas.core.indexes.base.Index], col_names: Union[numpy.ndarray, pandas.core.indexes.base.Index]) None ¶
Convert csr_matrix to coo_matrix chunkwise and ingest into TileDB.
- Parameters
uri – TileDB URI of the array to be written.
matrix – csr_matrix.
row_names – List of row names.
col_names – List of column names.
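One plausible way to cut a CSR matrix into chunks is to group consecutive rows until each group holds roughly a target number of stored values, in the spirit of the goal_chunk_nnz option. The sketch below is an assumption about how such chunking could work, not the library's algorithm; row_chunks_by_nnz is a hypothetical helper that operates on a CSR indptr array given as a plain list.

```python
from typing import List, Tuple


def row_chunks_by_nnz(indptr: List[int], goal_chunk_nnz: int) -> List[Tuple[int, int]]:
    """Return half-open row ranges [start, stop).

    Each range is extended row by row until it holds at least
    goal_chunk_nnz stored values; the final range may hold fewer.
    """
    nrows = len(indptr) - 1
    chunks = []
    start = 0
    while start < nrows:
        stop = start + 1
        # indptr[stop] - indptr[start] is the number of stored values in rows start..stop-1.
        while stop < nrows and indptr[stop] - indptr[start] < goal_chunk_nnz:
            stop += 1
        chunks.append((start, stop))
        start = stop
    return chunks


# Four rows holding 3, 2, 4, and 1 stored values respectively.
example_chunks = row_chunks_by_nnz([0, 3, 5, 9, 10], goal_chunk_nnz=5)
```

Each resulting row range could then be converted to COO form and written on its own, which keeps peak memory bounded by the chunk size rather than by the whole matrix.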
- ingest_data_whole(matrix: Union[numpy.ndarray, scipy.sparse._csr.csr_matrix, scipy.sparse._csc.csc_matrix], row_names: Union[numpy.ndarray, pandas.core.indexes.base.Index], col_names: Union[numpy.ndarray, pandas.core.indexes.base.Index]) None ¶
Convert numpy.ndarray, scipy.sparse.csr_matrix, or scipy.sparse.csc_matrix to COO matrix and ingest into TileDB.
- Parameters
matrix – Matrix-like object coercible to a scipy COO matrix.
row_names – List of row names.
col_names – List of column names.
- shape() Tuple[int, int] ¶
Returns a tuple with the number of rows and number of columns of the AssayMatrix. In TileDB storage, these are string-indexed sparse arrays for which no .shape() exists, so the counts are drawn from the appropriate obs, var, raw/var, etc. for a given matrix.
Note: currently implemented via data scan; will be optimized for TileDB core 2.10.
- to_csr_matrix(row_labels: Union[Sequence[str], pandas.core.indexes.base.Index], col_labels: Union[Sequence[str], pandas.core.indexes.base.Index]) scipy.sparse._csr.csr_matrix ¶
Reads the TileDB array storage and returns a sparse CSR matrix. The row/column labels should be obs,var labels if the AssayMatrix is X, or obs,obs labels if the AssayMatrix is obsp, or var,var labels if the AssayMatrix is varp. Note in all cases that TileDB will have sorted the row and column labels; they won’t be in the same order as they were in any anndata object which was used to create the TileDB storage.
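Because stored labels come back sorted, recovering the original AnnData order is a matter of computing a permutation. The following pure-Python sketch shows the idea on made-up labels; restore_order is a hypothetical helper and is not part of tiledbsoma.

```python
from typing import List


def restore_order(sorted_labels: List[str], original_labels: List[str]) -> List[int]:
    """For each original label, give its row position in the sorted storage order."""
    position = {label: index for index, label in enumerate(sorted_labels)}
    return [position[label] for label in original_labels]


# Made-up labels in their original (unsorted) AnnData order.
original = ["cell_c", "cell_a", "cell_b"]
stored = sorted(original)  # the order in which storage returns them

# Rows as read back from storage: one row each for cell_a, cell_b, cell_c.
rows_sorted = [[1, 0], [2, 0], [0, 3]]

perm = restore_order(stored, original)
rows_original = [rows_sorted[i] for i in perm]
```

The same permutation applies to any per-row annotation read alongside the matrix, so it only needs to be computed once per dimension.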
- class tiledbsoma.AnnotationDataFrame(uri: str, name: str, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)¶
Nominally for obs and var data within a soma. These have one string dimension, and multiple attributes.
- df(ids: Optional[Union[List[str], List[bytes]]] = None, attrs: Optional[Sequence[str]] = None, *, return_arrow: bool = False) Union[pandas.core.frame.DataFrame, pyarrow.lib.Table]
Keystroke-saving alias for .dim_select(). If ids are provided, they’re used to subselect; if not, the entire dataframe is returned. If attrs are provided, they’re used for the query; else, all attributes are returned.
- dim_select(ids: Optional[Union[List[str], List[bytes]]], attrs: Optional[Sequence[str]] = None, *, return_arrow: bool = False) Union[pandas.core.frame.DataFrame, pyarrow.lib.Table] ¶
Selects a slice out of the dataframe with specified obs_ids (for obs) or var_ids (for var). If ids is None, the entire dataframe is returned. Similarly, if attrs are provided, they’re used for the query; else, all attributes are returned.
- from_dataframe(dataframe: pandas.core.frame.DataFrame, *, extent: int = 2048, schema_only: bool = False) None ¶
Populates the obs or var subgroup for a SOMA object.
- Parameters
dataframe – anndata.obs, anndata.var, anndata.raw.var.
extent – TileDB extent parameter for the array schema.
- keys() Sequence[str] ¶
Returns the column names for the obs or var dataframe. For obs and var, .keys() is a keystroke-saver for the more general array-schema accessor attr_names.
- query(query_string: Optional[str], ids: Optional[Union[List[str], List[bytes]]] = None, attrs: Optional[Sequence[str]] = None, *, return_arrow: bool = False) Union[pandas.core.frame.DataFrame, pyarrow.lib.Table] ¶
Selects from obs/var using a TileDB-Py QueryCondition string such as cell_type == "blood". If attrs is None, returns all column names in the dataframe; use [] for attrs to select none of them. Any column names specified in the query_string must be included in attrs if attrs is not None. Returns None if the slice is empty.
- class tiledbsoma.AnnotationMatrixGroup(uri: str, name: str, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)¶
Nominally for soma obsm and varm. You can find element names using soma.obsm.keys(); you access elements using soma.obsm['X_pca'] etc., or soma.obsm.X_pca if you prefer. (The latter syntax is possible when the element name doesn’t have dashes, dots, etc. in it.)
- add_matrix_from_matrix_and_dim_values(matrix: Union[pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._csr.csr_matrix, scipy.sparse._csc.csc_matrix], dim_values: Union[Sequence[str], pandas.core.indexes.base.Index], matrix_name: str, *, schema_only: bool = False) None
Populates a component of the obsm or varm subgroup for a SOMA object.
- Parameters
matrix – element of anndata.obsm, anndata.varm, or anndata.raw.varm.
dim_values – anndata.obs_names, anndata.var_names, or anndata.raw.var_names.
matrix_name – name of the matrix, like "X_tsne" or "PCs".
- keys() Sequence[str] ¶
For obsm and varm, .keys() is a keystroke-saver for the more general group-member accessor .get_member_names().
- class tiledbsoma.AnnotationMatrix(uri: str, name: str, dim_name: str, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)¶
Nominally for obsm and varm group elements within a SOMA.
- df(ids: Optional[Union[List[str], List[bytes]]] = None, *, return_arrow: bool = False) Union[pandas.core.frame.DataFrame, pyarrow.lib.Table]
Keystroke-saving alias for .dim_select(). If ids are provided, they’re used to subselect; if not, the entire dataframe is returned.
- dim_select(ids: Optional[Union[List[str], List[bytes]]] = None, *, return_arrow: bool = False) Union[pandas.core.frame.DataFrame, pyarrow.lib.Table] ¶
Selects a slice out of the array with specified obs_ids (for obsm elements) or var_ids (for varm elements). If ids is None, the entire array is returned.
- from_matrix_and_dim_values(matrix: Union[pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._csr.csr_matrix, scipy.sparse._csc.csc_matrix], dim_values: Union[Sequence[str], pandas.core.indexes.base.Index], schema_only: bool = False) None ¶
Populates an array in the obsm/ or varm/ subgroup for a SOMA object.
- Parameters
matrix – anndata.obsm['foo'], anndata.varm['foo'], or anndata.raw.varm['foo'].
dim_values – anndata.obs_names, anndata.var_names, or anndata.raw.var_names.
- shape() Tuple[int, int] ¶
Returns a tuple with the number of rows and number of columns of the AnnotationMatrix. The row-count is the number of obs_ids (for obsm elements) or the number of var_ids (for varm elements). The column-count is the number of columns/attributes in the dataframe.
Note: currently implemented via data scan; will be optimized in an upcoming TileDB Core release.
- class tiledbsoma.AnnotationPairwiseMatrixGroup(uri: str, name: str, row_dataframe: tiledbsoma.annotation_dataframe.AnnotationDataFrame, col_dataframe: tiledbsoma.annotation_dataframe.AnnotationDataFrame, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)¶
Nominally for SOMA obsp and varp. You can find element names using soma.obsp.keys(); you access elements using soma.obsp['distances'] etc., or soma.obsp.distances if you prefer. (The latter syntax is possible when the element name doesn’t have dashes, dots, etc. in it.)
- add_matrix_from_matrix_and_dim_values(matrix: Union[numpy.ndarray, scipy.sparse._csr.csr_matrix, scipy.sparse._csc.csc_matrix], dim_values: Union[Sequence[str], pandas.core.indexes.base.Index], matrix_name: str, *, schema_only: bool = False) None
Populates a component of the obsp or varp subgroup for a SOMA object.
- Parameters
matrix – element of anndata.obsp or anndata.varp.
dim_values – anndata.obs_names or anndata.var_names.
matrix_name – name of the matrix, like "distances".
- keys() Sequence[str] ¶
For obsp and varp, .keys() is a keystroke-saver for the more general group-member accessor .get_member_names().
- remove(matrix_name: str) None ¶
Removes a component of the obsp or varp subgroup for a SOMA object. Implements del soma.obsp['distances'] etc.
- to_dict_of_csr(obs_df_index: Union[Sequence[str], pandas.core.indexes.base.Index], var_df_index: Union[Sequence[str], pandas.core.indexes.base.Index]) Dict[str, scipy.sparse._csr.csr_matrix] ¶
Reads the obsp or varp group-member arrays into a dict from name to member array. Member arrays are returned in sparse CSR format.
- class tiledbsoma.RawGroup(uri: str, name: str, obs: tiledbsoma.annotation_dataframe.AnnotationDataFrame, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)¶
Nominally for soma raw.
- from_anndata(anndata: anndata._core.anndata.AnnData, X_layer_name: str = 'data', *, schema_only: bool = False) None ¶
Writes anndata.raw to a TileDB group structure.
- to_anndata_raw(obs_labels: Union[Sequence[str], pandas.core.indexes.base.Index], X_layer_name: str = 'data') Tuple[scipy.sparse._csr.csr_matrix, pandas.core.frame.DataFrame, Dict[str, numpy.ndarray]] ¶
Reads TileDB storage and returns the material for an anndata.Raw object. The obs_labels must be from the parent object.
- class tiledbsoma.UnsGroup(uri: str, name: str, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)¶
Nominally for soma uns.
- from_anndata_uns(uns: Mapping[str, Any]) None ¶
Populates the uns group for the soma object.
- Parameters
uns – anndata.uns.
- class tiledbsoma.UnsArray(uri: str, name: str, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)¶
Holds TileDB storage for an array obtained from the nested anndata.uns field.
- create_empty_array_for_csr(attr_name: str, matrix_dtype: numpy.dtype, nrows: int, ncols: int) None
Create a TileDB 2D sparse array with int dimensions and a single attribute. Nominally used for uns data.
- Parameters
matrix_dtype – datatype of the matrix
nrows – number of rows in the matrix
ncols – number of columns in the matrix
- from_numpy_ndarray(arr: numpy.ndarray) None ¶
Writes a numpy.ndarray to a TileDB array, nominally for ingest of uns nested data from anndata objects. Mostly tiledb.from_numpy, but with some necessary handling for data with UTF-8 values.
- from_pandas_dataframe(df: pandas.core.frame.DataFrame) None ¶
Ingests an UnsArray into TileDB storage, given a pandas.DataFrame.
- from_scipy_csr(csr: scipy.sparse._csr.csr_matrix) None ¶
Convert ndarray/(csr|csc)matrix to coo_matrix and ingest into TileDB.
- Parameters
csr – Matrix-like object coercible to a scipy coo_matrix.
- ingest_data_from_csr(csr: scipy.sparse._csr.csr_matrix) None ¶
Convert ndarray/(csr|csc)matrix to coo_matrix and ingest into TileDB.
- Parameters
csr – Matrix-like object coercible to a scipy coo_matrix.
- to_matrix() numpy.ndarray ¶
Reads an uns array from TileDB storage and returns a matrix (currently always a numpy.ndarray).
Implementation-level classes
- class tiledbsoma.TileDBArray(uri: str, name: str, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None)¶
Wraps arrays from TileDB-Py by retaining a URI, options, etc. Also serves as an abstraction layer to hide TileDB-specific details from the API, unless requested.
- attr_names() Sequence[str] ¶
Reads the attribute names from the schema: for example, the list of column names in a dataframe.
- attr_names_to_types() Dict[str, str] ¶
Returns a dict mapping from attribute name to attribute type.
- dim_names() Sequence[str] ¶
Reads the dimension names from the schema: for example, [‘obs_id’, ‘var_id’].
- exists() bool ¶
Tells whether or not there is storage for the array. Storage may be absent if a SOMA object has not yet been populated, e.g. before calling from_anndata, or if the SOMA has been populated but doesn’t have this member (e.g. not all SOMAs have a varp).
- has_attr_name(attr_name: str) bool ¶
Returns true if the array has the specified attribute name, false otherwise.
- has_attr_names(attr_names: Sequence[str]) bool ¶
Returns true if the array has all of the specified attribute names, false otherwise.
- tiledb_array_schema() tiledb.libtiledb.ArraySchema ¶
Returns the TileDB array schema.
- class tiledbsoma.TileDBGroup(uri: str, name: str, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None, soma_options: Optional[tiledbsoma.soma_options.SOMAOptions] = None, ctx: Optional[tiledb.ctx.Ctx] = None)¶
Wraps groups from TileDB-Py by retaining a URI, options, etc.
- create_unless_exists() None ¶
Creates the TileDB group data structure on disk/S3/cloud, unless it already exists.
- exists() bool ¶
Tells whether or not there is storage for the group. Storage may be absent if a SOMA object has not yet been populated, e.g. before calling from_anndata, or if the SOMA has been populated but doesn’t have this member (e.g. not all SOMAs have a varp).
- get_member_names() Sequence[str] ¶
Returns the names of the group elements. For a SOMACollection, these will be SOMA names; for a SOMA, these will be matrix/group names; etc.
- get_member_names_to_uris() Dict[str, str] ¶
Like get_member_names() and get_member_uris, but returns a dict mapping from member name to member URI.
- class tiledbsoma.TileDBObject(uri: str, name: str, *, parent: Optional[tiledbsoma.tiledb_group.TileDBGroup] = None, soma_options: Optional[tiledbsoma.soma_options.SOMAOptions] = None, ctx: Optional[tiledb.ctx.Ctx] = None)¶
Base class for TileDBArray and TileDBGroup.
Manages soma_options, ctx, etc. which are common to both.
- class tiledbsoma.util.ETATracker¶
Computes estimated time to completion for chunked writes.
- ingest_and_predict(chunk_percent: float, chunk_seconds: float) str ¶
Updates from the most recent chunk’s percent-done and completion-seconds, then does a linear regression on all chunks done so far and estimates time to completion.
- Parameters
chunk_percent – a percent done like 6.1 or 10.3.
chunk_seconds – number of seconds it took to do the current chunk operation.
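The regression-based estimate can be sketched as follows. This is an assumption-laden illustration rather than the library's code: EtaSketch is a hypothetical class, it fits cumulative elapsed seconds against cumulative percent done by ordinary least squares, extrapolates to 100 percent, and returns the remaining seconds as a float (the documented method returns a string).

```python
from typing import List


class EtaSketch:
    """Hypothetical ETA estimator: least-squares fit of elapsed time vs. percent done."""

    def __init__(self) -> None:
        self.percents: List[float] = []  # cumulative percent done after each chunk
        self.elapsed: List[float] = []   # cumulative seconds after each chunk

    def ingest_and_predict(self, percent_done: float, chunk_seconds: float) -> float:
        if percent_done <= 0:
            raise ValueError("percent_done must be positive")
        total_elapsed = (self.elapsed[-1] if self.elapsed else 0.0) + chunk_seconds
        self.percents.append(percent_done)
        self.elapsed.append(total_elapsed)

        n = len(self.percents)
        mean_x = sum(self.percents) / n
        mean_y = sum(self.elapsed) / n
        sxx = sum((x - mean_x) ** 2 for x in self.percents)
        if n < 2 or sxx == 0:
            # Not enough spread to fit a line: assume a constant rate from zero.
            predicted_total = total_elapsed * 100.0 / percent_done
        else:
            sxy = sum(
                (x - mean_x) * (y - mean_y)
                for x, y in zip(self.percents, self.elapsed)
            )
            slope = sxy / sxx
            intercept = mean_y - slope * mean_x
            # Extrapolate the fitted line to 100 percent done.
            predicted_total = slope * 100.0 + intercept
        return max(predicted_total - total_elapsed, 0.0)
```

Fitting a line over all chunks, rather than scaling from the latest chunk alone, smooths out individual slow or fast chunks.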
- tiledbsoma.util.X_and_ids_to_sparse_matrix(Xdf: pandas.core.frame.DataFrame, row_dim_name: str, col_dim_name: str, attr_name: str, row_labels: Union[Sequence[str], pandas.core.indexes.base.Index], col_labels: Union[Sequence[str], pandas.core.indexes.base.Index], return_as: str = 'csr') Union[scipy.sparse._csr.csr_matrix, scipy.sparse._csc.csc_matrix] ¶
This is needed when we read a TileDB X.df[:]. Since TileDB X is sparse, 2D, and string-dimensioned, the return value is a dataframe with three columns: obs_id, var_id, and value. For conversion to anndata, we need to make a sparse COO/IJV-format array whose indices are not strings but ints, matching the positions of the obs and var labels. The return_as parameter must be one of "csr" or "csc".
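The core of that conversion is mapping string labels to integer positions. A minimal pure-Python sketch, assuming the triples arrive as (obs_id, var_id, value) tuples, is shown below; triples_to_coo is a hypothetical helper and the data is made up.

```python
from typing import List, Sequence, Tuple


def triples_to_coo(
    triples: Sequence[Tuple[str, str, float]],
    row_labels: Sequence[str],
    col_labels: Sequence[str],
) -> Tuple[List[int], List[int], List[float]]:
    """Translate string-labeled triples into integer COO/IJV coordinate lists."""
    row_index = {label: i for i, label in enumerate(row_labels)}
    col_index = {label: j for j, label in enumerate(col_labels)}
    rows: List[int] = []
    cols: List[int] = []
    values: List[float] = []
    for obs_id, var_id, value in triples:
        rows.append(row_index[obs_id])
        cols.append(col_index[var_id])
        values.append(value)
    return rows, cols, values


# Made-up example: two cells, three genes, two stored values.
rows, cols, values = triples_to_coo(
    [("c1", "g2", 5.0), ("c0", "g0", 1.0)],
    row_labels=["c0", "c1"],
    col_labels=["g0", "g1", "g2"],
)
```

The three lists are in the form accepted by scipy.sparse.coo_matrix((values, (rows, cols)), shape=...), whose result can then be converted to CSR or CSC.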
- tiledbsoma.util.format_elapsed(start_stamp: float, message: str) str ¶
Returns the message along with an elapsed-time indicator, with the end time measured relative to a start stamp obtained from get_start_stamp. Used for annotating elapsed time of a task.
- tiledbsoma.util.get_start_stamp() float ¶
Returns information about start time of an event. Nominally float seconds since the epoch, but articulated here as being compatible with the format_elapsed function.
- tiledbsoma.util.is_local_path(path: str) bool ¶
Tells whether the given path is on the local filesystem, as opposed to a remote URI.
- tiledbsoma.util.is_soma(uri: str, ctx: Optional[tiledb.ctx.Ctx] = None) bool ¶
Tells whether the URI points to a SOMA or not.
- tiledbsoma.util.is_soma_collection(uri: str, ctx: Optional[tiledb.ctx.Ctx] = None) bool ¶
Tells whether the URI points to a SOMACollection or not.
- tiledbsoma.util.triples_to_dense_df(sparse_df: pandas.core.frame.DataFrame, fillna: float = 0.0) pandas.core.frame.DataFrame ¶
Output from X dataframe reads is in “triples” format, i.e. two index columns obs_id and var_id, and a data column value. This is the default format, and is appropriate for large, possibly sparse matrices. However, sometimes we want a dense matrix with obs_id row labels, var_id column labels, and value data. This function produces that.
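The triples-to-dense transformation can be illustrated without pandas. The sketch below is not the library's implementation: triples_to_dense is a hypothetical helper that sorts the labels, fills missing cells with a default (playing the role of fillna), and returns nested lists instead of a DataFrame.

```python
from typing import List, Sequence, Tuple


def triples_to_dense(
    triples: Sequence[Tuple[str, str, float]],
    fill: float = 0.0,
) -> Tuple[List[str], List[str], List[List[float]]]:
    """Pivot (obs_id, var_id, value) triples into a dense row-major table."""
    row_labels = sorted({obs_id for obs_id, _, _ in triples})
    col_labels = sorted({var_id for _, var_id, _ in triples})
    row_index = {label: i for i, label in enumerate(row_labels)}
    col_index = {label: j for j, label in enumerate(col_labels)}
    # Start from a table full of the fill value, then place each stored value.
    dense = [[fill] * len(col_labels) for _ in row_labels]
    for obs_id, var_id, value in triples:
        dense[row_index[obs_id]][col_index[var_id]] = value
    return row_labels, col_labels, dense


# Made-up example: two stored values in a 2 x 2 matrix.
row_labels, col_labels, dense = triples_to_dense(
    [("c1", "g0", 2.0), ("c0", "g1", 7.0)]
)
```

A dense table grows with the product of the two dimensions, so this form is only practical for small slices.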
- tiledbsoma.util_ann.describe_ann_file(input_path: Union[str, pathlib.Path], show_summary: bool = True, show_types: bool = False, show_data: bool = False) None ¶
This is an anndata-describer that goes a bit beyond what h5ls does for us. In particular, it shows us that for one HDF5 file we have anndata.X being of type numpy.ndarray while for another HDF5 file we have anndata.X being of type scipy.sparse.csr.csr_matrix. This is crucial information for building I/O logic that accepts a diversity of anndata HDF5 files.
- tiledbsoma.util_tiledb.show_soma_schemas(soma_uri: str, ctx: Optional[tiledb.ctx.Ctx] = None) None ¶
Show some summary information about an ingested TileDB Single-Cell Group. This tool goes a bit beyond print(tiledb.group.Group(soma_uri)) by also revealing array schema. Additionally, by employing encoded domain-specific knowledge, it traverses items in the familiar order X, obs, var, etc. rather than using the general-purpose tiledb-group-display function.
- tiledbsoma.util_tiledb.show_tiledb_group_array_schemas(uri: str, ctx: Optional[tiledb.ctx.Ctx] = None) None ¶
Recursively show array schemas within a TileDB Group. This function is not specific to single-cell matrix-API data, and won’t necessarily traverse items in a familiar application-specific order.