Tutorial: SOMA Objects¶
In this notebook, we’ll go through the various objects available as part of the SOMA API. The dataset used is from Peripheral Blood Mononuclear Cells (PBMC), which is freely available from 10X Genomics.
We’ll start by importing tiledbsoma.
[1]:
import tiledbsoma
Experiment¶
An Experiment is a class that represents a single-cell experiment. It always contains two objects:

1. obs: a DataFrame with primary annotations on the observation axis.
2. ms: a Collection of measurements.
Let’s unpack and open the experiment:
[2]:
import tarfile
import tempfile
dense_uri = tempfile.mktemp()
with tarfile.open("data/pbmc3k-dense.tgz") as handle:
handle.extractall(dense_uri)
experiment = tiledbsoma.Experiment.open(dense_uri)
Each object within the experiment can be opened like this:
[3]:
experiment.ms
[3]:
<Collection 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmp_hz2hc6p/ms' (open for 'r') (2 items)
'RNA': 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmp_hz2hc6p/ms/RNA' (unopened)
'raw': 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmp_hz2hc6p/ms/raw' (unopened)>
[4]:
experiment.obs
[4]:
<DataFrame 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmp_hz2hc6p/obs' (open for 'r')>
Note that by default an experiment is opened lazily, i.e. only the minimal requested objects are opened.
Also, opening an object doesn’t mean it is fetched entirely into memory: it only returns a handle to the data on disk.
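SOMA objects are also context managers, so handles can be closed deterministically when you’re done with them. Here is a minimal sketch (not executed in this notebook), using the experiment we just opened:

[ ]:
# Sketch: open the experiment with a context manager so the underlying
# handles are released on exit, instead of waiting for garbage collection.
with tiledbsoma.Experiment.open(dense_uri) as exp:
    print(exp.obs)  # just a handle to obs; no table data has been read yet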
DataFrame¶
A DataFrame is a multi-column table with a user-defined schema. The schema is expressed as an Arrow Schema and defines the column names and value types.

As an example, let’s take a look at obs, which is represented as a SOMA DataFrame:

[5]:
obs = experiment.obs
The domain is the bounds within which data can be read or written: currently, soma_joinid values from 0 to 2637, inclusive. The domain can be resized later, as we’ll see in the notebook that covers TileDB-SOMA’s append mode.
[8]:
obs.domain
[8]:
((0, 2637),)
We can inspect the schema using .schema:
[9]:
obs.schema
[9]:
soma_joinid: int64 not null
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: dictionary<values=string, indices=int32, ordered=0>
Note that soma_joinid is a field that exists in every DataFrame and acts as a join key for other objects, such as SparseNDArray (more on this later).
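For instance, here is a sketch (not executed here) of fetching the join keys for a filtered set of cells; these soma_joinid values identify the same cells along the observation axis of other objects, such as the X matrices:

[ ]:
# Sketch: soma_joinid values act as join keys across the experiment.
# Fetch them for cells matching a filter; they can later be used to slice X.
ids = (
    obs.read(column_names=["soma_joinid", "n_genes"], value_filter="n_genes > 1500")
    .concat()["soma_joinid"]
    .to_numpy()
)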
When a DataFrame is accessed, only metadata is retrieved, not the actual data. This is important since a DataFrame can be very large and might not fit in memory.

To materialize the dataframe (or a subset of it) in memory, we call .read().

If the dataframe is small, we can convert it to an in-memory Pandas object like this:
[10]:
obs.read().concat().to_pandas()
[10]:
| | soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain |
|---|---|---|---|---|---|---|
| 0 | 0 | AAACATACAACCAC-1 | 781 | 0.030178 | 2419.0 | CD4 T cells |
| 1 | 1 | AAACATTGAGCTAC-1 | 1352 | 0.037936 | 4903.0 | B cells |
| 2 | 2 | AAACATTGATCAGC-1 | 1131 | 0.008897 | 3147.0 | CD4 T cells |
| 3 | 3 | AAACCGTGCTTCCG-1 | 960 | 0.017431 | 2639.0 | CD14+ Monocytes |
| 4 | 4 | AAACCGTGTATGCG-1 | 522 | 0.012245 | 980.0 | NK cells |
| ... | ... | ... | ... | ... | ... | ... |
| 2633 | 2633 | TTTCGAACTCTCAT-1 | 1155 | 0.021104 | 3459.0 | CD14+ Monocytes |
| 2634 | 2634 | TTTCTACTGAGGCA-1 | 1227 | 0.009294 | 3443.0 | B cells |
| 2635 | 2635 | TTTCTACTTCCTCG-1 | 622 | 0.021971 | 1684.0 | B cells |
| 2636 | 2636 | TTTGCATGAGAGGC-1 | 454 | 0.020548 | 1022.0 | B cells |
| 2637 | 2637 | TTTGCATGCCTCAC-1 | 724 | 0.008065 | 1984.0 | CD4 T cells |

2638 rows × 6 columns
Here, read() returns an iterator, concat() materializes all rows into memory, and to_pandas() returns a Pandas view of the dataframe.
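For dataframes that don’t fit in memory, the read iterator can instead be consumed batch by batch; each batch is a pyarrow.Table. A sketch (not executed here):

[ ]:
# Sketch: stream the dataframe in chunks instead of materializing it all at once.
for batch in obs.read(column_names=["soma_joinid", "n_genes"]):
    print(batch.num_rows)  # process one Arrow table at a time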
If the dataframe is bigger, we can select just a subset of it before materializing. Only the requested subset is retrieved from disk into memory, so very large dataframes can be queried this way. In this example we select rows with soma_joinid from 0 to 10 (SOMA slices are inclusive of both endpoints, so 11 rows are returned):
[11]:
obs.read((slice(0,10),)).concat().to_pandas()
[11]:
| | soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain |
|---|---|---|---|---|---|---|
| 0 | 0 | AAACATACAACCAC-1 | 781 | 0.030178 | 2419.0 | CD4 T cells |
| 1 | 1 | AAACATTGAGCTAC-1 | 1352 | 0.037936 | 4903.0 | B cells |
| 2 | 2 | AAACATTGATCAGC-1 | 1131 | 0.008897 | 3147.0 | CD4 T cells |
| 3 | 3 | AAACCGTGCTTCCG-1 | 960 | 0.017431 | 2639.0 | CD14+ Monocytes |
| 4 | 4 | AAACCGTGTATGCG-1 | 522 | 0.012245 | 980.0 | NK cells |
| 5 | 5 | AAACGCACTGGTAC-1 | 782 | 0.016644 | 2163.0 | CD8 T cells |
| 6 | 6 | AAACGCTGACCAGT-1 | 783 | 0.038161 | 2175.0 | CD8 T cells |
| 7 | 7 | AAACGCTGGTTCTT-1 | 790 | 0.030973 | 2260.0 | CD8 T cells |
| 8 | 8 | AAACGCTGTAGCCA-1 | 533 | 0.011765 | 1275.0 | CD4 T cells |
| 9 | 9 | AAACGCTGTTTCTG-1 | 550 | 0.029012 | 1103.0 | FCGR3A+ Monocytes |
| 10 | 10 | AAACTTGAAAAACG-1 | 1116 | 0.026316 | 3914.0 | B cells |
We can also select a subset of the columns:
[12]:
obs.read((slice(0, 10),), column_names=["obs_id", "n_genes"]).concat().to_pandas()
[12]:
| | obs_id | n_genes |
|---|---|---|
| 0 | AAACATACAACCAC-1 | 781 |
| 1 | AAACATTGAGCTAC-1 | 1352 |
| 2 | AAACATTGATCAGC-1 | 1131 |
| 3 | AAACCGTGCTTCCG-1 | 960 |
| 4 | AAACCGTGTATGCG-1 | 522 |
| 5 | AAACGCACTGGTAC-1 | 782 |
| 6 | AAACGCTGACCAGT-1 | 783 |
| 7 | AAACGCTGGTTCTT-1 | 790 |
| 8 | AAACGCTGTAGCCA-1 | 533 |
| 9 | AAACGCTGTTTCTG-1 | 550 |
| 10 | AAACTTGAAAAACG-1 | 1116 |
Finally, we can use value_filter to retrieve a filtered subset of rows that match a given condition:
[13]:
obs.read((slice(None),), value_filter="n_genes > 1500").concat().to_pandas()
[13]:
| | soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain |
|---|---|---|---|---|---|---|
| 0 | 26 | AAATCAACCCTATT-1 | 1545 | 0.024313 | 5676.0 | CD4 T cells |
| 1 | 59 | AACCTACTGTGAGG-1 | 1652 | 0.015839 | 5682.0 | CD14+ Monocytes |
| 2 | 107 | AAGCACTGGTTCTT-1 | 1717 | 0.023566 | 6153.0 | B cells |
| 3 | 109 | AAGCCATGAACTGC-1 | 1877 | 0.014015 | 7064.0 | Dendritic cells |
| 4 | 247 | ACCCAGCTGTTAGC-1 | 1547 | 0.020600 | 5534.0 | CD14+ Monocytes |
| ... | ... | ... | ... | ... | ... | ... |
| 70 | 2508 | TTACTCGACGCAAT-1 | 1603 | 0.024851 | 5030.0 | Dendritic cells |
| 71 | 2530 | TTATGGCTTATGGC-1 | 1783 | 0.022064 | 6164.0 | Dendritic cells |
| 72 | 2597 | TTGAGGACTACGCA-1 | 1794 | 0.024440 | 6342.0 | Dendritic cells |
| 73 | 2623 | TTTAGCTGTACTCT-1 | 1567 | 0.021160 | 5671.0 | Dendritic cells |
| 74 | 2632 | TTTCGAACACCTGA-1 | 1544 | 0.013019 | 4455.0 | Dendritic cells |

75 rows × 6 columns
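A value_filter can also combine several conditions, and it can be paired with column_names to keep the result small. A sketch (not executed here):

[ ]:
# Sketch: combine conditions in the filter expression and restrict the columns returned.
obs.read(
    column_names=["obs_id", "n_genes", "n_counts"],
    value_filter="n_genes > 1500 and n_counts < 6000",
).concat().to_pandas()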
Collection¶
A Collection is a persistent container of named SOMA objects, stored as a mapping from string keys to SOMA object values.

The ms member of an Experiment is implemented as a Collection. Let’s take a look:
[14]:
experiment.ms
[14]:
<Collection 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmp_hz2hc6p/ms' (open for 'r') (2 items)
'RNA': 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmp_hz2hc6p/ms/RNA' (unopened)
'raw': 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmp_hz2hc6p/ms/raw' (unopened)>
In this case, we have two members: RNA and raw. They can be accessed as if they were dict members:
[15]:
experiment.ms["raw"]
[15]:
<Measurement 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmp_hz2hc6p/ms/raw' (open for 'r') (2 items)
'X': 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmp_hz2hc6p/ms/raw/X' (unopened)
'var': 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmp_hz2hc6p/ms/raw/var' (unopened)>
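Since a Collection is mapping-like, it also supports the usual dict-style introspection. A sketch (not executed here):

[ ]:
# Sketch: list member names and test membership, just like a dict.
print(list(experiment.ms.keys()))  # e.g. ['RNA', 'raw']
print("RNA" in experiment.ms)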
DenseNDArray¶
A DenseNDArray is a dense, N-dimensional array, with offset (zero-based) integer indexing on each dimension.

A DenseNDArray has a user-defined schema, which includes:

- the element type, expressed as an Arrow type, indicating the type of data contained within the array, and
- the shape of the array, i.e., the number of dimensions and the length of each dimension.
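For reference, here is a minimal sketch (not executed here, using a hypothetical scratch URI) of how the element type and shape are declared when a new DenseNDArray is created:

[ ]:
# Sketch (hypothetical array, unrelated to the experiment): the element type
# and shape are fixed when the array is created.
import numpy as np
import pyarrow as pa
import tempfile

scratch_uri = tempfile.mktemp()
with tiledbsoma.DenseNDArray.create(scratch_uri, type=pa.float32(), shape=(4, 3)) as arr:
    # SOMA coordinates are inclusive of both endpoints, so this covers the whole 4x3 array.
    arr.write(
        (slice(0, 3), slice(0, 2)),
        pa.Tensor.from_numpy(np.zeros((4, 3), dtype=np.float32)),
    )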
In a SOMA single-cell experiment, the cell-by-gene matrix X is typically represented either by a DenseNDArray or a SparseNDArray. Let’s take a look at our example:
[16]:
X = experiment["ms"]["RNA"].X
X
[16]:
<Collection 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmp_hz2hc6p/ms/RNA/X' (open for 'r') (1 item)
'data': 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmp_hz2hc6p/ms/RNA/X/data' (unopened)>
Within the experiment, X is a Collection, and the data can be accessed using ["data"]:
[17]:
X = X["data"]
X
[17]:
<DenseNDArray 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmp_hz2hc6p/ms/RNA/X/data' (open for 'r')>
We can inspect the DenseNDArray and get useful information by using .schema:
[18]:
X.schema
[18]:
soma_dim_0: int64 not null
soma_dim_1: int64 not null
soma_data: float not null
In this case, we see there are two dimensions and the data is of type float.

As with the domain of obs, this array has a shape: the boundary within which data can be read or written. The shape can be resized later, as we’ll see in the notebook on TileDB-SOMA’s append mode.
[19]:
X.shape
[19]:
(2638, 1838)
Similarly to DataFrame, when opening a DenseNDArray only metadata is fetched; the array itself is not loaded into memory.

We can convert the matrix into a pyarrow.Tensor using .read():
[21]:
X.read()
[21]:
<pyarrow.Tensor>
type: float
shape: (2638, 1838)
strides: (7352, 4)
From here, we can convert it further to a numpy.ndarray:
[22]:
X.read().to_numpy()
[22]:
array([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,
-0.2090951 , -0.5312034 ],
[-0.21458222, -0.37265295, -0.05480444, ..., -0.26684406,
-0.31314576, -0.5966544 ],
[-0.37688747, -0.2950843 , -0.0575275 , ..., -0.15865596,
-0.17087643, 1.379 ],
...,
[-0.2070895 , -0.250464 , -0.046397 , ..., -0.05114426,
-0.16106427, 2.0414972 ],
[-0.19032837, -0.2263336 , -0.04399938, ..., -0.00591773,
-0.13521303, -0.48211113],
[-0.33378917, -0.2535875 , -0.05271563, ..., -0.07842438,
-0.13032717, -0.4713379 ]], dtype=float32)
This will only work for small matrices, since the numpy array needs to fit entirely in memory.

We can retrieve a subset of the matrix by passing coordinates to .read(). Here we retrieve only the first 10 rows of the matrix (rows 0 through 9; SOMA slices are inclusive of both endpoints):
[23]:
sliced_X = X.read((slice(0,9),)).to_numpy()
sliced_X
[23]:
array([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,
-0.2090951 , -0.5312034 ],
[-0.21458222, -0.37265295, -0.05480444, ..., -0.26684406,
-0.31314576, -0.5966544 ],
[-0.37688747, -0.2950843 , -0.0575275 , ..., -0.15865596,
-0.17087643, 1.379 ],
...,
[-0.15813293, -0.27562705, -0.04569191, ..., -0.08687588,
-0.2062048 , 1.6869122 ],
[ 4.861763 , -0.23054866, -0.04826924, ..., -0.02755091,
-0.11788268, -0.4664504 ],
[-0.12453113, -0.23373608, -0.04131226, ..., -0.00758654,
-0.16255915, -0.50339466]], dtype=float32)
[24]:
sliced_X.shape
[24]:
(10, 1838)
Note that a DenseNDArray is always indexed, on each dimension, using zero-based integers. If a dimension matches an axis of another object in the experiment, the soma_joinid column of that object can be used to retrieve the correct slice.

In the following example, we will get the values of X for the gene tagged as ICOSLG. This involves reading the var DataFrame with a value_filter, retrieving the soma_joinid for that gene, and passing it as a coordinate to X.read:
[25]:
var = experiment.ms["RNA"].var
idx = var.read(value_filter="var_id == 'ICOSLG'").concat()["soma_joinid"].to_numpy()
X.read((None, int(idx[0]))).to_numpy()
[25]:
array([[-0.12167774],
[-0.05866209],
[-0.07043106],
...,
[-0.1320983 ],
[-0.14978862],
[-0.10383061]], dtype=float32)
SparseNDArray¶
A SparseNDArray is a sparse, N-dimensional array, with offset (zero-based) integer indexing on each dimension.

A SparseNDArray has a user-defined schema, which includes:

- the element type, expressed as an Arrow type, indicating the type of data contained within the array, and
- the shape of the array, i.e., the number of dimensions and the length of each dimension.

A SparseNDArray is functionally similar to a DenseNDArray, except that only elements with a nonzero value are actually stored. Elements that are not explicitly stored are assumed to be zero.
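For reference, here is a minimal sketch (not executed here, using a hypothetical scratch URI) of writing a few explicitly stored entries into a new SparseNDArray; only the coordinates and values that are provided are stored:

[ ]:
# Sketch (hypothetical array, unrelated to the experiment): write a handful of
# stored entries as an Arrow table of coordinates and values.
import pyarrow as pa
import tempfile

sparse_scratch_uri = tempfile.mktemp()
with tiledbsoma.SparseNDArray.create(sparse_scratch_uri, type=pa.float32(), shape=(4, 3)) as arr:
    arr.write(
        pa.Table.from_pydict(
            {
                "soma_dim_0": [0, 1, 3],
                "soma_dim_1": [2, 0, 1],
                "soma_data": pa.array([1.0, 2.0, 3.0], type=pa.float32()),
            }
        )
    )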
As an example, we will load a version of pbmc3k that has been stored using a SparseNDArray:
[26]:
import tarfile
import tempfile
sparse_uri = tempfile.mktemp()
with tarfile.open("data/pbmc3k-sparse.tgz") as handle:
handle.extractall(sparse_uri)
experiment = tiledbsoma.Experiment.open(sparse_uri)
Let’s take a look at the schema:
[27]:
X = experiment.ms["RNA"].X["data"]
X.schema
[27]:
soma_dim_0: int64 not null
soma_dim_1: int64 not null
soma_data: float not null
This is the same as the DenseNDArray version, which makes sense since it’s still a 2-dimensional matrix holding float data.
Let’s look at the shape:
[28]:
X.shape
[28]:
(2638, 1838)
This too has a shape: the boundary within which data can be read or written. This can be resized later as we’ll see in the notebook on TileDB-SOMA’s append mode.
We can get the number of nonzero elements by calling .nnz:
[29]:
X.nnz
[29]:
4848644
In order to work with a SparseNDArray, we call .read():
[30]:
X.read()
[30]:
<tiledbsoma._sparse_nd_array.SparseNDArrayRead at 0x11e824f10>
This returns a SparseNDArrayRead object, which provides iterators over the data in several formats. For instance, we can do:
[31]:
tensor = X.read().coos().concat()
This returns an Arrow SparseCOOTensor that can be used to access the array, or to convert it further to different formats. For instance:
[32]:
tensor.to_scipy()
[32]:
<COOrdinate sparse matrix of dtype 'float32'
with 4848644 stored elements and shape (2638, 1838)>
Here, .to_scipy() converts the tensor to a scipy.sparse.coo_matrix.
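The read handle also exposes a tables() iterator, which yields the stored entries as Arrow tables with soma_dim_0, soma_dim_1 and soma_data columns, matching the schema above. A sketch (not executed here):

[ ]:
# Sketch: stream the stored entries as Arrow tables instead of COO tensors.
for tbl in X.read().tables():
    print(tbl.column_names, tbl.num_rows)
    break  # just peek at the first batch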
Similarly to DenseNDArray, we can call .read() with a slice to obtain only a subset of the matrix. As an example:
[33]:
sliced_X = X.read((slice(0,9),)).coos().concat().to_scipy()
sliced_X
[33]:
<COOrdinate sparse matrix of dtype 'float32'
with 18380 stored elements and shape (2638, 1838)>
Let’s verify that the slice is correct. To do that, we can call nonzero() on the scipy.sparse.coo_matrix to obtain the coordinates of the nonzero items, and look at the coordinates along the first dimension:
[34]:
sliced_X.nonzero()[0]
[34]:
array([0, 0, 0, ..., 9, 9, 9], dtype=int32)
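As a final sketch (not executed here), we can assert that every row coordinate in the slice falls within the requested range 0 through 9:

[ ]:
# Sketch: all stored entries of the slice lie within the requested rows 0 through 9.
assert sliced_X.nonzero()[0].max() <= 9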