Tutorial: Reading SOMA Objects
In this notebook we’ll learn how to read from various SOMA objects. Familiarity with SOMA objects is assumed, so we recommend going through Tutorial: SOMA Objects first.
This implementation of SOMA relies on TileDB, a storage engine that allows working with large datasets without fully loading them into memory. Data can be read either from local disk or from a remote source, such as an S3 bucket.
The core feature of SOMA is reading subsets of the data using slices: only the required portion of the data is read from disk or network. SOMA uses Apache Arrow as its intermediate in-memory format. From there, slices can be converted into more familiar formats, such as a scipy.sparse matrix or a NumPy ndarray. Consult the Python bindings for Apache Arrow documentation for more information.
In this notebook, we will use the Peripheral Blood Mononuclear Cells (PBMC) dataset, focusing on reads from its obs DataFrame and its X SparseNDArray. The dataset is small enough to fit in memory, but we’ll concentrate on subset-based operations that work on larger datasets as well.
Reading a DataFrame
Introduction
[1]:
import tiledbsoma
[2]:
import tarfile
import tempfile
# mkdtemp creates the directory securely; tempfile.mktemp is deprecated
sparse_uri = tempfile.mkdtemp()
with tarfile.open("data/pbmc3k-sparse.tgz") as handle:
    handle.extractall(sparse_uri)
experiment = tiledbsoma.Experiment.open(sparse_uri)
All read operations are performed via the .read() method. For a DataFrame, we then call .concat() to obtain a PyArrow Table:
[3]:
obs = experiment.obs
table = obs.read().concat()
table
[3]:
pyarrow.Table
soma_joinid: int64
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: dictionary<values=string, indices=int32, ordered=0>
----
soma_joinid: [[0,1,2,3,4,...,2633,2634,2635,2636,2637]]
obs_id: [["AAACATACAACCAC-1","AAACATTGAGCTAC-1","AAACATTGATCAGC-1","AAACCGTGCTTCCG-1","AAACCGTGTATGCG-1",...,"TTTCGAACTCTCAT-1","TTTCTACTGAGGCA-1","TTTCTACTTCCTCG-1","TTTGCATGAGAGGC-1","TTTGCATGCCTCAC-1"]]
n_genes: [[781,1352,1131,960,522,...,1155,1227,622,454,724]]
percent_mito: [[0.030177759,0.037935957,0.008897362,0.017430846,0.012244898,...,0.021104366,0.00929422,0.021971496,0.020547945,0.008064516]]
n_counts: [[2419,4903,3147,2639,980,...,3459,3443,1684,1022,1984]]
louvain: [ -- dictionary:
["CD4 T cells","CD14+ Monocytes","B cells","CD8 T cells","NK cells","FCGR3A+ Monocytes","Dendritic cells","Megakaryocytes"] -- indices:
[0,2,0,1,4,...,1,2,2,2,0]]
From here, we can directly use any of the PyArrow Table methods, for instance:
[4]:
table.sort_by([("n_genes", "descending")])
[4]:
pyarrow.Table
soma_joinid: int64
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: dictionary<values=string, indices=int32, ordered=0>
----
soma_joinid: [[270,1163,1891,926,277,...,2186,1522,662,1288,1840]]
obs_id: [["ACGAACTGGCTATG-1","CGATACGACAGGAG-1","GGGCCAACCTTGGA-1","CAGGTTGAGGATCT-1","ACGAGGGACAGGAG-1",...,"TAGTCTTGGCTGTA-1","GACGCTCTCTCTCG-1","ATCTCAACCTCGAA-1","CTAATAGAGCTATG-1","GGCATATGGGGAGT-1"]]
n_genes: [[2455,2033,2020,2000,1997,...,270,267,246,239,212]]
percent_mito: [[0.015774649,0.022166021,0.010576352,0.026962927,0.014631685,...,0,0.032258064,0,0.0016666667,0.012173913]]
n_counts: [[8875,6722,8415,8011,7928,...,652,682,609,600,575]]
louvain: [ -- dictionary:
["CD4 T cells","CD14+ Monocytes","B cells","CD8 T cells","NK cells","FCGR3A+ Monocytes","Dendritic cells","Megakaryocytes"] -- indices:
[7,0,6,2,6,...,0,7,0,0,7]]
Alternatively, we can convert the table to a different format, such as a Pandas DataFrame:
[5]:
table.to_pandas()
[5]:
soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain | |
---|---|---|---|---|---|---|
0 | 0 | AAACATACAACCAC-1 | 781 | 0.030178 | 2419.0 | CD4 T cells |
1 | 1 | AAACATTGAGCTAC-1 | 1352 | 0.037936 | 4903.0 | B cells |
2 | 2 | AAACATTGATCAGC-1 | 1131 | 0.008897 | 3147.0 | CD4 T cells |
3 | 3 | AAACCGTGCTTCCG-1 | 960 | 0.017431 | 2639.0 | CD14+ Monocytes |
4 | 4 | AAACCGTGTATGCG-1 | 522 | 0.012245 | 980.0 | NK cells |
... | ... | ... | ... | ... | ... | ... |
2633 | 2633 | TTTCGAACTCTCAT-1 | 1155 | 0.021104 | 3459.0 | CD14+ Monocytes |
2634 | 2634 | TTTCTACTGAGGCA-1 | 1227 | 0.009294 | 3443.0 | B cells |
2635 | 2635 | TTTCTACTTCCTCG-1 | 622 | 0.021971 | 1684.0 | B cells |
2636 | 2636 | TTTGCATGAGAGGC-1 | 454 | 0.020548 | 1022.0 | B cells |
2637 | 2637 | TTTGCATGCCTCAC-1 | 724 | 0.008065 | 1984.0 | CD4 T cells |
2638 rows × 6 columns
Reading slices of data
As previously mentioned, the core feature of SOMA is reading slices of the data without fetching the whole dataset into memory. To do that, the .read() method supports a coords parameter for slicing.
Before we do that, let’s take a look at the schema of the obs dataframe:
[6]:
obs.schema
[6]:
soma_joinid: int64 not null
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: dictionary<values=string, indices=int32, ordered=0>
And also its domain:
[7]:
obs.domain
[7]:
((0, 2637),)
With a SOMA DataFrame, you can only slice across an indexed column, so let’s look at the indexed columns:
[8]:
obs.index_column_names
[8]:
('soma_joinid',)
In this case our index consists of just soma_joinid, an integer column that can be used to join other SOMA objects in the same experiment.
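Since soma_joinid is shared across objects in an experiment, row i of obs lines up with row i of X. As a toy, plain-Python sketch of that join idea (made-up values, not the SOMA API):

```python
# Toy illustration of how soma_joinid links tables in an experiment.
# obs rows and X entries share the same integer key.
obs_rows = {
    0: {"obs_id": "AAACATACAACCAC-1", "louvain": "CD4 T cells"},
    1: {"obs_id": "AAACATTGAGCTAC-1", "louvain": "B cells"},
}
# Sparse X entries keyed by (obs soma_joinid, var soma_joinid).
x_entries = {(0, 2): 1.0, (1, 2): 3.0, (1, 5): 2.0}

# "Join": gather the X values for every cell labeled "B cells".
b_cell_ids = {jid for jid, row in obs_rows.items() if row["louvain"] == "B cells"}
b_cell_values = {k: v for k, v in x_entries.items() if k[0] in b_cell_ids}
print(b_cell_values)  # entries whose first coordinate is a B-cell joinid
```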
Let’s look at a few ways to slice the dataframe.
Select a single row
[9]:
obs.read([[0]]).concat().to_pandas()
[9]:
soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain | |
---|---|---|---|---|---|---|
0 | 0 | AAACATACAACCAC-1 | 781 | 0.030178 | 2419.0 | CD4 T cells |
Select multiple, non-contiguous rows
[10]:
obs.read([[2, 5]]).concat().to_pandas()
[10]:
soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain | |
---|---|---|---|---|---|---|
0 | 2 | AAACATTGATCAGC-1 | 1131 | 0.008897 | 3147.0 | CD4 T cells |
1 | 5 | AAACGCACTGGTAC-1 | 782 | 0.016644 | 2163.0 | CD8 T cells |
Select a slice of rows
[11]:
obs.read([slice(0, 5)]).concat().to_pandas()
[11]:
soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain | |
---|---|---|---|---|---|---|
0 | 0 | AAACATACAACCAC-1 | 781 | 0.030178 | 2419.0 | CD4 T cells |
1 | 1 | AAACATTGAGCTAC-1 | 1352 | 0.037936 | 4903.0 | B cells |
2 | 2 | AAACATTGATCAGC-1 | 1131 | 0.008897 | 3147.0 | CD4 T cells |
3 | 3 | AAACCGTGCTTCCG-1 | 960 | 0.017431 | 2639.0 | CD14+ Monocytes |
4 | 4 | AAACCGTGTATGCG-1 | 522 | 0.012245 | 980.0 | NK cells |
5 | 5 | AAACGCACTGGTAC-1 | 782 | 0.016644 | 2163.0 | CD8 T cells |
Select a subset of columns only
[12]:
obs.read([slice(0, 5)], column_names=["obs_id", "louvain"]).concat().to_pandas()
[12]:
obs_id | louvain | |
---|---|---|
0 | AAACATACAACCAC-1 | CD4 T cells |
1 | AAACATTGAGCTAC-1 | B cells |
2 | AAACATTGATCAGC-1 | CD4 T cells |
3 | AAACCGTGCTTCCG-1 | CD14+ Monocytes |
4 | AAACCGTGTATGCG-1 | NK cells |
5 | AAACGCACTGGTAC-1 | CD8 T cells |
Filter data using complex queries
SOMA also allows filtering data with more complex queries. For a more detailed reference, take a look at the query condition source code.
Here are a few examples:
Filter all cells with a Louvain categorization of “B cells”
[13]:
obs.read(value_filter="louvain == 'B cells'").concat().to_pandas()
[13]:
soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain | |
---|---|---|---|---|---|---|
0 | 1 | AAACATTGAGCTAC-1 | 1352 | 0.037936 | 4903.0 | B cells |
1 | 10 | AAACTTGAAAAACG-1 | 1116 | 0.026316 | 3914.0 | B cells |
2 | 18 | AAAGGCCTGTCTAG-1 | 1446 | 0.015283 | 4973.0 | B cells |
3 | 19 | AAAGTTTGATCACG-1 | 446 | 0.034700 | 1268.0 | B cells |
4 | 20 | AAAGTTTGGGGTGA-1 | 1020 | 0.025907 | 3281.0 | B cells |
... | ... | ... | ... | ... | ... | ... |
337 | 2628 | TTTCAGTGTCACGA-1 | 700 | 0.034314 | 1632.0 | B cells |
338 | 2630 | TTTCAGTGTGCAGT-1 | 637 | 0.018925 | 1321.0 | B cells |
339 | 2634 | TTTCTACTGAGGCA-1 | 1227 | 0.009294 | 3443.0 | B cells |
340 | 2635 | TTTCTACTTCCTCG-1 | 622 | 0.021971 | 1684.0 | B cells |
341 | 2636 | TTTGCATGAGAGGC-1 | 454 | 0.020548 | 1022.0 | B cells |
342 rows × 6 columns
Filter all cells with a Louvain categorization of either “CD4 T cells” or “CD8 T cells”
[14]:
obs.read(value_filter="(louvain == 'CD4 T cells') or (louvain == 'CD8 T cells')").concat().to_pandas()
[14]:
soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain | |
---|---|---|---|---|---|---|
0 | 0 | AAACATACAACCAC-1 | 781 | 0.030178 | 2419.0 | CD4 T cells |
1 | 2 | AAACATTGATCAGC-1 | 1131 | 0.008897 | 3147.0 | CD4 T cells |
2 | 5 | AAACGCACTGGTAC-1 | 782 | 0.016644 | 2163.0 | CD8 T cells |
3 | 6 | AAACGCTGACCAGT-1 | 783 | 0.038161 | 2175.0 | CD8 T cells |
4 | 7 | AAACGCTGGTTCTT-1 | 790 | 0.030973 | 2260.0 | CD8 T cells |
... | ... | ... | ... | ... | ... | ... |
1455 | 2621 | TTTAGCTGATACCG-1 | 887 | 0.022876 | 2754.0 | CD4 T cells |
1456 | 2626 | TTTCACGAGGTTCA-1 | 721 | 0.013261 | 2036.0 | CD4 T cells |
1457 | 2627 | TTTCAGTGGAAGGC-1 | 692 | 0.015169 | 1780.0 | CD8 T cells |
1458 | 2631 | TTTCCAGAGGTGAG-1 | 873 | 0.006859 | 2187.0 | CD4 T cells |
1459 | 2637 | TTTGCATGCCTCAC-1 | 724 | 0.008065 | 1984.0 | CD4 T cells |
1460 rows × 6 columns
Filter all cells with a Louvain categorization of “CD4 T cells” and more than 1500 genes
[15]:
obs.read(value_filter="(louvain == 'CD4 T cells') and (n_genes > 1500)").concat().to_pandas()
[15]:
soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain | |
---|---|---|---|---|---|---|
0 | 26 | AAATCAACCCTATT-1 | 1545 | 0.024313 | 5676.0 | CD4 T cells |
1 | 357 | ACTCTCCTGCATAC-1 | 1750 | 0.017436 | 5850.0 | CD4 T cells |
2 | 473 | AGCTGCCTTTCATC-1 | 1703 | 0.029547 | 5212.0 | CD4 T cells |
3 | 945 | CATACTTGGGTTAC-1 | 1938 | 0.023580 | 7167.0 | CD4 T cells |
4 | 1163 | CGATACGACAGGAG-1 | 2033 | 0.022166 | 6722.0 | CD4 T cells |
5 | 1320 | CTATACTGTTCGTT-1 | 1543 | 0.012395 | 4760.0 | CD4 T cells |
6 | 1548 | GAGCATACTTTGCT-1 | 1753 | 0.016739 | 6691.0 | CD4 T cells |
7 | 1993 | GTGATGACAAGTGA-1 | 1819 | 0.021172 | 6329.0 | CD4 T cells |
8 | 2313 | TCGGACCTGTACAC-1 | 1567 | 0.014288 | 5599.0 | CD4 T cells |
9 | 2365 | TGAGACACAAGGTA-1 | 1549 | 0.013242 | 5135.0 | CD4 T cells |
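Conceptually, a value_filter is a boolean predicate evaluated per row; the engine applies it during the read rather than after loading everything. The three filters above behave like these plain-Python list comprehensions over toy rows (an illustration of the semantics, not how SOMA executes them):

```python
# Toy stand-in for a few obs rows.
rows = [
    {"obs_id": "AAACATACAACCAC-1", "n_genes": 781, "louvain": "CD4 T cells"},
    {"obs_id": "AAACATTGAGCTAC-1", "n_genes": 1352, "louvain": "B cells"},
    {"obs_id": "AAACGCACTGGTAC-1", "n_genes": 782, "louvain": "CD8 T cells"},
    {"obs_id": "CATACTTGGGTTAC-1", "n_genes": 1938, "louvain": "CD4 T cells"},
]

# value_filter="louvain == 'B cells'"
b_cells = [r for r in rows if r["louvain"] == "B cells"]
# value_filter="(louvain == 'CD4 T cells') or (louvain == 'CD8 T cells')"
t_cells = [r for r in rows if r["louvain"] in ("CD4 T cells", "CD8 T cells")]
# value_filter="(louvain == 'CD4 T cells') and (n_genes > 1500)"
big_cd4 = [r for r in rows if r["louvain"] == "CD4 T cells" and r["n_genes"] > 1500]

print(len(b_cells), len(t_cells), len(big_cd4))  # 1 3 1
```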
Reading a SparseNDArray
For a SparseNDArray, let’s consider the X matrix:
[16]:
X = experiment.ms["RNA"].X["data"]
X
[16]:
<SparseNDArray 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmpfe5a_4au/ms/RNA/X/data' (open for 'r')>
As with DataFrame, we need to use the .read() method:
[17]:
X.read()
[17]:
<tiledbsoma._sparse_nd_array.SparseNDArrayRead at 0x119ba4ad0>
In this case, we have two options. Let’s start by converting this into an Arrow SparseCOOTensor:
[18]:
tensor = X.read().coos().concat()
tensor
[18]:
<pyarrow.SparseCOOTensor>
type: float
shape: (2638, 1838)
In this example, we obtain a 2-dimensional tensor (a matrix). Note that shape here indicates the capacity of the tensor, rather than the actual size. By default, a SparseNDArray is created with a much larger capacity to accommodate further writes. Since this is a read scenario and the shape of the matrix is known, we can pass a shape to .coos() so that the tensor is sized accordingly:
[19]:
n_obs = len(obs)
n_var = len(experiment.ms["RNA"].var)
tensor = X.read().coos((n_obs, n_var)).concat()
tensor
[19]:
<pyarrow.SparseCOOTensor>
type: float
shape: (2638, 1838)
We can convert this to a scipy.sparse.coo_matrix:
[20]:
tensor.to_scipy()
[20]:
<COOrdinate sparse matrix of dtype 'float32'
with 4848644 stored elements and shape (2638, 1838)>
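The difference between a tensor’s shape (its capacity) and its stored elements is easiest to see in the COO layout itself: parallel arrays of row indices, column indices, and values, plus a declared shape. A minimal stdlib sketch:

```python
# A COO matrix is (row, col, value) triplets plus a declared shape.
coo_rows = [0, 0, 2]
coo_cols = [1, 3, 2]
coo_vals = [5.0, 7.0, 9.0]
shape = (2638, 1838)  # capacity: the legal index range, not how much is stored

nnz = len(coo_vals)  # stored elements: 3
print(nnz, shape)

# Densifying a single row stays cheap even when the shape is large:
row0 = [0.0] * shape[1]
for r, c, v in zip(coo_rows, coo_cols, coo_vals):
    if r == 0:
        row0[c] = v
print(row0[1], row0[3])  # 5.0 7.0
```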
Reading slices of data
As with DataFrame, we can retrieve subsets of the data that fit in memory. This is particularly important with SparseNDArrays, which are often several gigabytes in size.
Unlike DataFrames, SparseNDArrays are always indexed by a zero-based integer offset on each dimension. Therefore, if the array is N-dimensional, the .read() method accepts an N-tuple (or list) argument that specifies how to slice the array. An empty element or slice(None) means select everything in that dimension.
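To make the coords semantics concrete, here is a hypothetical helper (not the tiledbsoma implementation) that interprets one dimension’s coordinate the way the examples in this notebook behave; note that SOMA slices include both endpoints:

```python
# Hypothetical sketch of per-dimension coordinate semantics; illustrative only.
def select_dim(coord, dim_size):
    """Return the selected indices for one dimension."""
    if coord is None or coord == slice(None):
        return list(range(dim_size))  # empty / slice(None): select all
    if isinstance(coord, slice):
        # SOMA slices are inclusive of both endpoints
        return list(range(coord.start, coord.stop + 1))
    return list(coord)  # explicit list of indices

print(select_dim([0], 10))           # [0]
print(select_dim([2, 5], 10))        # [2, 5]
print(select_dim(slice(0, 5), 10))   # [0, 1, 2, 3, 4, 5]
print(select_dim(slice(None), 4))    # [0, 1, 2, 3]
```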
For example, here’s how to fetch the first 5 rows of the matrix:
[21]:
Y = X.read([slice(0, 5)]).coos().concat()
Y
[21]:
<pyarrow.SparseCOOTensor>
type: float
shape: (2638, 1838)
Being only 5 rows, this slice fits in memory even for much larger matrices than the one in this example. Note that we can’t simply materialize it as a dense matrix, since the capacity shape is too big (running Y.to_scipy().todense() will raise an error), so we need to pass the actual shape to .coos():
[22]:
Y = X.read((slice(0, 5),)).coos((n_obs, n_var)).concat()
Y
[22]:
<pyarrow.SparseCOOTensor>
type: float
shape: (2638, 1838)
Now we can get a dense representation of it:
[23]:
Y.to_scipy().todense()
[23]:
matrix([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,
-0.2090951 , -0.5312034 ],
[-0.21458222, -0.37265295, -0.05480444, ..., -0.26684406,
-0.31314576, -0.5966544 ],
[-0.37688747, -0.2950843 , -0.0575275 , ..., -0.15865596,
-0.17087643, 1.379 ],
...,
[ 0. , 0. , 0. , ..., 0. ,
0. , 0. ],
[ 0. , 0. , 0. , ..., 0. ,
0. , 0. ],
[ 0. , 0. , 0. , ..., 0. ,
0. , 0. ]], dtype=float32)
Alternatively, we can convert it to a scipy.sparse.csr_matrix, which allows selecting specific rows:
[24]:
Z = Y.to_scipy().tocsr()
Z
[24]:
<Compressed Sparse Row sparse matrix of dtype 'float32'
with 11028 stored elements and shape (2638, 1838)>
[25]:
Z.getrow(0)
[25]:
<Compressed Sparse Row sparse matrix of dtype 'float32'
with 1838 stored elements and shape (1, 1838)>
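Row selection is cheap in CSR because the indptr array records where each row’s entries begin and end. A stdlib sketch of the lookup a row extraction performs (toy arrays, not scipy internals):

```python
# CSR stores column indices and values row-by-row;
# indptr[i]:indptr[i+1] delimits row i's entries.
indptr = [0, 2, 2, 5]        # 3 rows; row 1 is empty
indices = [1, 4, 0, 2, 3]    # column index of each stored value
data = [10.0, 20.0, 30.0, 40.0, 50.0]

def get_row(i):
    """Return (column, value) pairs stored in row i."""
    start, stop = indptr[i], indptr[i + 1]
    return list(zip(indices[start:stop], data[start:stop]))

print(get_row(0))  # [(1, 10.0), (4, 20.0)]
print(get_row(1))  # []
print(get_row(2))  # [(0, 30.0), (2, 40.0), (3, 50.0)]
```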
Similarly, we can slice the original SparseNDArray using specific rows:
[26]:
X.read([[0,2]]).coos((n_obs, n_var)).concat().to_scipy().todense()
[26]:
matrix([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,
-0.2090951 , -0.5312034 ],
[ 0. , 0. , 0. , ..., 0. ,
0. , 0. ],
[-0.37688747, -0.2950843 , -0.0575275 , ..., -0.15865596,
-0.17087643, 1.379 ],
...,
[ 0. , 0. , 0. , ..., 0. ,
0. , 0. ],
[ 0. , 0. , 0. , ..., 0. ,
0. , 0. ],
[ 0. , 0. , 0. , ..., 0. ,
0. , 0. ]], dtype=float32)
The same approach can be used to filter across all the dimensions.
[27]:
experiment.obs.read().concat().to_pandas()
[27]:
soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain | |
---|---|---|---|---|---|---|
0 | 0 | AAACATACAACCAC-1 | 781 | 0.030178 | 2419.0 | CD4 T cells |
1 | 1 | AAACATTGAGCTAC-1 | 1352 | 0.037936 | 4903.0 | B cells |
2 | 2 | AAACATTGATCAGC-1 | 1131 | 0.008897 | 3147.0 | CD4 T cells |
3 | 3 | AAACCGTGCTTCCG-1 | 960 | 0.017431 | 2639.0 | CD14+ Monocytes |
4 | 4 | AAACCGTGTATGCG-1 | 522 | 0.012245 | 980.0 | NK cells |
... | ... | ... | ... | ... | ... | ... |
2633 | 2633 | TTTCGAACTCTCAT-1 | 1155 | 0.021104 | 3459.0 | CD14+ Monocytes |
2634 | 2634 | TTTCTACTGAGGCA-1 | 1227 | 0.009294 | 3443.0 | B cells |
2635 | 2635 | TTTCTACTTCCTCG-1 | 622 | 0.021971 | 1684.0 | B cells |
2636 | 2636 | TTTGCATGAGAGGC-1 | 454 | 0.020548 | 1022.0 | B cells |
2637 | 2637 | TTTGCATGCCTCAC-1 | 724 | 0.008065 | 1984.0 | CD4 T cells |
2638 rows × 6 columns
[28]:
var = experiment.ms["RNA"].var
var.read().concat().to_pandas()
[28]:
soma_joinid | var_id | n_cells | |
---|---|---|---|
0 | 0 | TNFRSF4 | 155 |
1 | 1 | CPSF3L | 202 |
2 | 2 | ATAD3C | 9 |
3 | 3 | C1orf86 | 501 |
4 | 4 | RER1 | 608 |
... | ... | ... | ... |
1833 | 1833 | ICOSLG | 34 |
1834 | 1834 | SUMO3 | 570 |
1835 | 1835 | SLC19A1 | 31 |
1836 | 1836 | S100B | 94 |
1837 | 1837 | PRMT2 | 588 |
1838 rows × 3 columns
Exercise: compute raw counts for a gene
In this exercise, we will compute the raw counts for a gene. We will only use slices, so at no point will the full SparseNDArray be in memory.
Let’s start by looking at a specific gene (ATAD3C) in the var dataframe:
[29]:
var.read(column_names=["soma_joinid", "var_id"], value_filter="var_id == 'ATAD3C'").concat().to_pandas()
[29]:
soma_joinid | var_id | |
---|---|---|
0 | 2 | ATAD3C |
In order to verify the raw counts, we need to move to the raw layer, which can be found in the experiment:
[30]:
experiment.ms["raw"]
[30]:
<Measurement 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmpfe5a_4au/ms/raw' (open for 'r') (2 items)
'X': 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmpfe5a_4au/ms/raw/X' (unopened)
'var': 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmpfe5a_4au/ms/raw/var' (unopened)>
Next, let’s look up the same gene in the raw var dataframe:
[31]:
raw_var = experiment["ms"]["raw"].var
raw_var.read(column_names=["soma_joinid", "var_id"], value_filter="var_id == 'ATAD3C'").concat().to_pandas()
[31]:
soma_joinid | var_id | |
---|---|---|
0 | 30 | ATAD3C |
Note the soma_joinid column: it can be used to join related SOMA objects in the experiment. In this case, it indexes the second dimension of the raw.X matrix. Therefore, we just need to slice across that dimension, convert the matrix, and count the nonzero entries:
[32]:
X_raw = experiment["ms"]["raw"].X["data"]
X_raw.read((slice(None), [30])).coos().concat().to_scipy().nnz
[32]:
8
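The final step boils down to: slice the raw X matrix on its second dimension at the gene’s joinid, then count the stored entries. In COO terms that is a filter-and-count over (row, column, value) triplets, sketched here with toy data (the real read only fetches the matching slice):

```python
# Toy COO triplets for a raw counts matrix: (cell_joinid, gene_joinid, count).
entries = [
    (0, 30, 1.0),
    (1, 12, 4.0),
    (5, 30, 2.0),
    (7, 30, 1.0),
    (9, 45, 3.0),
]

gene_joinid = 30  # from the raw var lookup for ATAD3C (toy value here)
n_cells_expressing = sum(1 for _, g, _ in entries if g == gene_joinid)
print(n_cells_expressing)  # 3
```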