Tutorial: Reading SOMA Objects

In this notebook we’ll learn how to read from various SOMA objects. We assume familiarity with SOMA objects, so we recommend going through Tutorial: SOMA Objects first.

This implementation of SOMA relies on TileDB, a storage format that allows working with large files without fully loading them into memory. Files can be read either from disk or from a remote source, such as an S3 bucket.

The core feature of SOMA is reading subsets of the data using slices: only the required portion of the data is read from disk or over the network. SOMA uses Apache Arrow as its intermediate in-memory format. From there, slices can be converted into more familiar formats, such as a scipy.sparse matrix or a NumPy ndarray. Consult the Python bindings for Apache Arrow documentation for more information.

In this notebook, we will use the Peripheral Blood Mononuclear Cells (PBMC) dataset, focusing on reads from its obs DataFrame and its X SparseNDArray. This is a small dataset that fits in memory, but we’ll concentrate on operations that read only subsets of the data, so they scale to larger datasets as well.

Reading a DataFrame

Introduction

[1]:
import tiledbsoma
[2]:
import tarfile
import tempfile

sparse_uri = tempfile.mkdtemp()  # mkdtemp creates the directory and avoids the deprecated, race-prone mktemp
with tarfile.open("data/pbmc3k-sparse.tgz") as handle:
    handle.extractall(sparse_uri)
experiment = tiledbsoma.Experiment.open(sparse_uri)

All read operations are performed through the .read() method. For a DataFrame, we then call .concat() to obtain a PyArrow Table:

[3]:
obs = experiment.obs
table = obs.read().concat()
table
[3]:
pyarrow.Table
soma_joinid: int64
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: dictionary<values=string, indices=int32, ordered=0>
----
soma_joinid: [[0,1,2,3,4,...,2633,2634,2635,2636,2637]]
obs_id: [["AAACATACAACCAC-1","AAACATTGAGCTAC-1","AAACATTGATCAGC-1","AAACCGTGCTTCCG-1","AAACCGTGTATGCG-1",...,"TTTCGAACTCTCAT-1","TTTCTACTGAGGCA-1","TTTCTACTTCCTCG-1","TTTGCATGAGAGGC-1","TTTGCATGCCTCAC-1"]]
n_genes: [[781,1352,1131,960,522,...,1155,1227,622,454,724]]
percent_mito: [[0.030177759,0.037935957,0.008897362,0.017430846,0.012244898,...,0.021104366,0.00929422,0.021971496,0.020547945,0.008064516]]
n_counts: [[2419,4903,3147,2639,980,...,3459,3443,1684,1022,1984]]
louvain: [  -- dictionary:
["CD4 T cells","CD14+ Monocytes","B cells","CD8 T cells","NK cells","FCGR3A+ Monocytes","Dendritic cells","Megakaryocytes"]  -- indices:
[0,2,0,1,4,...,1,2,2,2,0]]

From here, we can directly use any of the PyArrow Table methods, for instance:

[4]:
table.sort_by([("n_genes", "descending")])
[4]:
pyarrow.Table
soma_joinid: int64
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: dictionary<values=string, indices=int32, ordered=0>
----
soma_joinid: [[270,1163,1891,926,277,...,2186,1522,662,1288,1840]]
obs_id: [["ACGAACTGGCTATG-1","CGATACGACAGGAG-1","GGGCCAACCTTGGA-1","CAGGTTGAGGATCT-1","ACGAGGGACAGGAG-1",...,"TAGTCTTGGCTGTA-1","GACGCTCTCTCTCG-1","ATCTCAACCTCGAA-1","CTAATAGAGCTATG-1","GGCATATGGGGAGT-1"]]
n_genes: [[2455,2033,2020,2000,1997,...,270,267,246,239,212]]
percent_mito: [[0.015774649,0.022166021,0.010576352,0.026962927,0.014631685,...,0,0.032258064,0,0.0016666667,0.012173913]]
n_counts: [[8875,6722,8415,8011,7928,...,652,682,609,600,575]]
louvain: [  -- dictionary:
["CD4 T cells","CD14+ Monocytes","B cells","CD8 T cells","NK cells","FCGR3A+ Monocytes","Dendritic cells","Megakaryocytes"]  -- indices:
[7,0,6,2,6,...,0,7,0,0,7]]

Alternatively, we can convert the table to a different format, such as a Pandas DataFrame:

[5]:
table.to_pandas()
[5]:
soma_joinid obs_id n_genes percent_mito n_counts louvain
0 0 AAACATACAACCAC-1 781 0.030178 2419.0 CD4 T cells
1 1 AAACATTGAGCTAC-1 1352 0.037936 4903.0 B cells
2 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 CD4 T cells
3 3 AAACCGTGCTTCCG-1 960 0.017431 2639.0 CD14+ Monocytes
4 4 AAACCGTGTATGCG-1 522 0.012245 980.0 NK cells
... ... ... ... ... ... ...
2633 2633 TTTCGAACTCTCAT-1 1155 0.021104 3459.0 CD14+ Monocytes
2634 2634 TTTCTACTGAGGCA-1 1227 0.009294 3443.0 B cells
2635 2635 TTTCTACTTCCTCG-1 622 0.021971 1684.0 B cells
2636 2636 TTTGCATGAGAGGC-1 454 0.020548 1022.0 B cells
2637 2637 TTTGCATGCCTCAC-1 724 0.008065 1984.0 CD4 T cells

2638 rows × 6 columns

Reading slices of data

As previously mentioned, the core feature of SOMA is reading slices of the data without fetching the whole dataset into memory. To do this, the .read() method supports a coords parameter for slicing.

Before we do that, let’s take a look at the schema of the obs dataframe:

[6]:
obs.schema
[6]:
soma_joinid: int64 not null
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: dictionary<values=string, indices=int32, ordered=0>

And also its domain:

[7]:
obs.domain
[7]:
((0, 2637),)

With a SOMA DataFrame, you can only slice across an indexed column, so let’s look at the indexed columns:

[8]:
obs.index_column_names
[8]:
('soma_joinid',)

In this case our index consists of just soma_joinid, an integer column used to join rows with other SOMA objects in the same experiment.

Let’s look at a few ways to slice the dataframe.

Select a single row

[9]:
obs.read([[0]]).concat().to_pandas()
[9]:
soma_joinid obs_id n_genes percent_mito n_counts louvain
0 0 AAACATACAACCAC-1 781 0.030178 2419.0 CD4 T cells

Select multiple, non-contiguous rows

[10]:
obs.read([[2, 5]]).concat().to_pandas()
[10]:
soma_joinid obs_id n_genes percent_mito n_counts louvain
0 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 CD4 T cells
1 5 AAACGCACTGGTAC-1 782 0.016644 2163.0 CD8 T cells

Select a slice of rows

[11]:
obs.read([slice(0, 5)]).concat().to_pandas()
[11]:
soma_joinid obs_id n_genes percent_mito n_counts louvain
0 0 AAACATACAACCAC-1 781 0.030178 2419.0 CD4 T cells
1 1 AAACATTGAGCTAC-1 1352 0.037936 4903.0 B cells
2 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 CD4 T cells
3 3 AAACCGTGCTTCCG-1 960 0.017431 2639.0 CD14+ Monocytes
4 4 AAACCGTGTATGCG-1 522 0.012245 980.0 NK cells
5 5 AAACGCACTGGTAC-1 782 0.016644 2163.0 CD8 T cells
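Note that SOMA coordinate slices include both endpoints, unlike Python’s half-open sequence slicing: slice(0, 5) above returned six rows, 0 through 5. A quick contrast in plain Python:

```python
rows = list(range(10))
# Python sequence slicing is half-open: the stop index is excluded.
print(rows[0:5])  # [0, 1, 2, 3, 4] - five elements
# SOMA coords interpret slice(0, 5) as doubly inclusive, which is why
# obs.read([slice(0, 5)]) returned rows 0..5 (six rows).
```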

Select a subset of columns only

[12]:
obs.read([slice(0, 5)], column_names=["obs_id", "louvain"]).concat().to_pandas()
[12]:
obs_id louvain
0 AAACATACAACCAC-1 CD4 T cells
1 AAACATTGAGCTAC-1 B cells
2 AAACATTGATCAGC-1 CD4 T cells
3 AAACCGTGCTTCCG-1 CD14+ Monocytes
4 AAACCGTGTATGCG-1 NK cells
5 AAACGCACTGGTAC-1 CD8 T cells

Filter data using complex queries

SOMA also supports filtering data with more complex queries. For a more detailed reference, take a look at the query condition source code.

Here are a few examples:

Filter all cells with a Louvain categorization of “B cells”

[13]:
obs.read(value_filter="louvain == 'B cells'").concat().to_pandas()
[13]:
soma_joinid obs_id n_genes percent_mito n_counts louvain
0 1 AAACATTGAGCTAC-1 1352 0.037936 4903.0 B cells
1 10 AAACTTGAAAAACG-1 1116 0.026316 3914.0 B cells
2 18 AAAGGCCTGTCTAG-1 1446 0.015283 4973.0 B cells
3 19 AAAGTTTGATCACG-1 446 0.034700 1268.0 B cells
4 20 AAAGTTTGGGGTGA-1 1020 0.025907 3281.0 B cells
... ... ... ... ... ... ...
337 2628 TTTCAGTGTCACGA-1 700 0.034314 1632.0 B cells
338 2630 TTTCAGTGTGCAGT-1 637 0.018925 1321.0 B cells
339 2634 TTTCTACTGAGGCA-1 1227 0.009294 3443.0 B cells
340 2635 TTTCTACTTCCTCG-1 622 0.021971 1684.0 B cells
341 2636 TTTGCATGAGAGGC-1 454 0.020548 1022.0 B cells

342 rows × 6 columns

Filter all cells with a Louvain categorization of either “CD4 T cells” or “CD8 T cells”

[14]:
obs.read(value_filter="(louvain == 'CD4 T cells') or (louvain == 'CD8 T cells')").concat().to_pandas()
[14]:
soma_joinid obs_id n_genes percent_mito n_counts louvain
0 0 AAACATACAACCAC-1 781 0.030178 2419.0 CD4 T cells
1 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 CD4 T cells
2 5 AAACGCACTGGTAC-1 782 0.016644 2163.0 CD8 T cells
3 6 AAACGCTGACCAGT-1 783 0.038161 2175.0 CD8 T cells
4 7 AAACGCTGGTTCTT-1 790 0.030973 2260.0 CD8 T cells
... ... ... ... ... ... ...
1455 2621 TTTAGCTGATACCG-1 887 0.022876 2754.0 CD4 T cells
1456 2626 TTTCACGAGGTTCA-1 721 0.013261 2036.0 CD4 T cells
1457 2627 TTTCAGTGGAAGGC-1 692 0.015169 1780.0 CD8 T cells
1458 2631 TTTCCAGAGGTGAG-1 873 0.006859 2187.0 CD4 T cells
1459 2637 TTTGCATGCCTCAC-1 724 0.008065 1984.0 CD4 T cells

1460 rows × 6 columns

Filter all cells with a Louvain categorization of “CD4 T cells” and more than 1500 genes

[15]:
obs.read(value_filter="(louvain == 'CD4 T cells') and (n_genes > 1500)").concat().to_pandas()
[15]:
soma_joinid obs_id n_genes percent_mito n_counts louvain
0 26 AAATCAACCCTATT-1 1545 0.024313 5676.0 CD4 T cells
1 357 ACTCTCCTGCATAC-1 1750 0.017436 5850.0 CD4 T cells
2 473 AGCTGCCTTTCATC-1 1703 0.029547 5212.0 CD4 T cells
3 945 CATACTTGGGTTAC-1 1938 0.023580 7167.0 CD4 T cells
4 1163 CGATACGACAGGAG-1 2033 0.022166 6722.0 CD4 T cells
5 1320 CTATACTGTTCGTT-1 1543 0.012395 4760.0 CD4 T cells
6 1548 GAGCATACTTTGCT-1 1753 0.016739 6691.0 CD4 T cells
7 1993 GTGATGACAAGTGA-1 1819 0.021172 6329.0 CD4 T cells
8 2313 TCGGACCTGTACAC-1 1567 0.014288 5599.0 CD4 T cells
9 2365 TGAGACACAAGGTA-1 1549 0.013242 5135.0 CD4 T cells
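A value_filter is applied while reading, so only matching rows reach memory. As a rough in-memory analogy (a standalone pandas sketch on made-up data, not the SOMA API), the last filter corresponds to:

```python
import pandas as pd

# Hypothetical miniature obs table.
obs_df = pd.DataFrame({
    "louvain": ["CD4 T cells", "B cells", "CD4 T cells"],
    "n_genes": [1545, 1352, 900],
})
# Equivalent of value_filter="(louvain == 'CD4 T cells') and (n_genes > 1500)".
out = obs_df[(obs_df["louvain"] == "CD4 T cells") & (obs_df["n_genes"] > 1500)]
print(len(out))  # 1
```

The difference is that pandas filters after loading everything, while SOMA pushes the condition down to the read itself.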

Reading a SparseNDArray

For SparseNDArray, let’s consider the X matrix:

[16]:
X = experiment.ms["RNA"].X["data"]
X
[16]:
<SparseNDArray 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmpfe5a_4au/ms/RNA/X/data' (open for 'r')>

Similarly to DataFrame, we need to use the .read() method:

[17]:
X.read()
[17]:
<tiledbsoma._sparse_nd_array.SparseNDArrayRead at 0x119ba4ad0>

In this case, we have two options: .tables(), which yields the data as Arrow Tables of coordinates and values, and .coos(), which yields Arrow SparseCOOTensors. Let’s start by converting this into an Arrow SparseCOOTensor:

[18]:
tensor = X.read().coos().concat()
tensor
[18]:
<pyarrow.SparseCOOTensor>
type: float
shape: (2638, 1838)

In this example, we obtain a 2-dimensional tensor (a matrix). Note that shape here indicates the capacity of the tensor, rather than the actual size.

By default, a SparseNDArray is created with a much larger capacity than needed, to accommodate further writes. Since this is a read scenario and the shape of the matrix is known, we can pass that shape to .coos() so the resulting tensor is sized accordingly:

[19]:
n_obs = len(obs)
n_var = len(experiment.ms["RNA"].var)

tensor = X.read().coos((n_obs, n_var)).concat()
tensor
[19]:
<pyarrow.SparseCOOTensor>
type: float
shape: (2638, 1838)

We can convert this to a scipy.sparse.coo_matrix:

[20]:
tensor.to_scipy()
[20]:
<COOrdinate sparse matrix of dtype 'float32'
        with 4848644 stored elements and shape (2638, 1838)>
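The distinction between capacity and stored elements can be illustrated with scipy directly. A standalone sketch with made-up coordinates:

```python
import numpy as np
from scipy.sparse import coo_matrix

data = np.array([1.0, 2.0], dtype=np.float32)
rows, cols = np.array([0, 1]), np.array([0, 3])
# The same two stored elements can live in matrices of very different shapes:
# shape declares capacity, independent of how many entries are actually stored.
small = coo_matrix((data, (rows, cols)), shape=(2, 4))
large = coo_matrix((data, (rows, cols)), shape=(2638, 1838))
print(small.nnz, large.nnz)  # 2 2
```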

Reading slices of data

As with DataFrame, we can retrieve subsets of the data that fit in memory. This is particularly important for SparseNDArrays, which are often several gigabytes in size.

Unlike DataFrames, SparseNDArrays are always indexed by zero-based integer offsets on each dimension. If the array is N-dimensional, the .read() method accepts an N-tuple (or list) argument specifying how to slice each dimension; an empty element or slice(None) selects everything in that dimension.

For example, here’s how to fetch the first 5 rows of the matrix:

[21]:
Y = X.read([slice(0, 5)]).coos().concat()
Y
[21]:
<pyarrow.SparseCOOTensor>
type: float
shape: (2638, 1838)

Being only 5 rows, this slice fits in memory even for matrices much larger than the one in this example. Note that we can’t simply materialize it as a dense matrix, since the default shape is too big (running Y.to_scipy().todense() will raise an error), so we need to pass the known shape explicitly:

[22]:
Y = X.read((slice(0, 5),)).coos((n_obs, n_var)).concat()
Y
[22]:
<pyarrow.SparseCOOTensor>
type: float
shape: (2638, 1838)

Now we can get a dense representation of it:

[23]:
Y.to_scipy().todense()
[23]:
matrix([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,
         -0.2090951 , -0.5312034 ],
        [-0.21458222, -0.37265295, -0.05480444, ..., -0.26684406,
         -0.31314576, -0.5966544 ],
        [-0.37688747, -0.2950843 , -0.0575275 , ..., -0.15865596,
         -0.17087643,  1.379     ],
        ...,
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]], dtype=float32)

Alternatively, we can convert it to a scipy.sparse.csr_matrix, which allows selecting specific rows:

[24]:
Z = Y.to_scipy().tocsr()
Z
[24]:
<Compressed Sparse Row sparse matrix of dtype 'float32'
        with 11028 stored elements and shape (2638, 1838)>
[25]:
Z.getrow(0)
[25]:
<Compressed Sparse Row sparse matrix of dtype 'float32'
        with 1838 stored elements and shape (1, 1838)>

Similarly, we can slice the original SparseNDArray by listing individual rows. Note that the result keeps the original coordinates, so rows that were not selected (like row 1 below) are simply empty:

[26]:
X.read([[0,2]]).coos((n_obs, n_var)).concat().to_scipy().todense()
[26]:
matrix([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,
         -0.2090951 , -0.5312034 ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [-0.37688747, -0.2950843 , -0.0575275 , ..., -0.15865596,
         -0.17087643,  1.379     ],
        ...,
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]], dtype=float32)

The same approach can be used to slice across any of the dimensions.
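As an in-memory analogy (a standalone scipy sketch, not the SOMA API), selecting along both dimensions looks like the following. Note that scipy indexing compacts the result, whereas a SOMA read keeps the original coordinates, but the selection idea is the same:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical small matrix; select rows 0 and 2, then columns 1 and 3,
# mirroring a two-dimensional X.read(([0, 2], [1, 3])) selection.
m = csr_matrix(np.arange(30, dtype=np.float32).reshape(6, 5))
sub = m[[0, 2], :][:, [1, 3]]
print(sub.shape)  # (2, 2)
```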

[27]:
experiment.obs.read().concat().to_pandas()
[27]:
soma_joinid obs_id n_genes percent_mito n_counts louvain
0 0 AAACATACAACCAC-1 781 0.030178 2419.0 CD4 T cells
1 1 AAACATTGAGCTAC-1 1352 0.037936 4903.0 B cells
2 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 CD4 T cells
3 3 AAACCGTGCTTCCG-1 960 0.017431 2639.0 CD14+ Monocytes
4 4 AAACCGTGTATGCG-1 522 0.012245 980.0 NK cells
... ... ... ... ... ... ...
2633 2633 TTTCGAACTCTCAT-1 1155 0.021104 3459.0 CD14+ Monocytes
2634 2634 TTTCTACTGAGGCA-1 1227 0.009294 3443.0 B cells
2635 2635 TTTCTACTTCCTCG-1 622 0.021971 1684.0 B cells
2636 2636 TTTGCATGAGAGGC-1 454 0.020548 1022.0 B cells
2637 2637 TTTGCATGCCTCAC-1 724 0.008065 1984.0 CD4 T cells

2638 rows × 6 columns

[28]:
var = experiment.ms["RNA"].var
var.read().concat().to_pandas()
[28]:
soma_joinid var_id n_cells
0 0 TNFRSF4 155
1 1 CPSF3L 202
2 2 ATAD3C 9
3 3 C1orf86 501
4 4 RER1 608
... ... ... ...
1833 1833 ICOSLG 34
1834 1834 SUMO3 570
1835 1835 SLC19A1 31
1836 1836 S100B 94
1837 1837 PRMT2 588

1838 rows × 3 columns

Exercise: compute raw counts for a gene

In this exercise, we will compute the raw counts for a gene. We will only use slices, so at no point will the SparseNDArray be fully loaded into memory.

Let’s start by looking at a specific gene (ATAD3C) in the var dataframe:

[29]:
var.read(column_names=["soma_joinid", "var_id"], value_filter="var_id == 'ATAD3C'").concat().to_pandas()
[29]:
soma_joinid var_id
0 2 ATAD3C

To retrieve the raw counts, we need to move to the raw layer, which is stored as a separate measurement in the experiment:

[30]:
experiment.ms["raw"]
[30]:
<Measurement 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmpfe5a_4au/ms/raw' (open for 'r') (2 items)
    'X': 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmpfe5a_4au/ms/raw/X' (unopened)
    'var': 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmpfe5a_4au/ms/raw/var' (unopened)>

Let’s start by looking up the same gene in the raw var dataframe:

[31]:
raw_var = experiment.ms["raw"].var
raw_var.read(column_names=["soma_joinid", "var_id"], value_filter="var_id == 'ATAD3C'").concat().to_pandas()
[31]:
soma_joinid var_id
0 30 ATAD3C

Note the soma_joinid column: this column joins related SOMA objects within the experiment. Here it indexes the second dimension of the raw X matrix, so we just need to slice across that dimension, convert the matrix, and count the nonzero entries:
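The join-id lookup itself can be sketched with plain pandas (hypothetical ids mirroring the cells above):

```python
import pandas as pd

# Hypothetical slice of the raw var dataframe; only ATAD3C's id matches the notebook.
raw_var_df = pd.DataFrame({
    "soma_joinid": [29, 30, 31],
    "var_id": ["GENE_A", "ATAD3C", "GENE_B"],
})
# The join id found in var is the coordinate to use on X's second dimension.
gene_col = int(raw_var_df.loc[raw_var_df["var_id"] == "ATAD3C", "soma_joinid"].iloc[0])
print(gene_col)  # 30
```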

[32]:
X_raw = experiment.ms["raw"].X["data"]
X_raw.read((slice(None), [30])).coos().concat().to_scipy().nnz
[32]:
8
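The final nnz count works because a sparse matrix only stores nonzero entries. A standalone scipy sketch with made-up counts:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Three cells (rows 0, 3, 7) have nonzero raw counts for a single gene column.
counts = coo_matrix(
    (np.array([2.0, 1.0, 5.0]), (np.array([0, 3, 7]), np.array([0, 0, 0]))),
    shape=(10, 1),
)
print(counts.nnz)  # 3 cells express this gene
```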