Tutorial: Reading SOMA Objects

In this notebook we’ll learn how to read from various SOMA objects. We assume familiarity with SOMA objects, so we recommend going through Tutorial: SOMA Objects first.

This implementation of SOMA relies on TileDB, a storage format that allows working with large files without fully loading them into memory. Files can be read either from disk or from a remote source, such as an S3 bucket.

The core feature of SOMA is reading subsets of the data using slices: only the required portion of the data is read from disk or over the network. SOMA uses Apache Arrow as its intermediate in-memory format. From there, slices can be converted into more familiar formats, such as a scipy.sparse matrix or a NumPy ndarray. Consult the Python bindings for Apache Arrow documentation for more information.

In this notebook, we will use the Peripheral Blood Mononuclear Cells (PBMC) dataset, focusing on reads from its obs DataFrame and its X SparseNDArray. This is a small dataset that fits in memory, but we’ll concentrate on operations that read only subsets of the data, so they scale to larger datasets as well.

Reading a DataFrame

Introduction

[1]:
import tiledbsoma
[2]:
import tarfile
import tempfile

sparse_uri = tempfile.mkdtemp()  # mkdtemp creates the directory and avoids the deprecated, race-prone mktemp
with tarfile.open("data/pbmc3k-sparse.tgz") as handle:
    handle.extractall(sparse_uri)
experiment = tiledbsoma.Experiment.open(sparse_uri)

All read operations are performed through the .read() method. For a DataFrame, we then call .concat() to obtain a PyArrow Table:

[3]:
obs = experiment.obs
table = obs.read().concat()
table
[3]:
pyarrow.Table
soma_joinid: int64
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: dictionary<values=string, indices=int32, ordered=0>
----
soma_joinid: [[0,1,2,3,4,...,2633,2634,2635,2636,2637]]
obs_id: [["AAACATACAACCAC-1","AAACATTGAGCTAC-1","AAACATTGATCAGC-1","AAACCGTGCTTCCG-1","AAACCGTGTATGCG-1",...,"TTTCGAACTCTCAT-1","TTTCTACTGAGGCA-1","TTTCTACTTCCTCG-1","TTTGCATGAGAGGC-1","TTTGCATGCCTCAC-1"]]
n_genes: [[781,1352,1131,960,522,...,1155,1227,622,454,724]]
percent_mito: [[0.030177759,0.037935957,0.008897362,0.017430846,0.012244898,...,0.021104366,0.00929422,0.021971496,0.020547945,0.008064516]]
n_counts: [[2419,4903,3147,2639,980,...,3459,3443,1684,1022,1984]]
louvain: [  -- dictionary:
["CD4 T cells","CD14+ Monocytes","B cells","CD8 T cells","NK cells","FCGR3A+ Monocytes","Dendritic cells","Megakaryocytes"]  -- indices:
[0,2,0,1,4,...,1,2,2,2,0]]

From here, we can directly use any of the PyArrow Table methods, for instance:

[4]:
table.sort_by([("n_genes", "descending")])
[4]:
pyarrow.Table
soma_joinid: int64
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: dictionary<values=string, indices=int32, ordered=0>
----
soma_joinid: [[270,1163,1891,926,277,...,2186,1522,662,1288,1840]]
obs_id: [["ACGAACTGGCTATG-1","CGATACGACAGGAG-1","GGGCCAACCTTGGA-1","CAGGTTGAGGATCT-1","ACGAGGGACAGGAG-1",...,"TAGTCTTGGCTGTA-1","GACGCTCTCTCTCG-1","ATCTCAACCTCGAA-1","CTAATAGAGCTATG-1","GGCATATGGGGAGT-1"]]
n_genes: [[2455,2033,2020,2000,1997,...,270,267,246,239,212]]
percent_mito: [[0.015774649,0.022166021,0.010576352,0.026962927,0.014631685,...,0,0.032258064,0,0.0016666667,0.012173913]]
n_counts: [[8875,6722,8415,8011,7928,...,652,682,609,600,575]]
louvain: [  -- dictionary:
["CD4 T cells","CD14+ Monocytes","B cells","CD8 T cells","NK cells","FCGR3A+ Monocytes","Dendritic cells","Megakaryocytes"]  -- indices:
[7,0,6,2,6,...,0,7,0,0,7]]

Alternatively, we can convert the table to a different format, such as a Pandas DataFrame:

[5]:
table.to_pandas()
[5]:
soma_joinid obs_id n_genes percent_mito n_counts louvain
0 0 AAACATACAACCAC-1 781 0.030178 2419.0 CD4 T cells
1 1 AAACATTGAGCTAC-1 1352 0.037936 4903.0 B cells
2 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 CD4 T cells
3 3 AAACCGTGCTTCCG-1 960 0.017431 2639.0 CD14+ Monocytes
4 4 AAACCGTGTATGCG-1 522 0.012245 980.0 NK cells
... ... ... ... ... ... ...
2633 2633 TTTCGAACTCTCAT-1 1155 0.021104 3459.0 CD14+ Monocytes
2634 2634 TTTCTACTGAGGCA-1 1227 0.009294 3443.0 B cells
2635 2635 TTTCTACTTCCTCG-1 622 0.021971 1684.0 B cells
2636 2636 TTTGCATGAGAGGC-1 454 0.020548 1022.0 B cells
2637 2637 TTTGCATGCCTCAC-1 724 0.008065 1984.0 CD4 T cells

2638 rows × 6 columns

Reading slices of data

As previously mentioned, the core feature of SOMA is reading slices of the data without fetching the whole dataset into memory. To do this, the .read() method supports a coords parameter for slicing.

Before we do that, let’s take a look at the schema of the obs dataframe:

[6]:
obs.schema
[6]:
soma_joinid: int64 not null
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: dictionary<values=string, indices=int32, ordered=0>

And also its domain:

[7]:
obs.domain
[7]:
((0, 2637),)

With a SOMA DataFrame, you can only slice across an indexed column, so let’s look at the indexed columns:

[8]:
obs.index_column_names
[8]:
('soma_joinid',)

In this case our index consists of just soma_joinid, an integer column used to join rows with other SOMA objects in the same experiment.

Let’s look at a few ways to slice the dataframe.

Select a single row

[9]:
obs.read([[0]]).concat().to_pandas()
[9]:
soma_joinid obs_id n_genes percent_mito n_counts louvain
0 0 AAACATACAACCAC-1 781 0.030178 2419.0 CD4 T cells

Select multiple, non-contiguous rows

[10]:
obs.read([[2, 5]]).concat().to_pandas()
[10]:
soma_joinid obs_id n_genes percent_mito n_counts louvain
0 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 CD4 T cells
1 5 AAACGCACTGGTAC-1 782 0.016644 2163.0 CD8 T cells

Select a slice of rows

[11]:
obs.read([slice(0, 5)]).concat().to_pandas()
[11]:
soma_joinid obs_id n_genes percent_mito n_counts louvain
0 0 AAACATACAACCAC-1 781 0.030178 2419.0 CD4 T cells
1 1 AAACATTGAGCTAC-1 1352 0.037936 4903.0 B cells
2 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 CD4 T cells
3 3 AAACCGTGCTTCCG-1 960 0.017431 2639.0 CD14+ Monocytes
4 4 AAACCGTGTATGCG-1 522 0.012245 980.0 NK cells
5 5 AAACGCACTGGTAC-1 782 0.016644 2163.0 CD8 T cells
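Note that SOMA coordinate slices include both endpoints, unlike Python’s half-open sequence slicing: slice(0, 5) above returned six rows, 0 through 5. A quick contrast in plain Python:

```python
rows = list(range(10))
# Python sequence slicing is half-open: the stop index is excluded.
print(rows[0:5])  # [0, 1, 2, 3, 4] - five elements
# SOMA coords interpret slice(0, 5) as doubly inclusive, which is why
# obs.read([slice(0, 5)]) returned rows 0..5 (six rows).
```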

Select a subset of columns only

[12]:
obs.read([slice(0, 5)], column_names=["obs_id", "louvain"]).concat().to_pandas()
[12]:
obs_id louvain
0 AAACATACAACCAC-1 CD4 T cells
1 AAACATTGAGCTAC-1 B cells
2 AAACATTGATCAGC-1 CD4 T cells
3 AAACCGTGCTTCCG-1 CD14+ Monocytes
4 AAACCGTGTATGCG-1 NK cells
5 AAACGCACTGGTAC-1 CD8 T cells

Filter data using complex queries

SOMA also supports filtering data with more complex queries. For a more detailed reference, take a look at the query condition source code.

Here are a few examples:

Filter all cells with a Louvain categorization of “B cells”

[13]:
obs.read(value_filter="louvain == 'B cells'").concat().to_pandas()
[13]:
soma_joinid obs_id n_genes percent_mito n_counts louvain
0 1 AAACATTGAGCTAC-1 1352 0.037936 4903.0 B cells
1 10 AAACTTGAAAAACG-1 1116 0.026316 3914.0 B cells
2 18 AAAGGCCTGTCTAG-1 1446 0.015283 4973.0 B cells
3 19 AAAGTTTGATCACG-1 446 0.034700 1268.0 B cells
4 20 AAAGTTTGGGGTGA-1 1020 0.025907 3281.0 B cells
... ... ... ... ... ... ...
337 2628 TTTCAGTGTCACGA-1 700 0.034314 1632.0 B cells
338 2630 TTTCAGTGTGCAGT-1 637 0.018925 1321.0 B cells
339 2634 TTTCTACTGAGGCA-1 1227 0.009294 3443.0 B cells
340 2635 TTTCTACTTCCTCG-1 622 0.021971 1684.0 B cells
341 2636 TTTGCATGAGAGGC-1 454 0.020548 1022.0 B cells

342 rows × 6 columns

Filter all cells with a Louvain categorization of either “CD4 T cells” or “CD8 T cells”

[14]:
obs.read(value_filter="(louvain == 'CD4 T cells') or (louvain == 'CD8 T cells')").concat().to_pandas()
[14]:
soma_joinid obs_id n_genes percent_mito n_counts louvain
0 0 AAACATACAACCAC-1 781 0.030178 2419.0 CD4 T cells
1 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 CD4 T cells
2 5 AAACGCACTGGTAC-1 782 0.016644 2163.0 CD8 T cells
3 6 AAACGCTGACCAGT-1 783 0.038161 2175.0 CD8 T cells
4 7 AAACGCTGGTTCTT-1 790 0.030973 2260.0 CD8 T cells
... ... ... ... ... ... ...
1455 2621 TTTAGCTGATACCG-1 887 0.022876 2754.0 CD4 T cells
1456 2626 TTTCACGAGGTTCA-1 721 0.013261 2036.0 CD4 T cells
1457 2627 TTTCAGTGGAAGGC-1 692 0.015169 1780.0 CD8 T cells
1458 2631 TTTCCAGAGGTGAG-1 873 0.006859 2187.0 CD4 T cells
1459 2637 TTTGCATGCCTCAC-1 724 0.008065 1984.0 CD4 T cells

1460 rows × 6 columns

Filter all cells with a Louvain categorization of “CD4 T cells” and more than 1500 genes

[15]:
obs.read(value_filter="(louvain == 'CD4 T cells') and (n_genes > 1500)").concat().to_pandas()
[15]:
soma_joinid obs_id n_genes percent_mito n_counts louvain
0 26 AAATCAACCCTATT-1 1545 0.024313 5676.0 CD4 T cells
1 357 ACTCTCCTGCATAC-1 1750 0.017436 5850.0 CD4 T cells
2 473 AGCTGCCTTTCATC-1 1703 0.029547 5212.0 CD4 T cells
3 945 CATACTTGGGTTAC-1 1938 0.023580 7167.0 CD4 T cells
4 1163 CGATACGACAGGAG-1 2033 0.022166 6722.0 CD4 T cells
5 1320 CTATACTGTTCGTT-1 1543 0.012395 4760.0 CD4 T cells
6 1548 GAGCATACTTTGCT-1 1753 0.016739 6691.0 CD4 T cells
7 1993 GTGATGACAAGTGA-1 1819 0.021172 6329.0 CD4 T cells
8 2313 TCGGACCTGTACAC-1 1567 0.014288 5599.0 CD4 T cells
9 2365 TGAGACACAAGGTA-1 1549 0.013242 5135.0 CD4 T cells
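A value_filter is applied while reading, so only matching rows reach memory. As a rough in-memory analogy (a standalone pandas sketch on made-up data, not the SOMA API), the last filter corresponds to:

```python
import pandas as pd

# Hypothetical miniature obs table.
obs_df = pd.DataFrame({
    "louvain": ["CD4 T cells", "B cells", "CD4 T cells"],
    "n_genes": [1545, 1352, 900],
})
# Equivalent of value_filter="(louvain == 'CD4 T cells') and (n_genes > 1500)".
out = obs_df[(obs_df["louvain"] == "CD4 T cells") & (obs_df["n_genes"] > 1500)]
print(len(out))  # 1
```

The difference is that pandas filters after loading everything, while SOMA pushes the condition down to the read itself.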

Reading a SparseNDArray

For SparseNDArray, let’s consider the X matrix:

[16]:
X = experiment.ms["RNA"].X["data"]
X
[16]:
<SparseNDArray 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmpfe5a_4au/ms/RNA/X/data' (open for 'r')>

Similarly to DataFrame, we need to use the .read() method:

[17]:
X.read()
[17]:
<tiledbsoma._sparse_nd_array.SparseNDArrayRead at 0x119ba4ad0>

In this case, we have two options: .tables(), which yields the data as Arrow Tables of coordinates and values, and .coos(), which yields Arrow SparseCOOTensors. Let’s start by converting this into an Arrow SparseCOOTensor:

[18]:
tensor = X.read().coos().concat()
tensor
[18]:
<pyarrow.SparseCOOTensor>
type: float
shape: (2638, 1838)

In this example, we obtain a 2-dimensional tensor (a matrix). Note that shape here indicates the capacity of the tensor, rather than the actual size.

By default, a SparseNDArray is created with a much larger capacity than needed, to accommodate further writes. Since this is a read scenario and the shape of the matrix is known, we can pass that shape to .coos() so the resulting tensor is sized accordingly:

[19]:
n_obs = len(obs)
n_var = len(experiment.ms["RNA"].var)

tensor = X.read().coos((n_obs, n_var)).concat()
tensor
[19]:
<pyarrow.SparseCOOTensor>
type: float
shape: (2638, 1838)

We can convert this to a scipy.sparse.coo_matrix:

[20]:
tensor.to_scipy()
[20]:
<COOrdinate sparse matrix of dtype 'float32'
        with 4848644 stored elements and shape (2638, 1838)>
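The distinction between capacity and stored elements can be illustrated with scipy directly. A standalone sketch with made-up coordinates:

```python
import numpy as np
from scipy.sparse import coo_matrix

data = np.array([1.0, 2.0], dtype=np.float32)
rows, cols = np.array([0, 1]), np.array([0, 3])
# The same two stored elements can live in matrices of very different shapes:
# shape declares capacity, independent of how many entries are actually stored.
small = coo_matrix((data, (rows, cols)), shape=(2, 4))
large = coo_matrix((data, (rows, cols)), shape=(2638, 1838))
print(small.nnz, large.nnz)  # 2 2
```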

Reading slices of data

As with DataFrame, we can retrieve subsets of the data that fit in memory. This is particularly important for SparseNDArrays, which are often several gigabytes in size.

Unlike DataFrames, SparseNDArrays are always indexed by zero-based integer offsets on each dimension. If the array is N-dimensional, the .read() method accepts an N-tuple (or list) argument specifying how to slice each dimension; an empty element or slice(None) selects everything in that dimension.

For example, here’s how to fetch the first 5 rows of the matrix:

[21]:
Y = X.read([slice(0, 5)]).coos().concat()
Y
[21]:
<pyarrow.SparseCOOTensor>
type: float
shape: (2638, 1838)

Being only 5 rows, this slice fits in memory even for matrices much larger than the one in this example. Note that we can’t simply materialize it as a dense matrix, since the default shape is too big (running Y.to_scipy().todense() will raise an error), so we need to pass the known shape explicitly:

[22]:
Y = X.read((slice(0, 5),)).coos((n_obs, n_var)).concat()
Y
[22]:
<pyarrow.SparseCOOTensor>
type: float
shape: (2638, 1838)

Now we can get a dense representation of it:

[23]:
Y.to_scipy().todense()
[23]:
matrix([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,
         -0.2090951 , -0.5312034 ],
        [-0.21458222, -0.37265295, -0.05480444, ..., -0.26684406,
         -0.31314576, -0.5966544 ],
        [-0.37688747, -0.2950843 , -0.0575275 , ..., -0.15865596,
         -0.17087643,  1.379     ],
        ...,
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]], dtype=float32)

Alternatively, we can convert it to a scipy.sparse.csr_matrix, which allows selecting specific rows:

[24]:
Z = Y.to_scipy().tocsr()
Z
[24]:
<Compressed Sparse Row sparse matrix of dtype 'float32'
        with 11028 stored elements and shape (2638, 1838)>
[25]:
Z.getrow(0)
[25]:
<Compressed Sparse Row sparse matrix of dtype 'float32'
        with 1838 stored elements and shape (1, 1838)>

Similarly, we can slice the original SparseNDArray by listing individual rows. Note that the result keeps the original coordinates, so rows that were not selected (like row 1 below) are simply empty:

[26]:
X.read([[0,2]]).coos((n_obs, n_var)).concat().to_scipy().todense()
[26]:
matrix([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,
         -0.2090951 , -0.5312034 ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [-0.37688747, -0.2950843 , -0.0575275 , ..., -0.15865596,
         -0.17087643,  1.379     ],
        ...,
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]], dtype=float32)

The same approach can be used to slice across any of the dimensions.
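As an in-memory analogy (a standalone scipy sketch, not the SOMA API), selecting along both dimensions looks like the following. Note that scipy indexing compacts the result, whereas a SOMA read keeps the original coordinates, but the selection idea is the same:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical small matrix; select rows 0 and 2, then columns 1 and 3,
# mirroring a two-dimensional X.read(([0, 2], [1, 3])) selection.
m = csr_matrix(np.arange(30, dtype=np.float32).reshape(6, 5))
sub = m[[0, 2], :][:, [1, 3]]
print(sub.shape)  # (2, 2)
```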

[27]:
experiment.obs.read().concat().to_pandas()
[27]:
soma_joinid obs_id n_genes percent_mito n_counts louvain
0 0 AAACATACAACCAC-1 781 0.030178 2419.0 CD4 T cells
1 1 AAACATTGAGCTAC-1 1352 0.037936 4903.0 B cells
2 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 CD4 T cells
3 3 AAACCGTGCTTCCG-1 960 0.017431 2639.0 CD14+ Monocytes
4 4 AAACCGTGTATGCG-1 522 0.012245 980.0 NK cells
... ... ... ... ... ... ...
2633 2633 TTTCGAACTCTCAT-1 1155 0.021104 3459.0 CD14+ Monocytes
2634 2634 TTTCTACTGAGGCA-1 1227 0.009294 3443.0 B cells
2635 2635 TTTCTACTTCCTCG-1 622 0.021971 1684.0 B cells
2636 2636 TTTGCATGAGAGGC-1 454 0.020548 1022.0 B cells
2637 2637 TTTGCATGCCTCAC-1 724 0.008065 1984.0 CD4 T cells

2638 rows × 6 columns

[28]:
var = experiment.ms["RNA"].var
var.read().concat().to_pandas()
[28]:
soma_joinid var_id n_cells
0 0 TNFRSF4 155
1 1 CPSF3L 202
2 2 ATAD3C 9
3 3 C1orf86 501
4 4 RER1 608
... ... ... ...
1833 1833 ICOSLG 34
1834 1834 SUMO3 570
1835 1835 SLC19A1 31
1836 1836 S100B 94
1837 1837 PRMT2 588

1838 rows × 3 columns

Exercise: compute raw counts for a gene

In this exercise, we will compute the raw counts for a gene. We will only use slices, so at no point will the SparseNDArray be fully loaded into memory.

Let’s start by looking at a specific gene (ATAD3C) in the var dataframe:

[29]:
var.read(column_names=["soma_joinid", "var_id"], value_filter="var_id == 'ATAD3C'").concat().to_pandas()
[29]:
soma_joinid var_id
0 2 ATAD3C

To retrieve the raw counts, we need to move to the raw layer, which is stored as a separate measurement in the experiment:

[30]:
experiment.ms["raw"]
[30]:
<Measurement 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmpfe5a_4au/ms/raw' (open for 'r') (2 items)
    'X': 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmpfe5a_4au/ms/raw/X' (unopened)
    'var': 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmpfe5a_4au/ms/raw/var' (unopened)>

Let’s start by looking up the same gene in the raw var dataframe:

[31]:
raw_var = experiment.ms["raw"].var
raw_var.read(column_names=["soma_joinid", "var_id"], value_filter="var_id == 'ATAD3C'").concat().to_pandas()
[31]:
soma_joinid var_id
0 30 ATAD3C

Note the soma_joinid column: this column joins related SOMA objects within the experiment. Here it indexes the second dimension of the raw X matrix, so we just need to slice across that dimension, convert the matrix, and count the nonzero entries:
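The join-id lookup itself can be sketched with plain pandas (hypothetical ids mirroring the cells above):

```python
import pandas as pd

# Hypothetical slice of the raw var dataframe; only ATAD3C's id matches the notebook.
raw_var_df = pd.DataFrame({
    "soma_joinid": [29, 30, 31],
    "var_id": ["GENE_A", "ATAD3C", "GENE_B"],
})
# The join id found in var is the coordinate to use on X's second dimension.
gene_col = int(raw_var_df.loc[raw_var_df["var_id"] == "ATAD3C", "soma_joinid"].iloc[0])
print(gene_col)  # 30
```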

[32]:
X_raw = experiment.ms["raw"].X["data"]
X_raw.read((slice(None), [30])).coos().concat().to_scipy().nnz
[32]:
8
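The final nnz count works because a sparse matrix only stores nonzero entries. A standalone scipy sketch with made-up counts:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Three cells (rows 0, 3, 7) have nonzero raw counts for a single gene column.
counts = coo_matrix(
    (np.array([2.0, 1.0, 5.0]), (np.array([0, 3, 7]), np.array([0, 0, 0]))),
    shape=(10, 1),
)
print(counts.nnz)  # 3 cells express this gene
```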