Tutorial: SOMA Experiment queries

[1]:
import tiledbsoma as soma

In this notebook, we’ll take a quick look at the SOMA experiment-query API. The dataset contains peripheral blood mononuclear cells (PBMCs) and is freely available from 10x Genomics.

First we’ll unpack and open the experiment:

[2]:
import tarfile
import tempfile

# mkdtemp() creates the scratch directory itself; mktemp() is deprecated and race-prone.
sparse_uri = tempfile.mkdtemp()
with tarfile.open("data/pbmc3k-sparse.tgz") as handle:
    handle.extractall(sparse_uri)
exp = soma.Experiment.open(sparse_uri)
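
The unpacking step can also be written so that the scratch directory is cleaned up automatically. The following is a minimal sketch using only the standard library. The helper name `extract_archive` and the throwaway `demo.tgz` archive are illustrative assumptions, standing in for `data/pbmc3k-sparse.tgz`.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a tar archive into dest_dir and return the top-level entry names."""
    with tarfile.open(archive_path) as handle:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters; "data" rejects
            # unsafe members such as absolute paths.
            handle.extractall(dest_dir, filter="data")
        else:
            handle.extractall(dest_dir)
    return sorted(p.name for p in Path(dest_dir).iterdir())


# Build a tiny archive on the fly so the sketch runs without the tutorial data.
with tempfile.TemporaryDirectory() as workdir:
    archive = Path(workdir) / "demo.tgz"
    payload = b"hello"
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo(name="demo.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    dest = Path(workdir) / "unpacked"
    dest.mkdir()
    extracted = extract_archive(archive, dest)
    print(extracted)
```

With the real archive, the experiment would need to be opened and used inside the `with` block, since the directory is deleted when the block exits.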

Using the keys of the obs dataframe, we can see what fields are available to query on.

[3]:
exp.obs.keys()
[3]:
('soma_joinid', 'obs_id', 'n_genes', 'percent_mito', 'n_counts', 'louvain')
[4]:
p = exp.obs.read(column_names=['louvain']).concat().to_pandas()
p
[4]:
louvain
0 CD4 T cells
1 B cells
2 CD4 T cells
3 CD14+ Monocytes
4 NK cells
... ...
2633 CD14+ Monocytes
2634 B cells
2635 B cells
2636 B cells
2637 CD4 T cells

2638 rows × 1 columns

Focusing on the louvain column, we can now find out which values are present in the data and how many cells carry each one.

[5]:
p.groupby('louvain', observed=True).size().sort_values()
[5]:
louvain
Megakaryocytes         15
Dendritic cells        37
FCGR3A+ Monocytes     150
NK cells              154
CD8 T cells           316
B cells               342
CD14+ Monocytes       480
CD4 T cells          1144
dtype: int64

Now we can query the SOMA experiment, here selecting two cell types with a value filter on the louvain column.

[6]:
obs_query = soma.AxisQuery(value_filter='louvain in ["B cells", "NK cells"]')
[7]:
query = exp.axis_query("RNA", obs_query=obs_query)
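
When the list of cell types comes from a variable rather than being typed by hand, the filter string can be assembled programmatically. This is a minimal sketch. The helper `membership_filter` is a hypothetical name, not part of tiledbsoma, and it assumes labels that JSON-style double quoting represents faithfully (plain ASCII labels like the ones in this dataset).

```python
import json


def membership_filter(column, values):
    """Build a filter string such as: louvain in ["B cells", "NK cells"]."""
    # json.dumps adds the double quotes around each label.
    quoted = ", ".join(json.dumps(v) for v in values)
    return f"{column} in [{quoted}]"


value_filter = membership_filter("louvain", ["B cells", "NK cells"])
print(value_filter)
```

The resulting string matches the one passed to `soma.AxisQuery(value_filter=...)` above.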

Note that the query result is smaller than the original dataset, since we’ve selected only two cell types: the number of observations drops from 2638 to 496, while the number of variables stays at 1838.

[8]:
[exp.obs.count, exp.ms["RNA"].var.count]
[8]:
[2638, 1838]
[9]:
[query.n_obs, query.n_vars]
[9]:
[496, 1838]
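
As a sanity check, the number of observations in the query should equal the combined size of the selected louvain groups. The sketch below redoes that arithmetic with the group sizes copied by hand from the counts printed earlier. The dictionary is a transcription for illustration, not something read from the experiment.

```python
# Group sizes transcribed from the louvain counts shown earlier in this notebook.
louvain_counts = {
    "Megakaryocytes": 15,
    "Dendritic cells": 37,
    "FCGR3A+ Monocytes": 150,
    "NK cells": 154,
    "CD8 T cells": 316,
    "B cells": 342,
    "CD14+ Monocytes": 480,
    "CD4 T cells": 1144,
}

selected = ["B cells", "NK cells"]
expected_n_obs = sum(louvain_counts[name] for name in selected)
total_obs = sum(louvain_counts.values())

print(expected_n_obs)  # 342 + 154 = 496, matching query.n_obs
print(total_obs)       # 2638, matching exp.obs.count
```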

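Next, we can take a look at the X data. It is returned in coordinate form: soma_dim_0 and soma_dim_1 hold the obs and var soma_joinid values, and soma_data holds the matrix value.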

[10]:
query.X("data").tables().concat().to_pandas()
[10]:
soma_dim_0 soma_dim_1 soma_data
0 1 0 -0.214582
1 1 1 -0.372653
2 1 2 -0.054804
3 1 3 -0.683391
4 1 4 0.633951
... ... ... ...
911643 2636 1833 -0.149789
911644 2636 1834 -0.325824
911645 2636 1835 -0.005918
911646 2636 1836 -0.135213
911647 2636 1837 -0.482111

911648 rows × 3 columns
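
The soma_dim_0 values are join IDs from the full experiment (here running up to 2636), not positions within the 496-cell result. To place the values in a compact matrix, the join IDs have to be mapped to consecutive row positions. The following is a minimal pure-Python sketch of that re-indexing on toy triples. The helper `reindex_rows` and the sample values are illustrative assumptions, and sorting by join ID is just one possible ordering.

```python
def reindex_rows(triples):
    """Map the first coordinate of (row_id, col, value) triples to 0..n-1."""
    row_ids = sorted({row for row, _, _ in triples})
    position = {row_id: i for i, row_id in enumerate(row_ids)}
    compact = [(position[row], col, value) for row, col, value in triples]
    return compact, row_ids


# Toy triples mimicking the soma_dim_0 / soma_dim_1 / soma_data columns.
triples = [(1, 0, -0.21), (1, 1, -0.37), (10, 0, 0.63), (2636, 1, -0.48)]
compact, row_ids = reindex_rows(triples)
print(compact)
print(row_ids)

# Size check for the table above: one stored value per (cell, gene) pair.
n_values = 496 * 1838
print(n_values)
```

The size check shows that the 911648 rows correspond to every (cell, gene) pair of the 496 × 1838 result, so this layer has a stored value for each position.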

To finish out this introductory look at the experiment-query API, we can convert our query outputs to AnnData format.

[11]:
adata = query.to_anndata(X_name="data")
[12]:
adata
[12]:
AnnData object with n_obs × n_vars = 496 × 1838
    obs: 'soma_joinid', 'obs_id', 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    var: 'soma_joinid', 'var_id', 'n_cells'