Tutorial: SOMA Experiment queries
[1]:
import tiledbsoma as soma
In this notebook, we’ll take a quick look at the SOMA experiment-query API. The dataset consists of Peripheral Blood Mononuclear Cells (PBMC) and is freely available from 10x Genomics.
First we’ll unpack and open the experiment:
[2]:
import tarfile
import tempfile
# mkdtemp() creates the directory safely; mktemp() is deprecated and race-prone
sparse_uri = tempfile.mkdtemp()
with tarfile.open("data/pbmc3k-sparse.tgz") as handle:
    handle.extractall(sparse_uri)
exp = soma.Experiment.open(sparse_uri)
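The unpacking step can also be written so that the temporary directory is removed automatically once it is no longer needed. The following is a minimal sketch using only the standard library. The helper names `make_demo_archive` and `extract_to_temp` are hypothetical, and the tiny in-memory archive stands in for the PBMC tarball so that the example runs without the data file.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def make_demo_archive() -> bytes:
    """Build a tiny in-memory .tgz (a stand-in for the real PBMC archive)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        payload = b"hello"
        info = tarfile.TarInfo(name="demo/readme.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def extract_to_temp(archive_bytes: bytes, dest: str) -> list:
    """Extract the archive into dest and list the extracted files relative to dest."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        tar.extractall(dest)
    root = Path(dest)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


# The directory and its contents are deleted when the with-block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    extracted = extract_to_temp(make_demo_archive(), tmp_dir)
    print(extracted)
```

With the real dataset, the experiment would have to be opened and used inside the `with` block, since the files disappear afterwards.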
Using the keys of the obs dataframe, we can see what fields are available to query on.
[3]:
exp.obs.keys()
[3]:
('soma_joinid', 'obs_id', 'n_genes', 'percent_mito', 'n_counts', 'louvain')
[4]:
p = exp.obs.read(column_names=['louvain']).concat().to_pandas()
p
[4]:
 | louvain |
---|---|
0 | CD4 T cells |
1 | B cells |
2 | CD4 T cells |
3 | CD14+ Monocytes |
4 | NK cells |
... | ... |
2633 | CD14+ Monocytes |
2634 | B cells |
2635 | B cells |
2636 | B cells |
2637 | CD4 T cells |
2638 rows × 1 columns
Focusing on the louvain column, we can now find out what column values are present in the data.
[5]:
p.groupby('louvain', observed=True).size().sort_values()
[5]:
louvain
Megakaryocytes 15
Dendritic cells 37
FCGR3A+ Monocytes 150
NK cells 154
CD8 T cells 316
B cells 342
CD14+ Monocytes 480
CD4 T cells 1144
dtype: int64
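When grouping on a categorical column, pandas distinguishes categories that actually occur from categories that are merely declared. The sketch below uses a hypothetical toy frame (not the PBMC data) in which "Megakaryocytes" is a declared category with no rows, to show how the `observed` argument changes the result.

```python
import pandas as pd

# Hypothetical toy labels; "Megakaryocytes" is declared but never occurs
labels = pd.Categorical(
    ["B cells", "NK cells", "B cells", "CD4 T cells"],
    categories=["B cells", "CD4 T cells", "Megakaryocytes", "NK cells"],
)
toy = pd.DataFrame({"louvain": labels})

# observed=True keeps only categories that appear in the data
observed_only = toy.groupby("louvain", observed=True).size()

# observed=False also reports declared-but-empty categories, with a count of 0
all_categories = toy.groupby("louvain", observed=False).size()

print(observed_only.to_dict())
print(all_categories.to_dict())
```

In the PBMC output above every cluster has at least one cell, so both settings would give the same counts there.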
Now we can query the SOMA experiment, here selecting two cell types: B cells and NK cells.
[6]:
obs_query = soma.AxisQuery(value_filter='louvain in ["B cells", "NK cells"]')
[7]:
query = exp.axis_query("RNA", obs_query=obs_query)
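As a rough mental model of what the `value_filter` selects (this is not how TileDB-SOMA evaluates filters internally), the same condition can be expressed as a pandas `isin` mask. The toy frame below copies the first five `louvain` values from the output above. Treating the row position as the `soma_joinid` is an assumption made for illustration.

```python
import pandas as pd

# First five louvain labels from the obs output above;
# soma_joinid equal to the row position is an assumption
obs = pd.DataFrame(
    {
        "soma_joinid": [0, 1, 2, 3, 4],
        "louvain": [
            "CD4 T cells",
            "B cells",
            "CD4 T cells",
            "CD14+ Monocytes",
            "NK cells",
        ],
    }
)

# Equivalent of: louvain in ["B cells", "NK cells"]
mask = obs["louvain"].isin(["B cells", "NK cells"])
selected_joinids = obs.loc[mask, "soma_joinid"].tolist()

print(selected_joinids)  # [1, 4]
```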
Note that the query result is smaller than the original dataset because we selected only two cell types. The query keeps 496 of the 2638 cells (342 B cells plus 154 NK cells), while all 1838 genes remain.
[8]:
[exp.obs.count, exp.ms["RNA"].var.count]
[8]:
[2638, 1838]
[9]:
[query.n_obs, query.n_vars]
[9]:
[496, 1838]
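The reported `n_obs` can be cross-checked against the cluster sizes printed earlier. The short sketch below copies those counts into a plain dictionary (the variable names are illustrative) and sums the two selected clusters.

```python
# Cell counts per louvain cluster, copied from the groupby output above
cluster_sizes = {
    "Megakaryocytes": 15,
    "Dendritic cells": 37,
    "FCGR3A+ Monocytes": 150,
    "NK cells": 154,
    "CD8 T cells": 316,
    "B cells": 342,
    "CD14+ Monocytes": 480,
    "CD4 T cells": 1144,
}

selected = ["B cells", "NK cells"]

# Cells expected in the query result, and cells in the whole experiment
expected_n_obs = sum(cluster_sizes[name] for name in selected)
total_n_obs = sum(cluster_sizes.values())

print(expected_n_obs, total_n_obs)  # 496 2638
```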
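Here we can take a look at the X data. Each row of the result is one stored matrix entry: soma_dim_0 is the cell's soma_joinid, soma_dim_1 is the gene's soma_joinid, and soma_data is the value. The table has 911648 rows, which equals 496 × 1838, so this layer stores a value for every selected cell and gene.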
[10]:
query.X("data").tables().concat().to_pandas()
[10]:
 | soma_dim_0 | soma_dim_1 | soma_data |
---|---|---|---|
0 | 1 | 0 | -0.214582 |
1 | 1 | 1 | -0.372653 |
2 | 1 | 2 | -0.054804 |
3 | 1 | 3 | -0.683391 |
4 | 1 | 4 | 0.633951 |
... | ... | ... | ... |
911643 | 2636 | 1833 | -0.149789 |
911644 | 2636 | 1834 | -0.325824 |
911645 | 2636 | 1835 | -0.005918 |
911646 | 2636 | 1836 | -0.135213 |
911647 | 2636 | 1837 | -0.482111 |
911648 rows × 3 columns
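Because soma_dim_0 holds join IDs from the full experiment (values such as 1 and 2636) rather than positions 0 to 495, building a matrix for the query result requires remapping those IDs to row positions. The following is a minimal sketch of that remapping with pandas and NumPy. The triples and join-ID lists are hypothetical toy values, not taken from the PBMC data.

```python
import numpy as np
import pandas as pd

# Hypothetical COO triples shaped like the X output above
coo = pd.DataFrame(
    {
        "soma_dim_0": [1, 1, 4, 4],
        "soma_dim_1": [0, 2, 0, 2],
        "soma_data": [-0.2, 0.5, 1.0, -1.5],
    }
)

# Join IDs selected by the (toy) query, in result order
obs_joinids = pd.Index([1, 4])
var_joinids = pd.Index([0, 1, 2])

# Translate join IDs into 0-based row and column positions
rows = obs_joinids.get_indexer(coo["soma_dim_0"])
cols = var_joinids.get_indexer(coo["soma_dim_1"])

# Fill a dense matrix; entries absent from the triples stay at 0
dense = np.zeros((len(obs_joinids), len(var_joinids)))
dense[rows, cols] = coo["soma_data"].to_numpy()

print(dense)
```

For real data of this size, a sparse matrix would usually be a better target than a dense array.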
To finish out this introductory look at the experiment-query API, we can convert our query outputs to AnnData format.
[11]:
adata = query.to_anndata(X_name="data")
[12]:
adata
[12]:
AnnData object with n_obs × n_vars = 496 × 1838
obs: 'soma_joinid', 'obs_id', 'n_genes', 'percent_mito', 'n_counts', 'louvain'
var: 'soma_joinid', 'var_id', 'n_cells'