Tutorial: TileDB-SOMA append mode¶
As of TileDB-SOMA 1.5.0, we’re excited to offer support for append mode.
As of TileDB-SOMA 1.15.0, we’re proud to offer a shape
feature for dataframes and arrays within experimentts which more closely matches user expectations.
Use-cases include ingesting H5AD/AnnData from multiple sequencing runs over time, accumulating the data over time, into millions of cells.
First, we’ll do the usual package imports:
[3]:
import scanpy as sc
import tiledbsoma
import tiledbsoma.io
import tiledbsoma.logging
tiledbsoma.show_package_versions()
tiledbsoma.__version__ 1.16.0
TileDB core version (libtiledbsoma) 2.27.2
python version 3.11.10.final.0
OS version Darwin 24.3.0
Next we’ll set up where our data are going:
[4]:
import datetime
stamp = datetime.datetime.today().strftime("%Y%m%d-%H%M%S")
experiment_uri = f"/tmp/append-example-{stamp}"
experiment_uri
[4]:
'/tmp/append-example-20250307-175702'
For this demo, we’re writing to /tmp
, but URIs like the following allow storing data on TileDB Cloud, cloud storage such as S3, or instance-local NVME:
/var/data/mysoma1
s3://mybucket/mysoma2
tiledb://mynamespace/s3://mybucket/mysoma3
Everything in this notebook below this URI-selection cell is agnostic to the storage backend.
Create the initial SOMA Experiment¶
Next we’ll prep some input data. To make things easy for this self-contained demo, we’ll use Scanpy’s pbmc3k
, with a custom column.
[5]:
ad1 = sc.datasets.pbmc3k()
sc.pp.calculate_qc_metrics(ad1, inplace=True)
ad1.obs["when"] = ["Monday"] * len(ad1.obs)
Now we’re ready to ingest the data into a SOMA experiment. Since SOMA is multimodal, we’ll specify the destination modality, or measurement name, to be “RNA”.
[6]:
measurement_name = "RNA"
[7]:
tiledbsoma.logging.info()
tiledbsoma.io.from_anndata(experiment_uri, ad1, measurement_name=measurement_name)
Registration: registering isolated AnnData object.
Wrote /tmp/append-example-20250307-175702/obs
Wrote /tmp/append-example-20250307-175702/ms/RNA/var
Writing /tmp/append-example-20250307-175702/ms/RNA/X/data
Wrote /tmp/append-example-20250307-175702/ms/RNA/X/data
Wrote /tmp/append-example-20250307-175702
[7]:
'/tmp/append-example-20250307-175702'
Now let’s read back the data. We’ll take a look at obs
, var
, and X
.
obs: For this initial ingest, there are obs IDs ending in -1
, the when
is Monday
, and there are 2700 rows. Also note that since TileDB is a columnar database, when we select certain columns, those are the only ones loaded from disk. This positively impacts performance at cloud scale: you get what you asked for, without needing to wait for what you didn’t ask for.
[8]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
print(
exp.obs.read(column_names=["obs_id", "n_genes_by_counts", "when"])
.concat()
.to_pandas()
)
obs_id n_genes_by_counts when
0 AAACATACAACCAC-1 781 Monday
1 AAACATTGAGCTAC-1 1352 Monday
2 AAACATTGATCAGC-1 1131 Monday
3 AAACCGTGCTTCCG-1 960 Monday
4 AAACCGTGTATGCG-1 522 Monday
... ... ... ...
2695 TTTCGAACTCTCAT-1 1155 Monday
2696 TTTCTACTGAGGCA-1 1227 Monday
2697 TTTCTACTTCCTCG-1 622 Monday
2698 TTTGCATGAGAGGC-1 454 Monday
2699 TTTGCATGCCTCAC-1 724 Monday
[2700 rows x 3 columns]
var: Let’s also look at var
, selecting out the join IDs (which index columns of X
) as well as the Ensembl-format and NCBI-format gene IDs:
[9]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
print(
exp.ms["RNA"]
.var.read(column_names=["soma_joinid", "var_id", "gene_ids"])
.concat()
.to_pandas()
)
soma_joinid var_id gene_ids
0 0 MIR1302-10 ENSG00000243485
1 1 FAM138A ENSG00000237613
2 2 OR4F5 ENSG00000186092
3 3 RP11-34P13.7 ENSG00000238009
4 4 RP11-34P13.8 ENSG00000239945
... ... ... ...
32733 32733 AC145205.1 ENSG00000215635
32734 32734 BAGE5 ENSG00000268590
32735 32735 CU459201.1 ENSG00000251180
32736 32736 AC002321.2 ENSG00000215616
32737 32737 AC002321.1 ENSG00000215611
[32738 rows x 3 columns]
X: Lastly let’s look at the expression matrix, in COO format. (You can convert to other formats if you like.) Its rows and columns are indexed by the soma_joinid
of the obs
and var
dataframes, respectively.
[10]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
X = exp.ms["RNA"].X["data"]
print(X.read().tables().concat().to_pandas())
print()
print(X.shape)
soma_dim_0 soma_dim_1 soma_data
0 0 70 1.0
1 0 166 1.0
2 0 178 2.0
3 0 326 1.0
4 0 363 1.0
... ... ... ...
2286879 2699 32697 1.0
2286880 2699 32698 7.0
2286881 2699 32702 1.0
2286882 2699 32705 1.0
2286883 2699 32708 3.0
[2286884 rows x 3 columns]
(2700, 32738)
While you can ask all dataframes and arrays in the experiment for their .domain
or .shape
, respectively, one at a time, there’s also the handy show_experiment_shape
which traverses the experiment for you.
The dataframe domains and array shapes are soft limits on what values can be read from, or written to, them.
[11]:
tiledbsoma.io.show_experiment_shapes(experiment_uri)
[DataFrame] obs
URI file:///tmp/append-example-20250307-175702/obs
count 2700
non_empty_domain ((0, 2699),)
domain ((0, 2699),)
maxdomain ((0, 9223372036854773758),)
upgraded True
[DataFrame] ms/RNA/var
URI file:///tmp/append-example-20250307-175702/ms/RNA/var
count 32738
non_empty_domain ((0, 32737),)
domain ((0, 32737),)
maxdomain ((0, 9223372036854773758),)
upgraded True
[SparseNDArray] ms/RNA/X/data
URI file:///tmp/append-example-20250307-175702/ms/RNA/X/data
non_empty_domain ((0, 2699), (5, 32732))
shape (2700, 32738)
maxshape (9223372036854773759, 9223372036854773759)
upgraded True
[11]:
True
Appending a new dataset to the SOMA Experiment¶
Now, let’s simulate another day’s sequencing run. For simplicity of this demo notebook, we’ll mutate the previous dataset, changing the obs IDs to have a -2
suffix, and also putting Tuesday
in the when
column. Also, we’ll multiply the X
values by 10.
[12]:
ad2 = ad1.copy()
ad2.obs.index = [e.replace("-1", "-2") for e in ad1.obs.index]
ad2.obs["when"] = ["Tuesday"] * len(ad2.obs)
[13]:
ad2.X *= 10
Now we simply ingest as before – the only additional step is a black-box registration step which detects which cell IDs are new (here, all of them) and which gene IDs are new (here, none of them).
The registration takes two forms, either of which you can use depending on your use-case: tiledbsoma.io.register_anndatas
for in-memory AnnData objects, or tiledbsoma.io.register_h5ads
for on-storage AnnData objects.
[14]:
rd = tiledbsoma.io.register_anndatas(
experiment_uri,
[ad2],
measurement_name=measurement_name,
obs_field_name="obs_id",
var_field_name="var_id",
)
Registration: starting with experiment /tmp/append-example-20250307-175702
Registration: found nobs=2700 nvar=32738 from experiment.
Registration: registering AnnData object.
Registration: accumulated to nobs=5400 nvar=32738.
Registration: complete.
As described on in the tutorial on the TileDB-SOMA shape feature, the domain
of dataframes and the shape
of N-dimensional arrays are soft limits on what values can be read from or written to. In order to ingest more data, we’ll need to increase those soft limits.
First let’s look at what they currently are:
[15]:
tiledbsoma.io.show_experiment_shapes(exp.uri)
[DataFrame] obs
URI file:///tmp/append-example-20250307-175702/obs
count 2700
non_empty_domain ((0, 2699),)
domain ((0, 2699),)
maxdomain ((0, 9223372036854773758),)
upgraded True
[DataFrame] ms/RNA/var
URI file:///tmp/append-example-20250307-175702/ms/RNA/var
count 32738
non_empty_domain ((0, 32737),)
domain ((0, 32737),)
maxdomain ((0, 9223372036854773758),)
upgraded True
[SparseNDArray] ms/RNA/X/data
URI file:///tmp/append-example-20250307-175702/ms/RNA/X/data
non_empty_domain ((0, 2699), (5, 32732))
shape (2700, 32738)
maxshape (9223372036854773759, 9223372036854773759)
upgraded True
[15]:
True
Then we apply the resize, and look at the domains and shapes again:
[16]:
tiledbsoma.io.resize_experiment(exp.uri, nobs=rd.get_obs_shape(), nvars=rd.get_var_shapes())
[16]:
True
[17]:
tiledbsoma.io.show_experiment_shapes(exp.uri)
[DataFrame] obs
URI file:///tmp/append-example-20250307-175702/obs
count 2700
non_empty_domain ((0, 2699),)
domain ((0, 5399),)
maxdomain ((0, 9223372036854773758),)
upgraded True
[DataFrame] ms/RNA/var
URI file:///tmp/append-example-20250307-175702/ms/RNA/var
count 32738
non_empty_domain ((0, 32737),)
domain ((0, 32737),)
maxdomain ((0, 9223372036854773758),)
upgraded True
[SparseNDArray] ms/RNA/X/data
URI file:///tmp/append-example-20250307-175702/ms/RNA/X/data
non_empty_domain ((0, 2699), (5, 32732))
shape (5400, 32738)
maxshape (9223372036854773759, 9223372036854773759)
upgraded True
[17]:
True
Now we can ingest the new data:
[18]:
tiledbsoma.io.from_anndata(
experiment_uri,
ad2,
measurement_name=measurement_name,
registration_mapping=rd,
)
Wrote /tmp/append-example-20250307-175702/obs
Wrote /tmp/append-example-20250307-175702/ms/RNA/var
Writing /tmp/append-example-20250307-175702/ms/RNA/X/data
Wrote /tmp/append-example-20250307-175702/ms/RNA/X/data
Wrote file:///tmp/append-example-20250307-175702
[18]:
'file:///tmp/append-example-20250307-175702'
Now let’s read back the appended data. There are now obs IDs ending in -1
as well as -2
, the when
includes Monday
as well as Tuesday
, and there are 5400 rows.
(For Wednesday
and onward, it’ll simply be the same pattern – we can grow our data iteratively over time, to arbitrary sizes.)
[19]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
print(
exp.obs.read(column_names=["obs_id", "n_genes_by_counts", "when"])
.concat()
.to_pandas()
)
obs_id n_genes_by_counts when
0 AAACATACAACCAC-1 781 Monday
1 AAACATTGAGCTAC-1 1352 Monday
2 AAACATTGATCAGC-1 1131 Monday
3 AAACCGTGCTTCCG-1 960 Monday
4 AAACCGTGTATGCG-1 522 Monday
... ... ... ...
5395 TTTCGAACTCTCAT-2 1155 Tuesday
5396 TTTCTACTGAGGCA-2 1227 Tuesday
5397 TTTCTACTTCCTCG-2 622 Tuesday
5398 TTTGCATGAGAGGC-2 454 Tuesday
5399 TTTGCATGCCTCAC-2 724 Tuesday
[5400 rows x 3 columns]
Let’s also look at var
, as before. Since we had data for more cells but for the same genes, there is nothing new here. The obs
table grew downward with the new cells, and X
grew downward with new rows, but var
stayed the same.
In real-world data, occasionally you will see a gene expressed in subsequent data which wasn’t expressed in the initial data. That’s fine – you’ll simply see var
grow just a bit for those newly encountered gene IDs, with corresponding new columns for X
.
[20]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
print(
exp.ms["RNA"]
.var.read(column_names=["soma_joinid", "var_id", "gene_ids"])
.concat()
.to_pandas()
)
soma_joinid var_id gene_ids
0 0 MIR1302-10 ENSG00000243485
1 1 FAM138A ENSG00000237613
2 2 OR4F5 ENSG00000186092
3 3 RP11-34P13.7 ENSG00000238009
4 4 RP11-34P13.8 ENSG00000239945
... ... ... ...
32733 32733 AC145205.1 ENSG00000215635
32734 32734 BAGE5 ENSG00000268590
32735 32735 CU459201.1 ENSG00000251180
32736 32736 AC002321.2 ENSG00000215616
32737 32737 AC002321.1 ENSG00000215611
[32738 rows x 3 columns]
And lastly, the X
expression matrix which has grown downward with the new cells, while keeping the same width as we didn’t introduce new genes:
[21]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
X = exp.ms["RNA"].X["data"]
print(X.read().tables().concat().to_pandas())
print()
print(X.shape)
soma_dim_0 soma_dim_1 soma_data
0 0 70 1.0
1 0 166 1.0
2 0 178 2.0
3 0 326 1.0
4 0 363 1.0
... ... ... ...
4573763 5399 32697 10.0
4573764 5399 32698 70.0
4573765 5399 32702 10.0
4573766 5399 32705 10.0
4573767 5399 32708 30.0
[4573768 rows x 3 columns]
(5400, 32738)
Ingesting multiple datasets to a SOMA Experiment¶
Finally, we’ll demonstrate combining multiple AnnDatas into one new experiment.
The flow is pretty similar to the above:
One call to
register_anndatas
orregister_h5ads
(passing all input AnnDatas/h5ads)One call to
from_anndata
/from_h5ad
for each input AnnData
Here’s a helper function to simulate multiple lab runs. As above, where we used pbmc3k
to simulate Monday and Tuesday data, here we use pbmc3k
to simulate multiple AnnData objects.
[22]:
def make_ad(when, scale, obs_id_suffix):
ad = ad1.copy()
ad.obs.index = [e.replace("-1", obs_id_suffix) for e in ad.obs.index]
ad.obs["when"] = [when] * len(ad.obs)
ad.X *= scale
return ad
ads = [
make_ad(when, scale, f"-{idx + 3}")
for idx, (when, scale)
in enumerate({
"Wednesday": 20,
"Thursday": 30,
"Friday": 40,
}.items())
]
We’ll ingest these AnnData objects, as before, but this time to a fresh/empty /tmp
location:
[23]:
stamp = datetime.datetime.today().strftime("%Y%m%d-%H%M%S")
exp = None
experiment_uri = f"/tmp/append-example-{stamp}"
experiment_uri
[23]:
'/tmp/append-example-20250307-175704'
Here we’ll register all the AnnData objects. Note that the SOMA Experiment doesn’t exist yet, so we pass experiment_uri=None
to signify that.
[24]:
rd2 = tiledbsoma.io.register_anndatas(
experiment_uri=None, # new Experiment, from scratch
adatas=ads,
measurement_name=measurement_name,
obs_field_name="obs_id",
var_field_name="var_id",
)
Registration: registering AnnData object.
Registration: accumulated to nobs=2700 nvar=32738.
Registration: registering AnnData object.
Registration: accumulated to nobs=5400 nvar=32738.
Registration: registering AnnData object.
Registration: accumulated to nobs=8100 nvar=32738.
Registration: complete.
Now that we’ve gotten the registrations for all the input AnnData objects, we can ingest them.
Note:
Here we ingest them sequentially, in the same order as above.
But we could also ingest them in any shuffled order.
Or we could have multiple workers in ingest them in parallel, one worker per AnnData object, as long as the registration data are passed to each worker.
[25]:
for ad in ads:
if tiledbsoma.Experiment.exists(experiment_uri):
tiledbsoma.io.resize_experiment(
experiment_uri,
nobs=rd2.get_obs_shape(),
nvars=rd2.get_var_shapes()
)
tiledbsoma.io.from_anndata(
experiment_uri,
ad,
measurement_name=measurement_name,
registration_mapping=rd2,
)
Wrote /tmp/append-example-20250307-175704/obs
Wrote /tmp/append-example-20250307-175704/ms/RNA/var
Writing /tmp/append-example-20250307-175704/ms/RNA/X/data
Wrote /tmp/append-example-20250307-175704/ms/RNA/X/data
Wrote /tmp/append-example-20250307-175704
Wrote /tmp/append-example-20250307-175704/obs
Wrote /tmp/append-example-20250307-175704/ms/RNA/var
Writing /tmp/append-example-20250307-175704/ms/RNA/X/data
Wrote /tmp/append-example-20250307-175704/ms/RNA/X/data
Wrote file:///tmp/append-example-20250307-175704
Wrote /tmp/append-example-20250307-175704/obs
Wrote /tmp/append-example-20250307-175704/ms/RNA/var
Writing /tmp/append-example-20250307-175704/ms/RNA/X/data
Wrote /tmp/append-example-20250307-175704/ms/RNA/X/data
Wrote file:///tmp/append-example-20250307-175704
Reading back the concatenated data, we see 2700 rows for each of {-3
, -4
, -5
}:
[26]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
print(
exp.obs.read(column_names=["obs_id", "n_genes_by_counts", "when"])
.concat()
.to_pandas()
)
obs_id n_genes_by_counts when
0 AAACATACAACCAC-3 781 Wednesday
1 AAACATTGAGCTAC-3 1352 Wednesday
2 AAACATTGATCAGC-3 1131 Wednesday
3 AAACCGTGCTTCCG-3 960 Wednesday
4 AAACCGTGTATGCG-3 522 Wednesday
... ... ... ...
8095 TTTCGAACTCTCAT-5 1155 Friday
8096 TTTCTACTGAGGCA-5 1227 Friday
8097 TTTCTACTTCCTCG-5 622 Friday
8098 TTTGCATGAGAGGC-5 454 Friday
8099 TTTGCATGCCTCAC-5 724 Friday
[8100 rows x 3 columns]
var
is the same as in the single original Anndata objects (since we added more cells with all the same genes):
[27]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
print(
exp.ms["RNA"]
.var.read(column_names=["soma_joinid", "var_id", "gene_ids"])
.concat()
.to_pandas()
)
soma_joinid var_id gene_ids
0 0 MIR1302-10 ENSG00000243485
1 1 FAM138A ENSG00000237613
2 2 OR4F5 ENSG00000186092
3 3 RP11-34P13.7 ENSG00000238009
4 4 RP11-34P13.8 ENSG00000239945
... ... ... ...
32733 32733 AC145205.1 ENSG00000215635
32734 32734 BAGE5 ENSG00000268590
32735 32735 CU459201.1 ENSG00000251180
32736 32736 AC002321.2 ENSG00000215616
32737 32737 AC002321.1 ENSG00000215611
[32738 rows x 3 columns]
Finally, the X
expression matrix contains 3x the entries as the original, but is also the same width (since we didn’t introduce new genes):
[28]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
X = exp.ms["RNA"].X["data"]
print(X.read().tables().concat().to_pandas())
print()
print(X.shape)
soma_dim_0 soma_dim_1 soma_data
0 0 70 20.0
1 0 166 20.0
2 0 178 40.0
3 0 326 20.0
4 0 363 20.0
... ... ... ...
6860647 8099 32697 40.0
6860648 8099 32698 280.0
6860649 8099 32702 40.0
6860650 8099 32705 40.0
6860651 8099 32708 120.0
[6860652 rows x 3 columns]
(8100, 32738)