Tutorial: TileDB-SOMA append mode

As of TileDB-SOMA 1.5.0, we’re excited to offer support for append mode.

As of TileDB-SOMA 1.15.0, we’re proud to offer a shape feature for dataframes and arrays within experimentts which more closely matches user expectations.

Use-cases include ingesting H5AD/AnnData from multiple sequencing runs over time, accumulating the data over time, into millions of cells.

First, we’ll do the usual package imports:

[3]:
import scanpy as sc

import tiledbsoma
import tiledbsoma.io
import tiledbsoma.logging

tiledbsoma.show_package_versions()
tiledbsoma.__version__              1.16.0
TileDB core version (libtiledbsoma) 2.27.2
python version                      3.11.10.final.0
OS version                          Darwin 24.3.0

Next we’ll set up where our data are going:

[4]:
import datetime

stamp = datetime.datetime.today().strftime("%Y%m%d-%H%M%S")
experiment_uri = f"/tmp/append-example-{stamp}"
experiment_uri
[4]:
'/tmp/append-example-20250307-175702'

For this demo, we’re writing to /tmp, but URIs like the following allow storing data on TileDB Cloud, cloud storage such as S3, or instance-local NVME:

  • /var/data/mysoma1

  • s3://mybucket/mysoma2

  • tiledb://mynamespace/s3://mybucket/mysoma3

Everything in this notebook below this URI-selection cell is agnostic to the storage backend.

Create the initial SOMA Experiment

Next we’ll prep some input data. To make things easy for this self-contained demo, we’ll use Scanpy’s pbmc3k, with a custom column.

[5]:
ad1 = sc.datasets.pbmc3k()
sc.pp.calculate_qc_metrics(ad1, inplace=True)
ad1.obs["when"] = ["Monday"] * len(ad1.obs)

Now we’re ready to ingest the data into a SOMA experiment. Since SOMA is multimodal, we’ll specify the destination modality, or measurement name, to be “RNA”.

[6]:
measurement_name = "RNA"
[7]:
tiledbsoma.logging.info()
tiledbsoma.io.from_anndata(experiment_uri, ad1, measurement_name=measurement_name)
Registration: registering isolated AnnData object.
Wrote   /tmp/append-example-20250307-175702/obs
Wrote   /tmp/append-example-20250307-175702/ms/RNA/var
Writing /tmp/append-example-20250307-175702/ms/RNA/X/data
Wrote   /tmp/append-example-20250307-175702/ms/RNA/X/data
Wrote   /tmp/append-example-20250307-175702
[7]:
'/tmp/append-example-20250307-175702'

Now let’s read back the data. We’ll take a look at obs, var, and X.

obs: For this initial ingest, there are obs IDs ending in -1, the when is Monday, and there are 2700 rows. Also note that since TileDB is a columnar database, when we select certain columns, those are the only ones loaded from disk. This positively impacts performance at cloud scale: you get what you asked for, without needing to wait for what you didn’t ask for.

[8]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    print(
        exp.obs.read(column_names=["obs_id", "n_genes_by_counts", "when"])
        .concat()
        .to_pandas()
    )
                obs_id  n_genes_by_counts    when
0     AAACATACAACCAC-1                781  Monday
1     AAACATTGAGCTAC-1               1352  Monday
2     AAACATTGATCAGC-1               1131  Monday
3     AAACCGTGCTTCCG-1                960  Monday
4     AAACCGTGTATGCG-1                522  Monday
...                ...                ...     ...
2695  TTTCGAACTCTCAT-1               1155  Monday
2696  TTTCTACTGAGGCA-1               1227  Monday
2697  TTTCTACTTCCTCG-1                622  Monday
2698  TTTGCATGAGAGGC-1                454  Monday
2699  TTTGCATGCCTCAC-1                724  Monday

[2700 rows x 3 columns]

var: Let’s also look at var, selecting out the join IDs (which index columns of X) as well as the Ensembl-format and NCBI-format gene IDs:

[9]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    print(
        exp.ms["RNA"]
        .var.read(column_names=["soma_joinid", "var_id", "gene_ids"])
        .concat()
        .to_pandas()
    )
       soma_joinid        var_id         gene_ids
0                0    MIR1302-10  ENSG00000243485
1                1       FAM138A  ENSG00000237613
2                2         OR4F5  ENSG00000186092
3                3  RP11-34P13.7  ENSG00000238009
4                4  RP11-34P13.8  ENSG00000239945
...            ...           ...              ...
32733        32733    AC145205.1  ENSG00000215635
32734        32734         BAGE5  ENSG00000268590
32735        32735    CU459201.1  ENSG00000251180
32736        32736    AC002321.2  ENSG00000215616
32737        32737    AC002321.1  ENSG00000215611

[32738 rows x 3 columns]

X: Lastly let’s look at the expression matrix, in COO format. (You can convert to other formats if you like.) Its rows and columns are indexed by the soma_joinid of the obs and var dataframes, respectively.

[10]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    X = exp.ms["RNA"].X["data"]
    print(X.read().tables().concat().to_pandas())
    print()
    print(X.shape)
         soma_dim_0  soma_dim_1  soma_data
0                 0          70        1.0
1                 0         166        1.0
2                 0         178        2.0
3                 0         326        1.0
4                 0         363        1.0
...             ...         ...        ...
2286879        2699       32697        1.0
2286880        2699       32698        7.0
2286881        2699       32702        1.0
2286882        2699       32705        1.0
2286883        2699       32708        3.0

[2286884 rows x 3 columns]

(2700, 32738)

While you can ask all dataframes and arrays in the experiment for their .domain or .shape, respectively, one at a time, there’s also the handy show_experiment_shape which traverses the experiment for you.

The dataframe domains and array shapes are soft limits on what values can be read from, or written to, them.

[11]:
tiledbsoma.io.show_experiment_shapes(experiment_uri)

[DataFrame] obs
  URI file:///tmp/append-example-20250307-175702/obs
  count                2700
  non_empty_domain     ((0, 2699),)
  domain               ((0, 2699),)
  maxdomain            ((0, 9223372036854773758),)
  upgraded             True

[DataFrame] ms/RNA/var
  URI file:///tmp/append-example-20250307-175702/ms/RNA/var
  count                32738
  non_empty_domain     ((0, 32737),)
  domain               ((0, 32737),)
  maxdomain            ((0, 9223372036854773758),)
  upgraded             True

[SparseNDArray] ms/RNA/X/data
  URI file:///tmp/append-example-20250307-175702/ms/RNA/X/data
  non_empty_domain     ((0, 2699), (5, 32732))
  shape                (2700, 32738)
  maxshape             (9223372036854773759, 9223372036854773759)
  upgraded             True
[11]:
True

Appending a new dataset to the SOMA Experiment

Now, let’s simulate another day’s sequencing run. For simplicity of this demo notebook, we’ll mutate the previous dataset, changing the obs IDs to have a -2 suffix, and also putting Tuesday in the when column. Also, we’ll multiply the X values by 10.

[12]:
ad2 = ad1.copy()
ad2.obs.index = [e.replace("-1", "-2") for e in ad1.obs.index]
ad2.obs["when"] = ["Tuesday"] * len(ad2.obs)
[13]:
ad2.X *= 10

Now we simply ingest as before – the only additional step is a black-box registration step which detects which cell IDs are new (here, all of them) and which gene IDs are new (here, none of them).

The registration takes two forms, either of which you can use depending on your use-case: tiledbsoma.io.register_anndatas for in-memory AnnData objects, or tiledbsoma.io.register_h5ads for on-storage AnnData objects.

[14]:
rd = tiledbsoma.io.register_anndatas(
    experiment_uri,
    [ad2],
    measurement_name=measurement_name,
    obs_field_name="obs_id",
    var_field_name="var_id",
)
Registration: starting with experiment /tmp/append-example-20250307-175702
Registration: found nobs=2700 nvar=32738 from experiment.
Registration: registering AnnData object.
Registration: accumulated to nobs=5400 nvar=32738.
Registration: complete.

As described on in the tutorial on the TileDB-SOMA shape feature, the domain of dataframes and the shape of N-dimensional arrays are soft limits on what values can be read from or written to. In order to ingest more data, we’ll need to increase those soft limits.

First let’s look at what they currently are:

[15]:
tiledbsoma.io.show_experiment_shapes(exp.uri)

[DataFrame] obs
  URI file:///tmp/append-example-20250307-175702/obs
  count                2700
  non_empty_domain     ((0, 2699),)
  domain               ((0, 2699),)
  maxdomain            ((0, 9223372036854773758),)
  upgraded             True

[DataFrame] ms/RNA/var
  URI file:///tmp/append-example-20250307-175702/ms/RNA/var
  count                32738
  non_empty_domain     ((0, 32737),)
  domain               ((0, 32737),)
  maxdomain            ((0, 9223372036854773758),)
  upgraded             True

[SparseNDArray] ms/RNA/X/data
  URI file:///tmp/append-example-20250307-175702/ms/RNA/X/data
  non_empty_domain     ((0, 2699), (5, 32732))
  shape                (2700, 32738)
  maxshape             (9223372036854773759, 9223372036854773759)
  upgraded             True
[15]:
True

Then we apply the resize, and look at the domains and shapes again:

[16]:
tiledbsoma.io.resize_experiment(exp.uri, nobs=rd.get_obs_shape(), nvars=rd.get_var_shapes())
[16]:
True
[17]:
tiledbsoma.io.show_experiment_shapes(exp.uri)

[DataFrame] obs
  URI file:///tmp/append-example-20250307-175702/obs
  count                2700
  non_empty_domain     ((0, 2699),)
  domain               ((0, 5399),)
  maxdomain            ((0, 9223372036854773758),)
  upgraded             True

[DataFrame] ms/RNA/var
  URI file:///tmp/append-example-20250307-175702/ms/RNA/var
  count                32738
  non_empty_domain     ((0, 32737),)
  domain               ((0, 32737),)
  maxdomain            ((0, 9223372036854773758),)
  upgraded             True

[SparseNDArray] ms/RNA/X/data
  URI file:///tmp/append-example-20250307-175702/ms/RNA/X/data
  non_empty_domain     ((0, 2699), (5, 32732))
  shape                (5400, 32738)
  maxshape             (9223372036854773759, 9223372036854773759)
  upgraded             True
[17]:
True

Now we can ingest the new data:

[18]:
tiledbsoma.io.from_anndata(
    experiment_uri,
    ad2,
    measurement_name=measurement_name,
    registration_mapping=rd,
)
Wrote   /tmp/append-example-20250307-175702/obs
Wrote   /tmp/append-example-20250307-175702/ms/RNA/var
Writing /tmp/append-example-20250307-175702/ms/RNA/X/data
Wrote   /tmp/append-example-20250307-175702/ms/RNA/X/data
Wrote   file:///tmp/append-example-20250307-175702
[18]:
'file:///tmp/append-example-20250307-175702'

Now let’s read back the appended data. There are now obs IDs ending in -1 as well as -2, the when includes Monday as well as Tuesday, and there are 5400 rows.

(For Wednesday and onward, it’ll simply be the same pattern – we can grow our data iteratively over time, to arbitrary sizes.)

[19]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    print(
        exp.obs.read(column_names=["obs_id", "n_genes_by_counts", "when"])
        .concat()
        .to_pandas()
    )
                obs_id  n_genes_by_counts     when
0     AAACATACAACCAC-1                781   Monday
1     AAACATTGAGCTAC-1               1352   Monday
2     AAACATTGATCAGC-1               1131   Monday
3     AAACCGTGCTTCCG-1                960   Monday
4     AAACCGTGTATGCG-1                522   Monday
...                ...                ...      ...
5395  TTTCGAACTCTCAT-2               1155  Tuesday
5396  TTTCTACTGAGGCA-2               1227  Tuesday
5397  TTTCTACTTCCTCG-2                622  Tuesday
5398  TTTGCATGAGAGGC-2                454  Tuesday
5399  TTTGCATGCCTCAC-2                724  Tuesday

[5400 rows x 3 columns]

Let’s also look at var, as before. Since we had data for more cells but for the same genes, there is nothing new here. The obs table grew downward with the new cells, and X grew downward with new rows, but var stayed the same.

In real-world data, occasionally you will see a gene expressed in subsequent data which wasn’t expressed in the initial data. That’s fine – you’ll simply see var grow just a bit for those newly encountered gene IDs, with corresponding new columns for X.

[20]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    print(
        exp.ms["RNA"]
        .var.read(column_names=["soma_joinid", "var_id", "gene_ids"])
        .concat()
        .to_pandas()
    )
       soma_joinid        var_id         gene_ids
0                0    MIR1302-10  ENSG00000243485
1                1       FAM138A  ENSG00000237613
2                2         OR4F5  ENSG00000186092
3                3  RP11-34P13.7  ENSG00000238009
4                4  RP11-34P13.8  ENSG00000239945
...            ...           ...              ...
32733        32733    AC145205.1  ENSG00000215635
32734        32734         BAGE5  ENSG00000268590
32735        32735    CU459201.1  ENSG00000251180
32736        32736    AC002321.2  ENSG00000215616
32737        32737    AC002321.1  ENSG00000215611

[32738 rows x 3 columns]

And lastly, the X expression matrix which has grown downward with the new cells, while keeping the same width as we didn’t introduce new genes:

[21]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    X = exp.ms["RNA"].X["data"]
    print(X.read().tables().concat().to_pandas())
    print()
    print(X.shape)
         soma_dim_0  soma_dim_1  soma_data
0                 0          70        1.0
1                 0         166        1.0
2                 0         178        2.0
3                 0         326        1.0
4                 0         363        1.0
...             ...         ...        ...
4573763        5399       32697       10.0
4573764        5399       32698       70.0
4573765        5399       32702       10.0
4573766        5399       32705       10.0
4573767        5399       32708       30.0

[4573768 rows x 3 columns]

(5400, 32738)

Ingesting multiple datasets to a SOMA Experiment

Finally, we’ll demonstrate combining multiple AnnDatas into one new experiment.

The flow is pretty similar to the above:

  1. One call to register_anndatas or register_h5ads (passing all input AnnDatas/h5ads)

  2. One call to from_anndata/from_h5ad for each input AnnData

Here’s a helper function to simulate multiple lab runs. As above, where we used pbmc3k to simulate Monday and Tuesday data, here we use pbmc3k to simulate multiple AnnData objects.

[22]:
def make_ad(when, scale, obs_id_suffix):
    ad = ad1.copy()
    ad.obs.index = [e.replace("-1", obs_id_suffix) for e in ad.obs.index]
    ad.obs["when"] = [when] * len(ad.obs)
    ad.X *= scale
    return ad

ads = [
    make_ad(when, scale, f"-{idx + 3}")
    for idx, (when, scale)
    in enumerate({
        "Wednesday": 20,
        "Thursday": 30,
        "Friday": 40,
    }.items())
]

We’ll ingest these AnnData objects, as before, but this time to a fresh/empty /tmp location:

[23]:
stamp = datetime.datetime.today().strftime("%Y%m%d-%H%M%S")
exp = None
experiment_uri = f"/tmp/append-example-{stamp}"
experiment_uri
[23]:
'/tmp/append-example-20250307-175704'

Here we’ll register all the AnnData objects. Note that the SOMA Experiment doesn’t exist yet, so we pass experiment_uri=None to signify that.

[24]:
rd2 = tiledbsoma.io.register_anndatas(
    experiment_uri=None,  # new Experiment, from scratch
    adatas=ads,
    measurement_name=measurement_name,
    obs_field_name="obs_id",
    var_field_name="var_id",
)
Registration: registering AnnData object.
Registration: accumulated to nobs=2700 nvar=32738.
Registration: registering AnnData object.
Registration: accumulated to nobs=5400 nvar=32738.
Registration: registering AnnData object.
Registration: accumulated to nobs=8100 nvar=32738.
Registration: complete.

Now that we’ve gotten the registrations for all the input AnnData objects, we can ingest them.

Note:

  • Here we ingest them sequentially, in the same order as above.

  • But we could also ingest them in any shuffled order.

  • Or we could have multiple workers in ingest them in parallel, one worker per AnnData object, as long as the registration data are passed to each worker.

[25]:
for ad in ads:
    if tiledbsoma.Experiment.exists(experiment_uri):
        tiledbsoma.io.resize_experiment(
            experiment_uri,
            nobs=rd2.get_obs_shape(),
            nvars=rd2.get_var_shapes()
        )

    tiledbsoma.io.from_anndata(
        experiment_uri,
        ad,
        measurement_name=measurement_name,
        registration_mapping=rd2,
    )
Wrote   /tmp/append-example-20250307-175704/obs
Wrote   /tmp/append-example-20250307-175704/ms/RNA/var
Writing /tmp/append-example-20250307-175704/ms/RNA/X/data
Wrote   /tmp/append-example-20250307-175704/ms/RNA/X/data
Wrote   /tmp/append-example-20250307-175704
Wrote   /tmp/append-example-20250307-175704/obs
Wrote   /tmp/append-example-20250307-175704/ms/RNA/var
Writing /tmp/append-example-20250307-175704/ms/RNA/X/data
Wrote   /tmp/append-example-20250307-175704/ms/RNA/X/data
Wrote   file:///tmp/append-example-20250307-175704
Wrote   /tmp/append-example-20250307-175704/obs
Wrote   /tmp/append-example-20250307-175704/ms/RNA/var
Writing /tmp/append-example-20250307-175704/ms/RNA/X/data
Wrote   /tmp/append-example-20250307-175704/ms/RNA/X/data
Wrote   file:///tmp/append-example-20250307-175704

Reading back the concatenated data, we see 2700 rows for each of {-3, -4, -5}:

[26]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    print(
        exp.obs.read(column_names=["obs_id", "n_genes_by_counts", "when"])
        .concat()
        .to_pandas()
    )
                obs_id  n_genes_by_counts       when
0     AAACATACAACCAC-3                781  Wednesday
1     AAACATTGAGCTAC-3               1352  Wednesday
2     AAACATTGATCAGC-3               1131  Wednesday
3     AAACCGTGCTTCCG-3                960  Wednesday
4     AAACCGTGTATGCG-3                522  Wednesday
...                ...                ...        ...
8095  TTTCGAACTCTCAT-5               1155     Friday
8096  TTTCTACTGAGGCA-5               1227     Friday
8097  TTTCTACTTCCTCG-5                622     Friday
8098  TTTGCATGAGAGGC-5                454     Friday
8099  TTTGCATGCCTCAC-5                724     Friday

[8100 rows x 3 columns]

var is the same as in the single original Anndata objects (since we added more cells with all the same genes):

[27]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    print(
        exp.ms["RNA"]
        .var.read(column_names=["soma_joinid", "var_id", "gene_ids"])
        .concat()
        .to_pandas()
    )
       soma_joinid        var_id         gene_ids
0                0    MIR1302-10  ENSG00000243485
1                1       FAM138A  ENSG00000237613
2                2         OR4F5  ENSG00000186092
3                3  RP11-34P13.7  ENSG00000238009
4                4  RP11-34P13.8  ENSG00000239945
...            ...           ...              ...
32733        32733    AC145205.1  ENSG00000215635
32734        32734         BAGE5  ENSG00000268590
32735        32735    CU459201.1  ENSG00000251180
32736        32736    AC002321.2  ENSG00000215616
32737        32737    AC002321.1  ENSG00000215611

[32738 rows x 3 columns]

Finally, the X expression matrix contains 3x the entries as the original, but is also the same width (since we didn’t introduce new genes):

[28]:
with tiledbsoma.Experiment.open(experiment_uri) as exp:
    X = exp.ms["RNA"].X["data"]
    print(X.read().tables().concat().to_pandas())
    print()
    print(X.shape)
         soma_dim_0  soma_dim_1  soma_data
0                 0          70       20.0
1                 0         166       20.0
2                 0         178       40.0
3                 0         326       20.0
4                 0         363       20.0
...             ...         ...        ...
6860647        8099       32697       40.0
6860648        8099       32698      280.0
6860649        8099       32702       40.0
6860650        8099       32705       40.0
6860651        8099       32708      120.0

[6860652 rows x 3 columns]

(8100, 32738)