{ "cells": [ { "cell_type": "markdown", "id": "36a0b22b", "metadata": { "tags": [] }, "source": [ "# Tutorial: TileDB-SOMA append mode" ] }, { "cell_type": "markdown", "id": "69de8627", "metadata": { "tags": [] }, "source": [ "As of TileDB-SOMA 1.5.0, we're excited to offer support for append mode.\n", "\n", "As of TileDB-SOMA 1.15.0, we're proud to offer a `shape` feature for dataframes and arrays within experiments, which more closely matches user expectations.\n", "\n", "Use cases include ingesting H5AD/AnnData from multiple sequencing runs, accumulating the data over time into millions of cells." ] }, { "cell_type": "markdown", "id": "2a218461", "metadata": { "tags": [] }, "source": [ "First, we'll do the usual package imports:" ] }, { "cell_type": "code", "execution_count": 3, "id": "d6b81174", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tiledbsoma.__version__ 1.16.0\n", "TileDB core version (libtiledbsoma) 2.27.2\n", "python version 3.11.10.final.0\n", "OS version Darwin 24.3.0\n" ] } ], "source": [ "import scanpy as sc\n", "\n", "import tiledbsoma\n", "import tiledbsoma.io\n", "import tiledbsoma.logging\n", "\n", "tiledbsoma.show_package_versions()" ] }, { "cell_type": "markdown", "id": "a7a65011", "metadata": { "tags": [] }, "source": [ "Next we'll set up where our data are going:" ] }, { "cell_type": "code", "execution_count": 4, "id": "24108e1c", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'/tmp/append-example-20250307-175702'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import datetime\n", "\n", "stamp = datetime.datetime.today().strftime(\"%Y%m%d-%H%M%S\")\n", "experiment_uri = f\"/tmp/append-example-{stamp}\"\n", "experiment_uri" ] }, { "cell_type": "markdown", "id": "e835c440", "metadata": { "tags": [] }, "source": [ "For this demo, we're writing to `/tmp`, but URIs like the following allow storing data on TileDB 
Cloud, cloud storage such as S3, or instance-local NVMe:\n", "\n", "- `/var/data/mysoma1`\n", "- `s3://mybucket/mysoma2`\n", "- `tiledb://mynamespace/s3://mybucket/mysoma3`\n", "\n", "Everything in this notebook below this URI-selection cell is agnostic to the storage backend." ] }, { "cell_type": "markdown", "id": "0ffee7b3", "metadata": { "tags": [] }, "source": [ "## Create the initial SOMA Experiment" ] }, { "cell_type": "markdown", "id": "bb3aa6c8", "metadata": { "tags": [] }, "source": [ "Next we'll prep some input data. To make things easy for this self-contained demo, we'll use Scanpy's `pbmc3k`, with a custom column." ] }, { "cell_type": "code", "execution_count": 5, "id": "fe0e7a46", "metadata": { "tags": [] }, "outputs": [], "source": [ "ad1 = sc.datasets.pbmc3k()\n", "sc.pp.calculate_qc_metrics(ad1, inplace=True)\n", "ad1.obs[\"when\"] = [\"Monday\"] * len(ad1.obs)" ] }, { "cell_type": "markdown", "id": "88af955c", "metadata": { "tags": [] }, "source": [ "Now we're ready to ingest the data into a SOMA experiment. Since SOMA is multimodal, we'll specify the destination modality, or measurement name, to be \"RNA\"." 
] }, { "cell_type": "code", "execution_count": 6, "id": "10cbd82b", "metadata": { "tags": [] }, "outputs": [], "source": [ "measurement_name = \"RNA\"" ] }, { "cell_type": "code", "execution_count": 7, "id": "a7c7914f", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Registration: registering isolated AnnData object.\n", "Wrote /tmp/append-example-20250307-175702/obs\n", "Wrote /tmp/append-example-20250307-175702/ms/RNA/var\n", "Writing /tmp/append-example-20250307-175702/ms/RNA/X/data\n", "Wrote /tmp/append-example-20250307-175702/ms/RNA/X/data\n", "Wrote /tmp/append-example-20250307-175702\n" ] }, { "data": { "text/plain": [ "'/tmp/append-example-20250307-175702'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tiledbsoma.logging.info()\n", "tiledbsoma.io.from_anndata(experiment_uri, ad1, measurement_name=measurement_name)" ] }, { "cell_type": "markdown", "id": "2c993ff5", "metadata": { "tags": [] }, "source": [ "Now let's read back the data. We'll take a look at `obs`, `var`, and `X`." ] }, { "cell_type": "markdown", "id": "40c6b6f0", "metadata": { "tags": [] }, "source": [ "**obs**: For this initial ingest, there are obs IDs ending in `-1`, the `when` is `Monday`, and there are 2700 rows. Also note that since TileDB is a columnar database, when we select certain columns, those are the only ones loaded from disk. This positively impacts performance at cloud scale: you get what you asked for, without needing to wait for what you didn't ask for." ] }, { "cell_type": "code", "execution_count": 8, "id": "d6ca5c9e", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " obs_id n_genes_by_counts when\n", "0 AAACATACAACCAC-1 781 Monday\n", "1 AAACATTGAGCTAC-1 1352 Monday\n", "2 AAACATTGATCAGC-1 1131 Monday\n", "3 AAACCGTGCTTCCG-1 960 Monday\n", "4 AAACCGTGTATGCG-1 522 Monday\n", "... ... ... 
...\n", "2695 TTTCGAACTCTCAT-1 1155 Monday\n", "2696 TTTCTACTGAGGCA-1 1227 Monday\n", "2697 TTTCTACTTCCTCG-1 622 Monday\n", "2698 TTTGCATGAGAGGC-1 454 Monday\n", "2699 TTTGCATGCCTCAC-1 724 Monday\n", "\n", "[2700 rows x 3 columns]\n" ] } ], "source": [ "with tiledbsoma.Experiment.open(experiment_uri) as exp:\n", " print(\n", " exp.obs.read(column_names=[\"obs_id\", \"n_genes_by_counts\", \"when\"])\n", " .concat()\n", " .to_pandas()\n", " )" ] }, { "cell_type": "markdown", "id": "b8610b39", "metadata": { "tags": [] }, "source": [ "**var**: Let's also look at `var`, selecting out the join IDs (which index columns of `X`) as well as the Ensembl-format and NCBI-format gene IDs:" ] }, { "cell_type": "code", "execution_count": 9, "id": "221c472f", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " soma_joinid var_id gene_ids\n", "0 0 MIR1302-10 ENSG00000243485\n", "1 1 FAM138A ENSG00000237613\n", "2 2 OR4F5 ENSG00000186092\n", "3 3 RP11-34P13.7 ENSG00000238009\n", "4 4 RP11-34P13.8 ENSG00000239945\n", "... ... ... ...\n", "32733 32733 AC145205.1 ENSG00000215635\n", "32734 32734 BAGE5 ENSG00000268590\n", "32735 32735 CU459201.1 ENSG00000251180\n", "32736 32736 AC002321.2 ENSG00000215616\n", "32737 32737 AC002321.1 ENSG00000215611\n", "\n", "[32738 rows x 3 columns]\n" ] } ], "source": [ "with tiledbsoma.Experiment.open(experiment_uri) as exp:\n", " print(\n", " exp.ms[\"RNA\"]\n", " .var.read(column_names=[\"soma_joinid\", \"var_id\", \"gene_ids\"])\n", " .concat()\n", " .to_pandas()\n", " )" ] }, { "cell_type": "markdown", "id": "2e74cd29", "metadata": { "tags": [] }, "source": [ "**X**: Lastly let's look at the expression matrix, in COO format. (You can convert to other formats if you like.) Its rows and columns are indexed by the `soma_joinid` of the `obs` and `var` dataframes, respectively." 
] }, { "cell_type": "code", "execution_count": 10, "id": "69ba087b-cb5f-4851-8ee6-3a4d828d70a6", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " soma_dim_0 soma_dim_1 soma_data\n", "0 0 70 1.0\n", "1 0 166 1.0\n", "2 0 178 2.0\n", "3 0 326 1.0\n", "4 0 363 1.0\n", "... ... ... ...\n", "2286879 2699 32697 1.0\n", "2286880 2699 32698 7.0\n", "2286881 2699 32702 1.0\n", "2286882 2699 32705 1.0\n", "2286883 2699 32708 3.0\n", "\n", "[2286884 rows x 3 columns]\n", "\n", "(2700, 32738)\n" ] } ], "source": [ "with tiledbsoma.Experiment.open(experiment_uri) as exp:\n", " X = exp.ms[\"RNA\"].X[\"data\"]\n", " print(X.read().tables().concat().to_pandas())\n", " print()\n", " print(X.shape)" ] }, { "cell_type": "markdown", "id": "880b17b9-dc1b-4b29-8a4e-f9a983b27785", "metadata": { "tags": [] }, "source": [ "While you can ask each dataframe and array in the experiment for its `.domain` or `.shape`, respectively, one at a time, there's also the handy `tiledbsoma.io.show_experiment_shapes`, which traverses the experiment for you.\n", "\n", "The dataframe domains and array shapes are soft limits on what values can be read from, or written to, them." 
] }, { "cell_type": "code", "execution_count": 11, "id": "851fb064-2f40-43ff-b6a5-402c194ba177", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "[DataFrame] obs \n", " URI file:///tmp/append-example-20250307-175702/obs\n", " count 2700\n", " non_empty_domain ((0, 2699),)\n", " domain ((0, 2699),)\n", " maxdomain ((0, 9223372036854773758),)\n", " upgraded True\n", "\n", "[DataFrame] ms/RNA/var \n", " URI file:///tmp/append-example-20250307-175702/ms/RNA/var\n", " count 32738\n", " non_empty_domain ((0, 32737),)\n", " domain ((0, 32737),)\n", " maxdomain ((0, 9223372036854773758),)\n", " upgraded True\n", "\n", "[SparseNDArray] ms/RNA/X/data \n", " URI file:///tmp/append-example-20250307-175702/ms/RNA/X/data\n", " non_empty_domain ((0, 2699), (5, 32732))\n", " shape (2700, 32738)\n", " maxshape (9223372036854773759, 9223372036854773759)\n", " upgraded True\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tiledbsoma.io.show_experiment_shapes(experiment_uri)" ] }, { "cell_type": "markdown", "id": "cd08018e", "metadata": { "tags": [] }, "source": [ "## Appending a new dataset to the SOMA Experiment" ] }, { "cell_type": "markdown", "id": "10f03631", "metadata": { "tags": [] }, "source": [ "Now, let's simulate another day's sequencing run. For simplicity of this demo notebook, we'll mutate the previous dataset, changing the obs IDs to have a `-2` suffix, and also putting `Tuesday` in the `when` column. Also, we'll multiply the `X` values by 10." 
] }, { "cell_type": "code", "execution_count": 12, "id": "81ca4031", "metadata": { "tags": [] }, "outputs": [], "source": [ "ad2 = ad1.copy()\n", "ad2.obs.index = [e.replace(\"-1\", \"-2\") for e in ad1.obs.index]\n", "ad2.obs[\"when\"] = [\"Tuesday\"] * len(ad2.obs)" ] }, { "cell_type": "code", "execution_count": 13, "id": "d703ebb7", "metadata": { "tags": [] }, "outputs": [], "source": [ "ad2.X *= 10" ] }, { "cell_type": "markdown", "id": "90e85660", "metadata": { "tags": [] }, "source": [ "Now we simply ingest as before -- the only additional step is a black-box registration step which detects which cell IDs are new (here, all of them) and which gene IDs are new (here, none of them).\n", "\n", "The registration takes two forms, either of which you can use depending on your use case: `tiledbsoma.io.register_anndatas` for in-memory AnnData objects, or `tiledbsoma.io.register_h5ads` for on-storage AnnData objects." ] }, { "cell_type": "code", "execution_count": 14, "id": "38f78883-6af5-43f9-8661-add028e3dee1", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Registration: starting with experiment /tmp/append-example-20250307-175702\n", "Registration: found nobs=2700 nvar=32738 from experiment.\n", "Registration: registering AnnData object.\n", "Registration: accumulated to nobs=5400 nvar=32738.\n", "Registration: complete.\n" ] } ], "source": [ "rd = tiledbsoma.io.register_anndatas(\n", " experiment_uri,\n", " [ad2],\n", " measurement_name=measurement_name,\n", " obs_field_name=\"obs_id\",\n", " var_field_name=\"var_id\",\n", ")" ] }, { "cell_type": "markdown", "id": "e880517d-b738-4d8e-a508-e84703eec5be", "metadata": { "tags": [] }, "source": [ "As described in the tutorial on the TileDB-SOMA shape feature, the `domain` of dataframes and the `shape` of N-dimensional arrays are soft limits on what values can be read from or written to. 
In order to ingest more data, we'll need to increase those soft limits.\n", "\n", "First let's look at what they currently are:" ] }, { "cell_type": "code", "execution_count": 15, "id": "68a236ff-15fc-49f5-9285-bbd128aeae6b", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "[DataFrame] obs \n", " URI file:///tmp/append-example-20250307-175702/obs\n", " count 2700\n", " non_empty_domain ((0, 2699),)\n", " domain ((0, 2699),)\n", " maxdomain ((0, 9223372036854773758),)\n", " upgraded True\n", "\n", "[DataFrame] ms/RNA/var \n", " URI file:///tmp/append-example-20250307-175702/ms/RNA/var\n", " count 32738\n", " non_empty_domain ((0, 32737),)\n", " domain ((0, 32737),)\n", " maxdomain ((0, 9223372036854773758),)\n", " upgraded True\n", "\n", "[SparseNDArray] ms/RNA/X/data \n", " URI file:///tmp/append-example-20250307-175702/ms/RNA/X/data\n", " non_empty_domain ((0, 2699), (5, 32732))\n", " shape (2700, 32738)\n", " maxshape (9223372036854773759, 9223372036854773759)\n", " upgraded True\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tiledbsoma.io.show_experiment_shapes(exp.uri)" ] }, { "cell_type": "markdown", "id": "be11a416-c2e3-4aed-8edf-ce5f1efc777d", "metadata": { "tags": [] }, "source": [ "Then we apply the resize, and look at the domains and shapes again:" ] }, { "cell_type": "code", "execution_count": 16, "id": "09e8a10f-90c7-493c-9e91-26f09d96e527", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tiledbsoma.io.resize_experiment(exp.uri, nobs=rd.get_obs_shape(), nvars=rd.get_var_shapes())" ] }, { "cell_type": "code", "execution_count": 17, "id": "f183088d-b428-47ce-99f6-c12157867357", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "[DataFrame] obs 
\n", " URI file:///tmp/append-example-20250307-175702/obs\n", " count 2700\n", " non_empty_domain ((0, 2699),)\n", " domain ((0, 5399),)\n", " maxdomain ((0, 9223372036854773758),)\n", " upgraded True\n", "\n", "[DataFrame] ms/RNA/var \n", " URI file:///tmp/append-example-20250307-175702/ms/RNA/var\n", " count 32738\n", " non_empty_domain ((0, 32737),)\n", " domain ((0, 32737),)\n", " maxdomain ((0, 9223372036854773758),)\n", " upgraded True\n", "\n", "[SparseNDArray] ms/RNA/X/data \n", " URI file:///tmp/append-example-20250307-175702/ms/RNA/X/data\n", " non_empty_domain ((0, 2699), (5, 32732))\n", " shape (5400, 32738)\n", " maxshape (9223372036854773759, 9223372036854773759)\n", " upgraded True\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tiledbsoma.io.show_experiment_shapes(exp.uri)" ] }, { "cell_type": "markdown", "id": "bcd2c6bc-b7ec-41a3-933a-c477ee3ebf4b", "metadata": { "tags": [] }, "source": [ "Now we can ingest the new data:" ] }, { "cell_type": "code", "execution_count": 18, "id": "2f500c43-1e44-4a56-bf87-3ba936b973f7", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Wrote /tmp/append-example-20250307-175702/obs\n", "Wrote /tmp/append-example-20250307-175702/ms/RNA/var\n", "Writing /tmp/append-example-20250307-175702/ms/RNA/X/data\n", "Wrote /tmp/append-example-20250307-175702/ms/RNA/X/data\n", "Wrote file:///tmp/append-example-20250307-175702\n" ] }, { "data": { "text/plain": [ "'file:///tmp/append-example-20250307-175702'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tiledbsoma.io.from_anndata(\n", " experiment_uri,\n", " ad2,\n", " measurement_name=measurement_name,\n", " registration_mapping=rd,\n", ")" ] }, { "cell_type": "markdown", "id": "53d07733", "metadata": { "tags": [] }, "source": [ "Now let's read back the appended data. 
There are now obs IDs ending in `-1` as well as `-2`, the `when` includes `Monday` as well as `Tuesday`, and there are 5400 rows.\n", "\n", "(For `Wednesday` and onward, it'll simply be the same pattern -- we can grow our data iteratively over time, to arbitrary sizes.)" ] }, { "cell_type": "code", "execution_count": 19, "id": "a7b2aebe", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " obs_id n_genes_by_counts when\n", "0 AAACATACAACCAC-1 781 Monday\n", "1 AAACATTGAGCTAC-1 1352 Monday\n", "2 AAACATTGATCAGC-1 1131 Monday\n", "3 AAACCGTGCTTCCG-1 960 Monday\n", "4 AAACCGTGTATGCG-1 522 Monday\n", "... ... ... ...\n", "5395 TTTCGAACTCTCAT-2 1155 Tuesday\n", "5396 TTTCTACTGAGGCA-2 1227 Tuesday\n", "5397 TTTCTACTTCCTCG-2 622 Tuesday\n", "5398 TTTGCATGAGAGGC-2 454 Tuesday\n", "5399 TTTGCATGCCTCAC-2 724 Tuesday\n", "\n", "[5400 rows x 3 columns]\n" ] } ], "source": [ "with tiledbsoma.Experiment.open(experiment_uri) as exp:\n", " print(\n", " exp.obs.read(column_names=[\"obs_id\", \"n_genes_by_counts\", \"when\"])\n", " .concat()\n", " .to_pandas()\n", " )" ] }, { "cell_type": "markdown", "id": "c8bc7cd2", "metadata": { "tags": [] }, "source": [ "Let's also look at `var`, as before. Since we had data for more cells but for the same genes, there is nothing new here. The `obs` table grew downward with the new cells, and `X` grew downward with new rows, but `var` stayed the same.\n", "\n", "In real-world data, occasionally you will see a gene expressed in subsequent data which wasn't expressed in the initial data. That's fine -- you'll simply see `var` grow just a bit for those newly encountered gene IDs, with corresponding new columns for `X`." 
] }, { "cell_type": "code", "execution_count": 20, "id": "4a1cc20e", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " soma_joinid var_id gene_ids\n", "0 0 MIR1302-10 ENSG00000243485\n", "1 1 FAM138A ENSG00000237613\n", "2 2 OR4F5 ENSG00000186092\n", "3 3 RP11-34P13.7 ENSG00000238009\n", "4 4 RP11-34P13.8 ENSG00000239945\n", "... ... ... ...\n", "32733 32733 AC145205.1 ENSG00000215635\n", "32734 32734 BAGE5 ENSG00000268590\n", "32735 32735 CU459201.1 ENSG00000251180\n", "32736 32736 AC002321.2 ENSG00000215616\n", "32737 32737 AC002321.1 ENSG00000215611\n", "\n", "[32738 rows x 3 columns]\n" ] } ], "source": [ "with tiledbsoma.Experiment.open(experiment_uri) as exp:\n", " print(\n", " exp.ms[\"RNA\"]\n", " .var.read(column_names=[\"soma_joinid\", \"var_id\", \"gene_ids\"])\n", " .concat()\n", " .to_pandas()\n", " )" ] }, { "cell_type": "markdown", "id": "499785d6", "metadata": { "tags": [] }, "source": [ "And lastly, the `X` expression matrix which has grown downward with the new cells, while keeping the same width as we didn't introduce new genes:" ] }, { "cell_type": "code", "execution_count": 21, "id": "d640bde0", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " soma_dim_0 soma_dim_1 soma_data\n", "0 0 70 1.0\n", "1 0 166 1.0\n", "2 0 178 2.0\n", "3 0 326 1.0\n", "4 0 363 1.0\n", "... ... ... 
...\n", "4573763 5399 32697 10.0\n", "4573764 5399 32698 70.0\n", "4573765 5399 32702 10.0\n", "4573766 5399 32705 10.0\n", "4573767 5399 32708 30.0\n", "\n", "[4573768 rows x 3 columns]\n", "\n", "(5400, 32738)\n" ] } ], "source": [ "with tiledbsoma.Experiment.open(experiment_uri) as exp:\n", " X = exp.ms[\"RNA\"].X[\"data\"]\n", " print(X.read().tables().concat().to_pandas())\n", " print()\n", " print(X.shape)" ] }, { "cell_type": "markdown", "id": "290da7f2", "metadata": { "tags": [] }, "source": [ "## Ingesting multiple datasets to a SOMA Experiment" ] }, { "cell_type": "markdown", "id": "5d812c64", "metadata": { "tags": [] }, "source": [ "Finally, we'll demonstrate combining multiple AnnDatas into one new experiment.\n", "\n", "The flow is pretty similar to the above:\n", "\n", "1. One call to `register_anndatas` or `register_h5ads` (passing all input AnnDatas/h5ads)\n", "2. One call to `from_anndata`/`from_h5ad` *for each input AnnData*\n", "\n", "Here's a helper function to simulate multiple lab runs. As above, where we used `pbmc3k` to simulate Monday and Tuesday data, here we use `pbmc3k` to simulate multiple AnnData objects." 
] }, { "cell_type": "code", "execution_count": 22, "id": "c3c185fb", "metadata": { "tags": [] }, "outputs": [], "source": [ "def make_ad(when, scale, obs_id_suffix):\n", " ad = ad1.copy()\n", " ad.obs.index = [e.replace(\"-1\", obs_id_suffix) for e in ad.obs.index]\n", " ad.obs[\"when\"] = [when] * len(ad.obs)\n", " ad.X *= scale\n", " return ad\n", "\n", "ads = [\n", " make_ad(when, scale, f\"-{idx + 3}\")\n", " for idx, (when, scale)\n", " in enumerate({\n", " \"Wednesday\": 20,\n", " \"Thursday\": 30,\n", " \"Friday\": 40,\n", " }.items())\n", "]" ] }, { "cell_type": "markdown", "id": "7da62a10", "metadata": { "tags": [] }, "source": [ "We'll ingest these AnnData objects, as before, but this time to a fresh/empty `/tmp` location:" ] }, { "cell_type": "code", "execution_count": 23, "id": "ae2d62ae", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'/tmp/append-example-20250307-175704'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stamp = datetime.datetime.today().strftime(\"%Y%m%d-%H%M%S\")\n", "exp = None\n", "experiment_uri = f\"/tmp/append-example-{stamp}\"\n", "experiment_uri" ] }, { "cell_type": "markdown", "id": "b89fe0ee", "metadata": { "tags": [] }, "source": [ "Here we'll register all the AnnData objects. Note that the SOMA Experiment doesn't exist yet, so we pass `experiment_uri=None` to signify that." 
] }, { "cell_type": "code", "execution_count": 24, "id": "ac21dd19-2fd5-41ec-98e5-2596e0795f0d", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Registration: registering AnnData object.\n", "Registration: accumulated to nobs=2700 nvar=32738.\n", "Registration: registering AnnData object.\n", "Registration: accumulated to nobs=5400 nvar=32738.\n", "Registration: registering AnnData object.\n", "Registration: accumulated to nobs=8100 nvar=32738.\n", "Registration: complete.\n" ] } ], "source": [ "rd2 = tiledbsoma.io.register_anndatas(\n", " experiment_uri=None, # new Experiment, from scratch\n", " adatas=ads,\n", " measurement_name=measurement_name,\n", " obs_field_name=\"obs_id\",\n", " var_field_name=\"var_id\",\n", ")" ] }, { "cell_type": "markdown", "id": "77429cf0", "metadata": { "tags": [] }, "source": [ "Now that we've gotten the registrations for all the input AnnData objects, we can ingest them.\n", "\n", "Note:\n", "\n", "- Here we ingest them sequentially, in the same order as above.\n", "- But we could also ingest them in any shuffled order.\n", "- Or we could have multiple workers ingest them in parallel, one worker per AnnData object, as long as the registration data are passed to each worker." 
] }, { "cell_type": "code", "execution_count": 25, "id": "27ed22b2", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Wrote /tmp/append-example-20250307-175704/obs\n", "Wrote /tmp/append-example-20250307-175704/ms/RNA/var\n", "Writing /tmp/append-example-20250307-175704/ms/RNA/X/data\n", "Wrote /tmp/append-example-20250307-175704/ms/RNA/X/data\n", "Wrote /tmp/append-example-20250307-175704\n", "Wrote /tmp/append-example-20250307-175704/obs\n", "Wrote /tmp/append-example-20250307-175704/ms/RNA/var\n", "Writing /tmp/append-example-20250307-175704/ms/RNA/X/data\n", "Wrote /tmp/append-example-20250307-175704/ms/RNA/X/data\n", "Wrote file:///tmp/append-example-20250307-175704\n", "Wrote /tmp/append-example-20250307-175704/obs\n", "Wrote /tmp/append-example-20250307-175704/ms/RNA/var\n", "Writing /tmp/append-example-20250307-175704/ms/RNA/X/data\n", "Wrote /tmp/append-example-20250307-175704/ms/RNA/X/data\n", "Wrote file:///tmp/append-example-20250307-175704\n" ] } ], "source": [ "for ad in ads:\n", "    if tiledbsoma.Experiment.exists(experiment_uri):\n", "        tiledbsoma.io.resize_experiment(\n", "            experiment_uri,\n", "            nobs=rd2.get_obs_shape(),\n", "            nvars=rd2.get_var_shapes()\n", "        )\n", "\n", "    tiledbsoma.io.from_anndata(\n", "        experiment_uri,\n", "        ad,\n", "        measurement_name=measurement_name,\n", "        registration_mapping=rd2,\n", "    )" ] }, { "cell_type": "markdown", "id": "e2e54f89", "metadata": { "tags": [] }, "source": [ "Reading back the concatenated data, we see 2700 rows for each of {`-3`, `-4`, `-5`}:" ] }, { "cell_type": "code", "execution_count": 26, "id": "8f86fd3d", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " obs_id n_genes_by_counts when\n", "0 AAACATACAACCAC-3 781 Wednesday\n", "1 AAACATTGAGCTAC-3 1352 Wednesday\n", "2 AAACATTGATCAGC-3 1131 Wednesday\n", "3 AAACCGTGCTTCCG-3 960 Wednesday\n", "4 AAACCGTGTATGCG-3 522 Wednesday\n", "... ... ... 
...\n", "8095 TTTCGAACTCTCAT-5 1155 Friday\n", "8096 TTTCTACTGAGGCA-5 1227 Friday\n", "8097 TTTCTACTTCCTCG-5 622 Friday\n", "8098 TTTGCATGAGAGGC-5 454 Friday\n", "8099 TTTGCATGCCTCAC-5 724 Friday\n", "\n", "[8100 rows x 3 columns]\n" ] } ], "source": [ "with tiledbsoma.Experiment.open(experiment_uri) as exp:\n", " print(\n", " exp.obs.read(column_names=[\"obs_id\", \"n_genes_by_counts\", \"when\"])\n", " .concat()\n", " .to_pandas()\n", " )" ] }, { "cell_type": "markdown", "id": "4f8596b0", "metadata": { "tags": [] }, "source": [ "`var` is the same as in each of the original AnnData objects (since we added more cells with all the same genes):" ] }, { "cell_type": "code", "execution_count": 27, "id": "bffce533", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " soma_joinid var_id gene_ids\n", "0 0 MIR1302-10 ENSG00000243485\n", "1 1 FAM138A ENSG00000237613\n", "2 2 OR4F5 ENSG00000186092\n", "3 3 RP11-34P13.7 ENSG00000238009\n", "4 4 RP11-34P13.8 ENSG00000239945\n", "... ... ... 
...\n", "32733 32733 AC145205.1 ENSG00000215635\n", "32734 32734 BAGE5 ENSG00000268590\n", "32735 32735 CU459201.1 ENSG00000251180\n", "32736 32736 AC002321.2 ENSG00000215616\n", "32737 32737 AC002321.1 ENSG00000215611\n", "\n", "[32738 rows x 3 columns]\n" ] } ], "source": [ "with tiledbsoma.Experiment.open(experiment_uri) as exp:\n", " print(\n", " exp.ms[\"RNA\"]\n", " .var.read(column_names=[\"soma_joinid\", \"var_id\", \"gene_ids\"])\n", " .concat()\n", " .to_pandas()\n", " )" ] }, { "cell_type": "markdown", "id": "9a9737a0", "metadata": { "tags": [] }, "source": [ "Finally, the `X` expression matrix contains three times as many entries as the original, but has the same width (since we didn't introduce new genes):" ] }, { "cell_type": "code", "execution_count": 28, "id": "05cf63a0", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " soma_dim_0 soma_dim_1 soma_data\n", "0 0 70 20.0\n", "1 0 166 20.0\n", "2 0 178 40.0\n", "3 0 326 20.0\n", "4 0 363 20.0\n", "... ... ... ...\n", "6860647 8099 32697 40.0\n", "6860648 8099 32698 280.0\n", "6860649 8099 32702 40.0\n", "6860650 8099 32705 40.0\n", "6860651 8099 32708 120.0\n", "\n", "[6860652 rows x 3 columns]\n", "\n", "(8100, 32738)\n" ] } ], "source": [ "with tiledbsoma.Experiment.open(experiment_uri) as exp:\n", " X = exp.ms[\"RNA\"].X[\"data\"]\n", " print(X.read().tables().concat().to_pandas())\n", " print()\n", " print(X.shape)" ] } ], "metadata": { "kernelspec": { "display_name": "3.11.10", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.10" } }, "nbformat": 4, "nbformat_minor": 5 }