{ "cells": [ { "cell_type": "markdown", "id": "2cf7c05c-f723-489d-8c39-3e2841f655b0", "metadata": {}, "source": [ "# Tutorial: SOMA shapes\n", "\n", "As of TileDB-SOMA 1.15 we're proud to support a more intuitive and extensible notion of `shape`.\n", "\n", "In this notebook, we'll go through how you use shapes for the dataframes and arrays within your SOMA experiments, when and how you can resize, and options for experiments created before TileDB-SOMA 1.15.\n", "\n", "The dataset used is from Peripheral Blood Mononuclear Cells (PBMC), and is freely available from 10x Genomics.\n", "\n", "(Please also see the [Academy tutorial](https://cloud.tiledb.com/academy/structure/life-sciences/single-cell/tutorials/shapes/).)" ] }, { "cell_type": "markdown", "id": "167dba53-7da6-4984-bbe7-a5416e60325d", "metadata": {}, "source": [ "We'll start by importing `tiledbsoma`." ] }, { "cell_type": "code", "execution_count": 3, "id": "90db6017-a084-43f5-8f7e-bff281e9a898", "metadata": { "tags": [] }, "outputs": [], "source": [ "import tiledbsoma" ] }, { "cell_type": "markdown", "id": "ca9f7272-09e0-4eda-a569-8796a14bf776", "metadata": { "tags": [] }, "source": [ "## The shape feature" ] }, { "cell_type": "markdown", "id": "41358011-b835-4c3a-a75e-79a80f4cc3a1", "metadata": {}, "source": [ "As we've seen in other tutorials in this series, the SOMA data model carries over many familiar concepts from AnnData. This includes the ability to ask component dataframes and arrays what their shapes are." ] }, { "cell_type": "markdown", "id": "86b5c5e4-c0b7-4426-b312-2d19e40aa454", "metadata": {}, "source": [ "First, let's unpack and open an experiment."
] }, { "cell_type": "code", "execution_count": 4, "id": "1e02d2b9-c492-4e02-9022-203a1d65282c", "metadata": {}, "outputs": [], "source": [ "import tarfile\n", "import tempfile\n", "\n", "uri = tempfile.mktemp()\n", "with tarfile.open(\"data/pbmc3k-sparse.tgz\") as handle:\n", "    handle.extractall(uri)\n", "exp = tiledbsoma.Experiment.open(uri)" ] }, { "cell_type": "markdown", "id": "2d934ed9-5b41-4af8-a737-23583f6e885b", "metadata": {}, "source": [ "The `obs` dataframe has a domain, which is a soft limit on what values can be written to it. You'll get an exception if you try to read or write `soma_joinid` values outside this range, which is an important data-integrity safeguard.\n", "\n", "The domain we see here matches the data populated inside it.\n", "\n", "(This will usually be the case. It might not be, if you've created the dataframe but not written any data to it yet -- at that point it's empty but it still has a shape.)\n", "\n", "If you have more data -- more cells -- to add to the experiment later, you will be able to resize `obs`, up to the `maxdomain`, which is a hard limit." ] }, { "cell_type": "code", "execution_count": 5, "id": "c90a840e-559f-4dfb-a9f8-5bcd629c714c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0, 2637),)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exp.obs.domain" ] }, { "cell_type": "code", "execution_count": 6, "id": "9a17cd6c-864d-4b83-915e-9ea67e042bab", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0, 9223372036854773758),)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exp.obs.maxdomain" ] }, { "cell_type": "code", "execution_count": 7, "id": "9967e115-6277-4203-b61b-96d1c5b04fde", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "<thead>\n",
 "<tr style=\"text-align: right;\"><th></th><th>soma_joinid</th><th>obs_id</th><th>n_genes</th><th>percent_mito</th><th>n_counts</th><th>louvain</th></tr>\n",
 "</thead>\n",
 "<tbody>\n",
 "<tr><th>0</th><td>0</td><td>AAACATACAACCAC-1</td><td>781</td><td>0.030178</td><td>2419.0</td><td>CD4 T cells</td></tr>\n",
 "<tr><th>1</th><td>1</td><td>AAACATTGAGCTAC-1</td><td>1352</td><td>0.037936</td><td>4903.0</td><td>B cells</td></tr>\n",
 "<tr><th>2</th><td>2</td><td>AAACATTGATCAGC-1</td><td>1131</td><td>0.008897</td><td>3147.0</td><td>CD4 T cells</td></tr>\n",
 "<tr><th>3</th><td>3</td><td>AAACCGTGCTTCCG-1</td><td>960</td><td>0.017431</td><td>2639.0</td><td>CD14+ Monocytes</td></tr>\n",
 "<tr><th>4</th><td>4</td><td>AAACCGTGTATGCG-1</td><td>522</td><td>0.012245</td><td>980.0</td><td>NK cells</td></tr>\n",
 "<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
 "<tr><th>2633</th><td>2633</td><td>TTTCGAACTCTCAT-1</td><td>1155</td><td>0.021104</td><td>3459.0</td><td>CD14+ Monocytes</td></tr>\n",
 "<tr><th>2634</th><td>2634</td><td>TTTCTACTGAGGCA-1</td><td>1227</td><td>0.009294</td><td>3443.0</td><td>B cells</td></tr>\n",
 "<tr><th>2635</th><td>2635</td><td>TTTCTACTTCCTCG-1</td><td>622</td><td>0.021971</td><td>1684.0</td><td>B cells</td></tr>\n",
 "<tr><th>2636</th><td>2636</td><td>TTTGCATGAGAGGC-1</td><td>454</td><td>0.020548</td><td>1022.0</td><td>B cells</td></tr>\n",
 "<tr><th>2637</th><td>2637</td><td>TTTGCATGCCTCAC-1</td><td>724</td><td>0.008065</td><td>1984.0</td><td>CD4 T cells</td></tr>\n",
 "</tbody>\n",
 "</table>\n",
 "<p>2638 rows × 6 columns</p>\n",
 "</div>
" ], "text/plain": [ " soma_joinid obs_id n_genes percent_mito n_counts \\\n", "0 0 AAACATACAACCAC-1 781 0.030178 2419.0 \n", "1 1 AAACATTGAGCTAC-1 1352 0.037936 4903.0 \n", "2 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 \n", "3 3 AAACCGTGCTTCCG-1 960 0.017431 2639.0 \n", "4 4 AAACCGTGTATGCG-1 522 0.012245 980.0 \n", "... ... ... ... ... ... \n", "2633 2633 TTTCGAACTCTCAT-1 1155 0.021104 3459.0 \n", "2634 2634 TTTCTACTGAGGCA-1 1227 0.009294 3443.0 \n", "2635 2635 TTTCTACTTCCTCG-1 622 0.021971 1684.0 \n", "2636 2636 TTTGCATGAGAGGC-1 454 0.020548 1022.0 \n", "2637 2637 TTTGCATGCCTCAC-1 724 0.008065 1984.0 \n", "\n", " louvain \n", "0 CD4 T cells \n", "1 B cells \n", "2 CD4 T cells \n", "3 CD14+ Monocytes \n", "4 NK cells \n", "... ... \n", "2633 CD14+ Monocytes \n", "2634 B cells \n", "2635 B cells \n", "2636 B cells \n", "2637 CD4 T cells \n", "\n", "[2638 rows x 6 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exp.obs.read().concat().to_pandas()" ] }, { "cell_type": "markdown", "id": "3deadaeb-5ab5-4c9d-ba31-79ec1d36aace", "metadata": {}, "source": [ "We'll see more about this on experiment-level resizes below, as well as in the tutorial on TileDB-SOMA's append mode." 
] }, { "cell_type": "markdown", "id": "52dcd26b-1de2-434e-8593-57d5583e4fdc", "metadata": {}, "source": [ "The `var` dataframe's domain is similar:" ] }, { "cell_type": "code", "execution_count": 8, "id": "882ab5f8-6fa7-4920-84bc-b72caf65ec09", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0, 1837),)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "var = exp.ms[\"RNA\"].var\n", "var.domain" ] }, { "cell_type": "code", "execution_count": 9, "id": "8685af65-62f4-4816-a713-55fc90e3c983", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0, 9223372036854773968),)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "var.maxdomain" ] }, { "cell_type": "markdown", "id": "22fb5a8f-245e-4b8a-9090-376ba6209dd8", "metadata": {}, "source": [ "Likewise, the N-dimensional arrays within the experiment have their shapes as well.\n", "\n", "There's an important difference: while the dataframe domain gives you the inclusive lower and upper bounds for `soma_joinid` writes, the `shape` for the N-dimensional arrays is the upper bound plus 1.\n", "\n", "Since there are 2638 cells and 1838 genes here, `X`'s shape reflects that." 
] }, { "cell_type": "code", "execution_count": 10, "id": "67e043fd-5173-44bf-8b57-811f32f4c85f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0, 2637),)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exp.obs.domain" ] }, { "cell_type": "code", "execution_count": 11, "id": "c32892ee-0c1e-4393-9ad9-620d8eb178ad", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0, 1837),)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exp.ms[\"RNA\"].var.domain" ] }, { "cell_type": "code", "execution_count": 12, "id": "cd67a1ac-ec1f-4a0a-adb3-b7c3f75592e7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2638, 1838)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exp.ms[\"RNA\"].X[\"data\"].shape" ] }, { "cell_type": "code", "execution_count": 13, "id": "98f6c5f2-d002-428d-9cfe-3c817764199a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(9223372036854773759, 9223372036854773759)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exp.ms[\"RNA\"].X[\"data\"].maxshape" ] }, { "cell_type": "markdown", "id": "f288b733-8415-4b73-8237-08f5704f0586", "metadata": {}, "source": [ "The other N-dimensional arrays are similar:" ] }, { "cell_type": "code", "execution_count": 14, "id": "96523a2b-10d3-4e2c-b76e-1b6e814c8774", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['X_draw_graph_fr', 'X_pca', 'X_tsne', 'X_umap']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obsm = exp.ms[\"RNA\"].obsm\n", "list(obsm.keys())" ] }, { "cell_type": "code", "execution_count": 15, "id": "82b16ded-298c-4d7e-8dfd-ffb4b36c37c6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['connectivities', 'distances']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obsp = 
exp.ms[\"RNA\"].obsp\n", "list(obsp.keys())" ] }, { "cell_type": "code", "execution_count": 16, "id": "bbe56e08-c237-48df-8b0d-393057e7e6fa", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(2638, 50), (9223372036854773759, 9223372036854773759)]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[\n", " obsm[\"X_pca\"].shape,\n", " obsm[\"X_pca\"].maxshape,\n", "]" ] }, { "cell_type": "code", "execution_count": 17, "id": "7577221c-85c7-4549-847d-8cbcc1b771ab", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(2638, 2638), (9223372036854773759, 9223372036854773759)]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[\n", " obsp[\"distances\"].shape,\n", " obsp[\"distances\"].maxshape,\n", "]" ] }, { "cell_type": "markdown", "id": "f7c0e4bb-30fd-4b24-9231-e72c79d0a1c2", "metadata": {}, "source": [ "In particular, the `X` array in this experiment -- and in most experiments -- is _sparse_. That means there needn't be a number in every row or cell of the matrix. Nonetheless, the shape serves as a soft limit for reads and writes: you'll get an exception trying to read or write outside of these." 
] }, { "cell_type": "markdown", "id": "aecfff79-d5ac-4361-ba56-0c5cad05206d", "metadata": {}, "source": [ "As a convenience, you can see all the experiment's objects' shapes at once as follows:\n", "\n", "```\n", "import tiledbsoma.io\n", "tiledbsoma.io.show_experiment_shapes(exp.uri)\n", "```" ] }, { "cell_type": "markdown", "id": "0836f330-6cfd-4d88-9779-ce48d1e90e82", "metadata": {}, "source": [ "As with AnnData, as a general rule you'll see the following:\n", "\n", "* An `X` array's `shape` is `nobs` x `nvar`\n", "* An `obsm` array's shape is `nobs` x some number, maybe 50\n", "* An `obsp` array's shape is `nobs` x `nobs`\n", "* A `varm` array's shape is `nvar` x some number, maybe 50\n", "* A `varp` array's shape is `nvar` x `nvar`" ] }, { "cell_type": "markdown", "id": "c33a9424-9515-4f9b-b191-f50cca39dec2", "metadata": {}, "source": [ "## When and how to resize at the experiment level" ] }, { "cell_type": "markdown", "id": "44df2aea-8480-430f-adf9-eeff960a562f", "metadata": {}, "source": [ "The primary reason you'd resize a dataframe or an array within an experiment is to append more data. For example, say you have an experiment with the results of Monday's lab run on a sample of 100,000 cells. Then maybe on Tuesday you'll want to add that day's lab run of an additional 70,000 cells to the same experiment, for a new total of 170,000 cells.
It's also possible that Tuesday's data might include some infrequently expressed genes that didn't appear in Monday's data.\n", "\n", "Because the shapes are soft limits, and reading or writing beyond them results in an exception, you'd need to resize the experiment so that its dataframes and arrays allow for the new `nobs` of 170,000.\n", "\n", "Please see the [append-mode tutorial](./tutorial_soma_append_mode.ipynb) for how to do that using `tiledbsoma.io.register_anndatas` and `tiledbsoma.io.resize_experiment`.\n", "\n", "While you can resize each dataframe and array in the experiment one at a time -- see \"Advanced usage\", below in this notebook -- by far the most common case is `tiledbsoma.io.resize_experiment`, which exists to make this simple and convenient." ] }, { "cell_type": "markdown", "id": "b50cd522-ded1-4dd8-86ec-ea7c7e8f5421", "metadata": {}, "source": [ "## How to upgrade older experiments" ] }, { "cell_type": "markdown", "id": "a397a2ff-5e9d-470f-a5e3-c0dd2fb6d731", "metadata": {}, "source": [ "Experiments created by TileDB-SOMA 1.15 and higher will look as shown above. Let's take a look at an experiment from before TileDB-SOMA 1.15." ] }, { "cell_type": "code", "execution_count": 18, "id": "6bce4d88-84ef-4f13-8441-473bbceb8292", "metadata": {}, "outputs": [], "source": [ "import tarfile\n", "import tempfile\n", "\n", "import tiledbsoma.io\n", "\n", "uri = tempfile.mktemp()\n", "with tarfile.open(\"data/pbmc3k-sparse-pre-1.15.tgz\") as handle:\n", "    handle.extractall(uri)\n", "expold = tiledbsoma.Experiment.open(uri)" ] }, { "cell_type": "markdown", "id": "3aa7fa07-aa0d-4e8e-b739-fb60d77cd971", "metadata": {}, "source": [ "This is the same PBMC3K data as above.
Compare the old and new shapes:" ] }, { "cell_type": "code", "execution_count": 19, "id": "876f04a4-089d-40fe-bfe1-6c2f8474817f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0, 9223372036854773758),)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "expold.obs.domain" ] }, { "cell_type": "code", "execution_count": 20, "id": "9d4e1343-f3bc-4ed7-a1db-fdc69eef3745", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0, 9223372036854773758),)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "expold.obs.maxdomain" ] }, { "cell_type": "code", "execution_count": 21, "id": "c02671f3-21c7-4b5d-b659-f3b14f29d1d4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "expold.obs.tiledbsoma_has_upgraded_domain" ] }, { "cell_type": "code", "execution_count": 22, "id": "2c690a8d-3654-430a-ba15-eb65dfbf789b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(9223372036854773759, 9223372036854773759),\n", " (9223372036854773759, 9223372036854773759),\n", " False]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[ expold.ms[\"RNA\"].X[\"data\"].shape, expold.ms[\"RNA\"].X[\"data\"].maxshape, expold.ms[\"RNA\"].X[\"data\"].tiledbsoma_has_upgraded_shape ]" ] }, { "cell_type": "markdown", "id": "5307b6d8-3bee-48f2-84b4-0d4346eaf50f", "metadata": {}, "source": [ "Note that for the pre-1.15 experiment, the `shape` is huge -- like the `maxshape` -- and `tiledbsoma_has_upgraded_domain` is False.\n", "\n", "To make the old experiment look like the new experiment, simply call `upgrade_experiment_shapes`, and re-open:" ] }, { "cell_type": "code", "execution_count": 23, "id": "012d726c-37f7-48a7-895c-8a69a2df2323", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 23, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "tiledbsoma.io.upgrade_experiment_shapes(expold.uri)" ] }, { "cell_type": "code", "execution_count": 24, "id": "3d1387ef-1738-420e-861e-6676afba58a3", "metadata": {}, "outputs": [], "source": [ "expold = tiledbsoma.open(expold.uri)" ] }, { "cell_type": "code", "execution_count": 25, "id": "d8a98c8d-6446-4a9f-91f5-9024e18b56da", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(2638, 1838), (9223372036854773759, 9223372036854773759), True]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[ expold.ms[\"RNA\"].X[\"data\"].shape, expold.ms[\"RNA\"].X[\"data\"].maxshape, expold.ms[\"RNA\"].X[\"data\"].tiledbsoma_has_upgraded_shape ]" ] }, { "cell_type": "markdown", "id": "3b42e49a-96d5-494b-80bd-3cf0816e5b38", "metadata": {}, "source": [ "Additionally, you can call `tiledbsoma.io.show_experiment_shapes(expold.uri)` before and after doing the upgrade.\n", "\n", "To run a pre-check, you can do\n", "\n", "```\n", "tiledbsoma.io.upgrade_experiment_shapes(expold.uri, check_only=True)\n", "```\n", "\n", "This won't change anything -- it'll simply tell you if the operation will be possible." ] }, { "cell_type": "markdown", "id": "a7d48ee7-7461-4370-95e1-00c12c3aa80b", "metadata": {}, "source": [ "## Advanced usage: dataframes with non-standard index columns" ] }, { "cell_type": "markdown", "id": "b2e69ef4-d9a7-4cc9-b6d0-9b55f41ae838", "metadata": {}, "source": [ "In the [SOMA data model](https://github.com/single-cell-data/SOMA/blob/main/abstract_specification.md), the `SparseNDArray` and `DenseNDArray` objects always have int64 dimensions named `soma_dim_0`, `soma_dim_1`, and up, and they have a numeric `soma_data` attribute for the contents of the array. Furthermore, this is always the case." 
] }, { "cell_type": "code", "execution_count": 26, "id": "687f9cde-1ef4-49de-9cea-f80a5765a58e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "soma_dim_0: int64 not null\n", "soma_dim_1: int64 not null\n", "soma_data: float not null" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exp.ms[\"RNA\"].X[\"data\"].schema" ] }, { "cell_type": "markdown", "id": "139651d1-6e20-4945-9cb5-ef6ccb3e0f81", "metadata": {}, "source": [ "For dataframes, though, while there must be a `soma_joinid` column of type int64, you can have one or more other index columns in addition -- or `soma_joinid` can be a non-index column.\n", "\n", "Still, in the default, simplest, and most common case, where `soma_joinid` is the only index column, you can think of a dataframe as having a shape just as the N-dimensional arrays do." ] }, { "cell_type": "code", "execution_count": 27, "id": "39a52da4-da14-421c-91aa-ba63259db1bd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "soma_joinid: int64 not null\n", "obs_id: large_string\n", "n_genes: int64\n", "percent_mito: float\n", "n_counts: float\n", "louvain: dictionary" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exp.obs.schema" ] }, { "cell_type": "code", "execution_count": 28, "id": "5c26c39b-c194-4d08-8840-c3d945ace18a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('soma_joinid',)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exp.obs.index_column_names" ] }, { "cell_type": "markdown", "id": "49111300-f537-433f-bc2a-b44ec3e2267a", "metadata": {}, "source": [ "But really, dataframes are capable of more than that, via the index-column names you specify at creation time.\n", "\n", "Let's create a couple of dataframes, with the same data, but different choices of index-column names."
] }, { "cell_type": "code", "execution_count": 29, "id": "453b64d0-9a6a-4b00-80f0-38259fa1ffa4", "metadata": {}, "outputs": [], "source": [ "sdfuri1 = tempfile.mktemp()\n", "sdfuri2 = tempfile.mktemp()" ] }, { "cell_type": "code", "execution_count": 30, "id": "d93f35b4-72de-4895-aeaa-a769555de7b3", "metadata": {}, "outputs": [], "source": [ "import pyarrow as pa\n", "\n", "schema = pa.schema([\n", " (\"soma_joinid\", pa.int64()),\n", " (\"mystring\", pa.string()),\n", " (\"myint\", pa.int32()),\n", " (\"myfloat\", pa.float32()),\n", "])\n", "\n", "data = pa.Table.from_pydict({\n", " \"soma_joinid\": [0, 1],\n", " \"mystring\": [\"hello\", \"world\"],\n", " \"myint\": [33, 44],\n", " \"myfloat\": [4.5, 5.5],\n", "})" ] }, { "cell_type": "code", "execution_count": 31, "id": "86a09444-0ea6-4359-bf96-871832bb3878", "metadata": {}, "outputs": [], "source": [ "with tiledbsoma.DataFrame.create(\n", " sdfuri1,\n", " schema=schema,\n", " index_column_names=[\"soma_joinid\", \"mystring\"],\n", " domain=[(0, 9), None],\n", ") as sdf1:\n", " sdf1.write(data)" ] }, { "cell_type": "markdown", "id": "ef786e7e-ee2f-4c0f-8173-a4dec2aacd15", "metadata": {}, "source": [ "Now let's look at the `domain` and `maxdomain` for these dataframes." 
] }, { "cell_type": "code", "execution_count": 32, "id": "c51530f4-a652-4b33-be26-4ec87f6b2112", "metadata": {}, "outputs": [], "source": [ "sdf1 = tiledbsoma.DataFrame.open(sdfuri1)" ] }, { "cell_type": "code", "execution_count": 33, "id": "eef4b4ba-65fe-4621-a6e0-04dd80ba9a31", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('soma_joinid', 'mystring')" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sdf1.index_column_names" ] }, { "cell_type": "markdown", "id": "21c247cc-159b-4205-b8b1-85d0469c9c76", "metadata": {}, "source": [ "Below, we'll see that the `soma_joinid` slot of the dataframe's domain is as requested.\n", "\n", "Another point is that a domain cannot be specified for string-typed index columns.\n", "\n", "At creation time, you can fill their slot in one of two ways:\n", "\n", "```\n", "    domain=[(0, 9), None],\n", "```\n", "or\n", "```\n", "    domain=[(0, 9), ('', '')],\n", "```\n", "\n", "and in either case the domain slot for a string-typed index column will read back as `('', '')`." ] }, { "cell_type": "code", "execution_count": 34, "id": "d4ef8d2b-3696-4554-aab4-6641c6f5eb98", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0, 9), ('', ''))" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sdf1.domain" ] }, { "cell_type": "code", "execution_count": 35, "id": "c71d796a-6160-4ade-81e0-8650dd773cc7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0, 9223372036854775796), ('', ''))" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sdf1.maxdomain" ] }, { "cell_type": "markdown", "id": "1d44fe60-a29d-4234-a127-5930599f607b", "metadata": {}, "source": [ "Now let's look at our other dataframe. Here `soma_joinid` is not an index column at all. This is fine, as long as the index-column values in the data you write to it uniquely identify each row."
] }, { "cell_type": "code", "execution_count": 36, "id": "417d1039-a646-4c47-bc63-22847fbaaf67", "metadata": {}, "outputs": [], "source": [ "with tiledbsoma.DataFrame.create(\n", " sdfuri2,\n", " schema=schema,\n", " index_column_names=[\"myfloat\", \"myint\"],\n", " domain=[(0, 999), (-1000, 1000)],\n", ") as sdf2:\n", " sdf2.write(data)" ] }, { "cell_type": "code", "execution_count": 37, "id": "f31e4b81-b009-497c-9fa8-a88eb14f99ff", "metadata": {}, "outputs": [], "source": [ "sdf2 = tiledbsoma.DataFrame.open(sdfuri2)" ] }, { "cell_type": "code", "execution_count": 38, "id": "fb4ee7ee-b038-4c82-a42b-bc2aa3b49734", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('myfloat', 'myint')" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sdf2.index_column_names" ] }, { "cell_type": "markdown", "id": "d3eb0ec0-dc6b-46db-8ab7-3e50cd70a621", "metadata": {}, "source": [ "The domain reads back as written." ] }, { "cell_type": "code", "execution_count": 39, "id": "2f81b556-b52b-47b2-a3fd-24f55edfbbd5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0.0, 999.0), (-1000, 1000))" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sdf2.domain" ] }, { "cell_type": "code", "execution_count": 40, "id": "abc9e8c4-ec87-4ca3-8529-44593b24c5ac", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((-3.4028234663852886e+38, 3.4028234663852886e+38), (-2147483648, 2147481645))" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sdf2.maxdomain" ] }, { "cell_type": "markdown", "id": "ed5477cd-35cb-4b4b-8a99-13cdc71149f0", "metadata": {}, "source": [ "## Advanced usage: using resize at the dataframe/array level using the SOMA API" ] }, { "cell_type": "markdown", "id": "900600c1-b240-4dbb-826c-20ade016b9a6", "metadata": {}, "source": [ "Above we saw a simple and convenient way to resize all the dataframes and arrays within an 
experiment.\n", "\n", "However, should you choose to do so, you can apply upgrades and resizes one dataframe or array at a time.\n", "\n", "For N-dimensional arrays, simply do the following:\n", "\n", "* If the array's `.tiledbsoma_has_upgraded_shape` reports False (it was created before TileDB-SOMA 1.15 and has not been upgraded), invoke the `.tiledbsoma_upgrade_shape` method.\n", "* Otherwise (it has been upgraded, or was created using TileDB-SOMA 1.15 or higher), invoke the `.resize` method." ] }, { "cell_type": "markdown", "id": "63493ede-7e4e-471d-af21-af05839aee54", "metadata": {}, "source": [ "Let's do a fresh unpack of a pre-1.15 experiment:" ] }, { "cell_type": "code", "execution_count": 41, "id": "e8112d72-0abf-46aa-8e7f-962cf37c7199", "metadata": {}, "outputs": [], "source": [ "import tarfile\n", "import tempfile\n", "\n", "uri = tempfile.mktemp()\n", "with tarfile.open(\"data/pbmc3k-sparse-pre-1.15.tgz\") as handle:\n", "    handle.extractall(uri)\n", "expold = tiledbsoma.Experiment.open(uri)\n", "X = expold.ms[\"RNA\"].X[\"data\"]" ] }, { "cell_type": "markdown", "id": "fbe1ab0a-625f-4a9b-be5c-b324e182ff08", "metadata": {}, "source": [ "Here we see that the `X` array has not been upgraded, and that its `shape` reports the same as `maxshape`:" ] }, { "cell_type": "code", "execution_count": 42, "id": "d3a0c977-edf0-4497-8adc-8d8fb86da980", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.tiledbsoma_has_upgraded_shape" ] }, { "cell_type": "code", "execution_count": 43, "id": "d781b504-f7eb-4d80-88fd-a8e18dea179d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(9223372036854773759, 9223372036854773759)" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.shape" ] }, { "cell_type": "markdown", "id": "39fd7b22-e309-4633-a605-eef4c7d6ae2d", "metadata": {}, "source": [ "Now let's give the `X` array the new-style shape:" ] }, { "cell_type": "code", "execution_count": 44,
"id": "074666e9-8e47-45bf-98c8-2bb31841cd9e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0, 2637), (0, 1837))" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.non_empty_domain()" ] }, { "cell_type": "code", "execution_count": 45, "id": "8ac6ce56-114c-414d-a631-ebce5849754a", "metadata": {}, "outputs": [], "source": [ "with tiledbsoma.Experiment.open(uri, \"w\") as exp:\n", " exp.ms[\"RNA\"].X[\"data\"].tiledbsoma_upgrade_shape([X.non_empty_domain()[0][1]+1, X.non_empty_domain()[1][1]+1])" ] }, { "cell_type": "markdown", "id": "f23bd64f-f9e5-488b-86f2-ad9d33ba9df3", "metadata": {}, "source": [ "Next let's re-open and see what happened:" ] }, { "cell_type": "code", "execution_count": 46, "id": "74ac2f06-1b41-43b2-af84-7c02c2f975ac", "metadata": {}, "outputs": [], "source": [ "expold = tiledbsoma.Experiment.open(expold.uri)\n", "X = expold.ms[\"RNA\"].X[\"data\"]" ] }, { "cell_type": "code", "execution_count": 47, "id": "0b4b0170-75c3-4d14-8aca-ad93ac2179a1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.tiledbsoma_has_upgraded_shape" ] }, { "cell_type": "code", "execution_count": 48, "id": "1f98a73d-e20f-4cb0-9f36-8824dcfffbe6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2638, 1838)" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.shape" ] }, { "cell_type": "code", "execution_count": 49, "id": "1eaa602b-8158-4413-a5a2-449ae2ab730f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(9223372036854773759, 9223372036854773759)" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.maxshape" ] }, { "cell_type": "markdown", "id": "25ebf556-3d30-46c1-88a5-d901cc0339f2", "metadata": {}, "source": [ "Furthermore, we can resize it even farther:" ] }, { "cell_type": "code", 
"execution_count": 50, "id": "ec0cbae6-84b6-44bf-b8d4-abd964929d14", "metadata": {}, "outputs": [], "source": [ "with tiledbsoma.Experiment.open(expold.uri, \"w\") as exp:\n", " exp.ms[\"RNA\"].X[\"data\"].resize([7200, 1848])" ] }, { "cell_type": "code", "execution_count": 51, "id": "175c3b22-d696-43c9-bc00-c70644c09ed8", "metadata": {}, "outputs": [], "source": [ "expold = tiledbsoma.Experiment.open(expold.uri)\n", "X = expold.ms[\"RNA\"].X[\"data\"]" ] }, { "cell_type": "code", "execution_count": 52, "id": "fa307631-094a-45c7-b5a1-d40b24bd24bb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(7200, 1848)" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.shape" ] }, { "cell_type": "markdown", "id": "ddf99890-a8d6-4e9f-b974-c03f6b9015dc", "metadata": {}, "source": [ "For dataframes, the process is similar. If you want to expand only the soft limits for `soma_joinid`, you can use some simpler methods:\n", "\n", "* If the dataframe's `tiledbsoma_has_upgraded_domain` reports False, invoke `.tiledbsoma_upgrade_domain`\n", "* Otherwise invoke the `.change_domain` method.\n" ] }, { "cell_type": "code", "execution_count": 53, "id": "0362cafc-50ad-467a-ad2a-9a9bdb9de4a3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "expold.obs.tiledbsoma_has_upgraded_domain" ] }, { "cell_type": "code", "execution_count": 54, "id": "e5f7d227-8099-41bf-b9b9-66abf78a895a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0, 9223372036854773758),)" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "expold.obs.domain" ] }, { "cell_type": "code", "execution_count": 55, "id": "1cfe37da-0b5f-47a2-b7cf-f572f1f5b3ae", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0, 9223372036854773758),)" ] }, "execution_count": 55, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "expold.obs.maxdomain" ] }, { "cell_type": "code", "execution_count": 56, "id": "0bd612dd-6722-47ac-ad1f-9ae48fc3d121", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0, 2637),)" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "expold.obs.non_empty_domain()" ] }, { "cell_type": "code", "execution_count": 57, "id": "07034a75-f509-42d5-8de9-56798b523f2b", "metadata": {}, "outputs": [], "source": [ "with tiledbsoma.Experiment.open(expold.uri, \"w\") as exp:\n", " exp.obs.tiledbsoma_upgrade_domain([[0, expold.obs.non_empty_domain()[0][1]+1]])" ] }, { "cell_type": "code", "execution_count": 58, "id": "74e90a63-c8e7-4892-895a-584363d58284", "metadata": {}, "outputs": [], "source": [ "expold = tiledbsoma.Experiment.open(expold.uri)" ] }, { "cell_type": "code", "execution_count": 59, "id": "ab5ed97a-f151-42de-be48-237e0b788501", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "expold.obs.tiledbsoma_has_upgraded_domain" ] }, { "cell_type": "code", "execution_count": 60, "id": "f70a0d9c-1cde-4f80-9b3b-d2692e796973", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0, 2638),)" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "expold.obs.domain" ] }, { "cell_type": "code", "execution_count": 61, "id": "41968785-373b-4c48-b90e-3a24a6b9c199", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0, 9223372036854773758),)" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "expold.obs.maxdomain" ] }, { "cell_type": "code", "execution_count": null, "id": "8eb0e844-55d2-4c0c-bfd8-aee10a0303c6", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "3.11.10", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { 
"name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.10" } }, "nbformat": 4, "nbformat_minor": 5 }