{ "cells": [ { "cell_type": "markdown", "id": "2cf7c05c-f723-489d-8c39-3e2841f655b0", "metadata": {}, "source": [ "# Tutorial: SOMA shapes\n", "\n", "As of TileDB-SOMA 1.15 we're proud to support a more intuitive and extensible notion of `shape`.\n", "\n", "In this notebook, we'll go through how you use shapes for the dataframes and arrays within your SOMA experiments, when and how you can resize, and options for experiments created before TileDB-SOMA 1.15.\n", "\n", "The dataset used is from Peripheral Blood Mononuclear Cells (PBMC), which is freely available from 10X Genomics.\n", "\n", "(Please also see the [Academy tutorial](https://cloud.tiledb.com/academy/structure/life-sciences/single-cell/tutorials/shapes/).)" ] }, { "cell_type": "markdown", "id": "167dba53-7da6-4984-bbe7-a5416e60325d", "metadata": {}, "source": [ "We'll start by importing `tiledbsoma`." ] }, { "cell_type": "code", "execution_count": 3, "id": "90db6017-a084-43f5-8f7e-bff281e9a898", "metadata": { "tags": [] }, "outputs": [], "source": [ "import tiledbsoma" ] }, { "cell_type": "markdown", "id": "ca9f7272-09e0-4eda-a569-8796a14bf776", "metadata": { "tags": [] }, "source": [ "## The shape feature" ] }, { "cell_type": "markdown", "id": "41358011-b835-4c3a-a75e-79a80f4cc3a1", "metadata": {}, "source": [ "As we've seen in other tutorials in this series, the SOMA data model brings across many familiar concepts from AnnData. This includes the ability to ask component dataframes and arrays what their shapes are." ] }, { "cell_type": "markdown", "id": "86b5c5e4-c0b7-4426-b312-2d19e40aa454", "metadata": {}, "source": [ "First, let's unpack and open an experiment."
] }, { "cell_type": "code", "execution_count": 4, "id": "1e02d2b9-c492-4e02-9022-203a1d65282c", "metadata": {}, "outputs": [], "source": [ "import tarfile\n", "import tempfile\n", "\n", "# Create a fresh temporary directory and extract the experiment into it.\n", "uri = tempfile.mkdtemp()\n", "with tarfile.open(\"data/pbmc3k-sparse.tgz\") as handle:\n", "    handle.extractall(uri)\n", "exp = tiledbsoma.Experiment.open(uri)" ] }, { "cell_type": "markdown", "id": "2d934ed9-5b41-4af8-a737-23583f6e885b", "metadata": {}, "source": [ "The `obs` dataframe has a domain, which is a soft limit on what values can be written to it. You'll get an exception if you try to read or write `soma_joinid` values outside this range, which is an important data-integrity safeguard.\n", "\n", "The domain we see here matches the data populated inside it.\n", "\n", "(This will usually be the case. It might not be if you've created the dataframe but not written any data to it yet -- at that point it's empty, but it still has a shape.)\n", "\n", "If you have more data -- more cells -- to add to the experiment later, you will be able to resize the `obs`, up to the `maxdomain`, which is a hard limit." ] }, { "cell_type": "code", "execution_count": 5, "id": "c90a840e-559f-4dfb-a9f8-5bcd629c714c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0, 2637),)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exp.obs.domain" ] }, { "cell_type": "code", "execution_count": 6, "id": "9a17cd6c-864d-4b83-915e-9ea67e042bab", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((0, 9223372036854773758),)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exp.obs.maxdomain" ] }, { "cell_type": "code", "execution_count": 7, "id": "9967e115-6277-4203-b61b-96d1c5b04fde", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
 | soma_joinid | obs_id | n_genes | percent_mito | n_counts | louvain |
---|---|---|---|---|---|---|
0 | 0 | AAACATACAACCAC-1 | 781 | 0.030178 | 2419.0 | CD4 T cells |
1 | 1 | AAACATTGAGCTAC-1 | 1352 | 0.037936 | 4903.0 | B cells |
2 | 2 | AAACATTGATCAGC-1 | 1131 | 0.008897 | 3147.0 | CD4 T cells |
3 | 3 | AAACCGTGCTTCCG-1 | 960 | 0.017431 | 2639.0 | CD14+ Monocytes |
4 | 4 | AAACCGTGTATGCG-1 | 522 | 0.012245 | 980.0 | NK cells |
... | ... | ... | ... | ... | ... | ... |
2633 | 2633 | TTTCGAACTCTCAT-1 | 1155 | 0.021104 | 3459.0 | CD14+ Monocytes |
2634 | 2634 | TTTCTACTGAGGCA-1 | 1227 | 0.009294 | 3443.0 | B cells |
2635 | 2635 | TTTCTACTTCCTCG-1 | 622 | 0.021971 | 1684.0 | B cells |
2636 | 2636 | TTTGCATGAGAGGC-1 | 454 | 0.020548 | 1022.0 | B cells |
2637 | 2637 | TTTGCATGCCTCAC-1 | 724 | 0.008065 | 1984.0 | CD4 T cells |

2638 rows × 6 columns
\n", "