{ "cells": [ { "cell_type": "markdown", "id": "d03b9481-43e7-4f16-8141-4c0ab305ec74", "metadata": { "tags": [] }, "source": [ "# Tutorial: Reading SOMA Objects" ] }, { "cell_type": "markdown", "id": "2683d7ec-03f8-4a34-9403-4394420cd29c", "metadata": {}, "source": [ "In this notebook we'll learn how to read from various SOMA objects. We will assume familiarity with SOMA objects already, so it is recommended to go through the [Tutorial: SOMA Objects](https://github.com/single-cell-data/TileDB-SOMA/blob/main/apis/python/notebooks/tutorial_soma_objects.ipynb) before." ] }, { "cell_type": "markdown", "id": "a66a7b99-6e25-4f46-9555-41edbc7fb3ee", "metadata": { "tags": [] }, "source": [ "This implementation of SOMA relies on [TileDB](https://tiledb.com/), which is a storage format that allows working with large files without having to fully load them in memory. Files can be either read from disk or from a remote source, like an S3 bucket. " ] }, { "cell_type": "markdown", "id": "9cbeef30-f87a-4d59-bee1-6a1dc865aefb", "metadata": { "tags": [] }, "source": [ "The core feature of SOMA is to allow reading _subsets_ of the data using slices: only the portion of required data is read from disk/network.\n", "SOMA uses [Apache Arrow](https://arrow.apache.org/) as an intermediate in-memory storage. From here, the slices can be further converted into more familiar formats, like a scipy.sparse matrix or a numpy ndarray. Consult the [Python bindings for Apache Arrow documentation](https://arrow.apache.org/docs/python/index.html) for more information." ] }, { "cell_type": "markdown", "id": "3fef57da-e665-4990-aebd-a89596031935", "metadata": { "tags": [] }, "source": [ "In this notebook, we will use the Peripheral Blood Mononuclear Cells (PBMC) dataset. We will focus on reading from its `obs` `DataFrame` and from the `X` `SparseNDArray`. 
This is a small dataset that fits in memory, but we'll focus on operations that read subsets of the data, since those also work on larger datasets." ] }, { "cell_type": "markdown", "id": "42228ff1-4660-4dd6-b627-75b54e6abcb8", "metadata": { "tags": [] }, "source": [ "## Reading a DataFrame" ] }, { "cell_type": "markdown", "id": "28401793-71c1-4d1c-ac9a-2fe255d8821d", "metadata": { "tags": [] }, "source": [ "### Introduction" ] }, { "cell_type": "code", "execution_count": 1, "id": "f843b57e-efd1-4a27-9778-c8b2c1aaa686", "metadata": { "tags": [] }, "outputs": [], "source": [ "import tiledbsoma" ] }, { "cell_type": "code", "execution_count": 2, "id": "18d5412e-5bae-4706-bb79-2692635190ce", "metadata": { "tags": [] }, "outputs": [], "source": [ "import tarfile\n", "import tempfile\n", "\n", "# mkdtemp() creates the directory securely; mktemp() is deprecated and race-prone\n", "sparse_uri = tempfile.mkdtemp()\n", "with tarfile.open(\"data/pbmc3k-sparse.tgz\") as handle:\n", "    handle.extractall(sparse_uri)\n", "experiment = tiledbsoma.Experiment.open(sparse_uri)" ] }, { "cell_type": "markdown", "id": "566b4df7-b26a-487b-8d3f-8616bd84a23c", "metadata": { "tags": [] }, "source": [ "All read operations need to be performed using the `.read()` method. 
For a `DataFrame`, we want to then call `.concat()` to obtain a [PyArrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html):" ] }, { "cell_type": "code", "execution_count": 3, "id": "c9f445b5-6fee-40a7-9ba8-37c9a72efb2f", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "pyarrow.Table\n", "soma_joinid: int64\n", "obs_id: large_string\n", "n_genes: int64\n", "percent_mito: float\n", "n_counts: float\n", "louvain: dictionary\n", "----\n", "soma_joinid: [[0,1,2,3,4,...,2633,2634,2635,2636,2637]]\n", "obs_id: [[\"AAACATACAACCAC-1\",\"AAACATTGAGCTAC-1\",\"AAACATTGATCAGC-1\",\"AAACCGTGCTTCCG-1\",\"AAACCGTGTATGCG-1\",...,\"TTTCGAACTCTCAT-1\",\"TTTCTACTGAGGCA-1\",\"TTTCTACTTCCTCG-1\",\"TTTGCATGAGAGGC-1\",\"TTTGCATGCCTCAC-1\"]]\n", "n_genes: [[781,1352,1131,960,522,...,1155,1227,622,454,724]]\n", "percent_mito: [[0.030177759,0.037935957,0.008897362,0.017430846,0.012244898,...,0.021104366,0.00929422,0.021971496,0.020547945,0.008064516]]\n", "n_counts: [[2419,4903,3147,2639,980,...,3459,3443,1684,1022,1984]]\n", "louvain: [ -- dictionary:\n", "[\"CD4 T cells\",\"CD14+ Monocytes\",\"B cells\",\"CD8 T cells\",\"NK cells\",\"FCGR3A+ Monocytes\",\"Dendritic cells\",\"Megakaryocytes\"] -- indices:\n", "[0,2,0,1,4,...,1,2,2,2,0]]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs = experiment.obs\n", "table = obs.read().concat()\n", "table" ] }, { "cell_type": "markdown", "id": "e1005cc8-289f-470d-ad60-0ee0975b3fe5", "metadata": { "tags": [] }, "source": [ "From here, we can directly use any of the PyArrow Table methods, for instance:" ] }, { "cell_type": "code", "execution_count": 4, "id": "11291dbd-3272-4c84-bed5-4e1b67f408b9", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "pyarrow.Table\n", "soma_joinid: int64\n", "obs_id: large_string\n", "n_genes: int64\n", "percent_mito: float\n", "n_counts: float\n", "louvain: dictionary\n", "----\n", "soma_joinid: 
[[270,1163,1891,926,277,...,2186,1522,662,1288,1840]]\n", "obs_id: [[\"ACGAACTGGCTATG-1\",\"CGATACGACAGGAG-1\",\"GGGCCAACCTTGGA-1\",\"CAGGTTGAGGATCT-1\",\"ACGAGGGACAGGAG-1\",...,\"TAGTCTTGGCTGTA-1\",\"GACGCTCTCTCTCG-1\",\"ATCTCAACCTCGAA-1\",\"CTAATAGAGCTATG-1\",\"GGCATATGGGGAGT-1\"]]\n", "n_genes: [[2455,2033,2020,2000,1997,...,270,267,246,239,212]]\n", "percent_mito: [[0.015774649,0.022166021,0.010576352,0.026962927,0.014631685,...,0,0.032258064,0,0.0016666667,0.012173913]]\n", "n_counts: [[8875,6722,8415,8011,7928,...,652,682,609,600,575]]\n", "louvain: [ -- dictionary:\n", "[\"CD4 T cells\",\"CD14+ Monocytes\",\"B cells\",\"CD8 T cells\",\"NK cells\",\"FCGR3A+ Monocytes\",\"Dendritic cells\",\"Megakaryocytes\"] -- indices:\n", "[7,0,6,2,6,...,0,7,0,0,7]]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table.sort_by([(\"n_genes\", \"descending\")])" ] }, { "cell_type": "markdown", "id": "fe90e126-ca37-4444-acdf-c7120fe2bea8", "metadata": { "tags": [] }, "source": [ "Alternatively, we can convert the `DataFrame` to a different format, like a [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html):" ] }, { "cell_type": "code", "execution_count": 5, "id": "f4073542-da95-4158-97c7-c1dd442de930", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " soma_joinid obs_id n_genes percent_mito n_counts \\\n", "0 0 AAACATACAACCAC-1 781 0.030178 2419.0 \n", "1 1 AAACATTGAGCTAC-1 1352 0.037936 4903.0 \n", "2 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 \n", "3 3 AAACCGTGCTTCCG-1 960 0.017431 2639.0 \n", "4 4 AAACCGTGTATGCG-1 522 0.012245 980.0 \n", "... ... ... ... ... ... \n", "2633 2633 TTTCGAACTCTCAT-1 1155 0.021104 3459.0 \n", "2634 2634 TTTCTACTGAGGCA-1 1227 0.009294 3443.0 \n", "2635 2635 TTTCTACTTCCTCG-1 622 0.021971 1684.0 \n", "2636 2636 TTTGCATGAGAGGC-1 454 0.020548 1022.0 \n", "2637 2637 TTTGCATGCCTCAC-1 724 0.008065 1984.0 \n", "\n", " louvain \n", "0 CD4 T cells \n", "1 B cells \n", "2 CD4 T cells \n", "3 CD14+ Monocytes \n", "4 NK cells \n", "... ... \n", "2633 CD14+ Monocytes \n", "2634 B cells \n", "2635 B cells \n", "2636 B cells \n", "2637 CD4 T cells \n", "\n", "[2638 rows x 6 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table.to_pandas()" ] }, { "cell_type": "markdown", "id": "54832ad3-b548-4c62-9c1c-1edd14c002b3", "metadata": { "tags": [] }, "source": [ "### Reading slices of data" ] }, { "cell_type": "markdown", "id": "6cab4fe3-76ac-4507-94e5-bc038b35a1cd", "metadata": { "tags": [] }, "source": [ "As previously mentioned, the core feature of SOMA is reading slices of the data without fetching the whole dataset in memory. To do that, the `.read()` method supports a `coords` parameter that allows data slicing. 
\n", "\n", "Before we do that, let's take a look at the schema of the `obs` dataframe:" ] }, { "cell_type": "code", "execution_count": 6, "id": "83a6965d-441a-473c-8307-fda71a68ed11", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "soma_joinid: int64 not null\n", "obs_id: large_string\n", "n_genes: int64\n", "percent_mito: float\n", "n_counts: float\n", "louvain: dictionary" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs.schema" ] }, { "cell_type": "markdown", "id": "dc5acf67-36c3-40e0-8aa5-2a9cde7c69b8", "metadata": { "tags": [] }, "source": [ "And also its domain:" ] }, { "cell_type": "code", "execution_count": 7, "id": "92fe8e3b-13ad-4922-9bbd-e9fb4600f56e", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "((0, 2637),)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs.domain" ] }, { "cell_type": "markdown", "id": "72d5ccfc-82b7-44d5-af80-e6ac6c2a135b", "metadata": {}, "source": [ "With a SOMA DataFrame, you can only slice across an indexed column, so let's look at the indexed columns:" ] }, { "cell_type": "code", "execution_count": 8, "id": "e7e79037-9494-4b0b-a898-d9364fa1758b", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "('soma_joinid',)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs.index_column_names" ] }, { "cell_type": "markdown", "id": "25095993-9343-4483-9c86-5ff3c4a40df6", "metadata": {}, "source": [ "In this case our index consists of just `soma_joinid`, which is an integer column that can be used to join other SOMA objects in the same experiment. \n", "\n", "\n", "Let's look at a few ways to slice the dataframe." 
] }, { "cell_type": "markdown", "id": "f6245b69-da9e-4d1f-971c-b86e3a2b69aa", "metadata": { "tags": [] }, "source": [ "#### Select a single row" ] }, { "cell_type": "code", "execution_count": 9, "id": "042da676-1dc7-4916-bccd-dbcf9e753de8", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " soma_joinid obs_id n_genes percent_mito n_counts louvain\n", "0 0 AAACATACAACCAC-1 781 0.030178 2419.0 CD4 T cells" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs.read([[0]]).concat().to_pandas()" ] }, { "cell_type": "markdown", "id": "c17e4c73-fca6-440f-b0a4-343e36c604f5", "metadata": {}, "source": [ "#### Select multiple, non-contiguous rows" ] }, { "cell_type": "code", "execution_count": 10, "id": "d77e6897-7940-49fe-a269-548171336fc9", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " soma_joinid obs_id n_genes percent_mito n_counts louvain\n", "0 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 CD4 T cells\n", "1 5 AAACGCACTGGTAC-1 782 0.016644 2163.0 CD8 T cells" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs.read([[2, 5]]).concat().to_pandas()" ] }, { "cell_type": "markdown", "id": "33b6cf15-5005-4aef-9212-cdd34be8a9ba", "metadata": { "tags": [] }, "source": [ "#### Select a slice of rows\n", "\n", "Note that SOMA slices include both endpoints, so `slice(0, 5)` returns six rows (join IDs 0 through 5)." ] }, { "cell_type": "code", "execution_count": 11, "id": "dfba21c5-504c-4644-ad53-7cf7865ebf31", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " soma_joinid obs_id n_genes percent_mito n_counts \\\n", "0 0 AAACATACAACCAC-1 781 0.030178 2419.0 \n", "1 1 AAACATTGAGCTAC-1 1352 0.037936 4903.0 \n", "2 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 \n", "3 3 AAACCGTGCTTCCG-1 960 0.017431 2639.0 \n", "4 4 AAACCGTGTATGCG-1 522 0.012245 980.0 \n", "5 5 AAACGCACTGGTAC-1 782 0.016644 2163.0 \n", "\n", " louvain \n", "0 CD4 T cells \n", "1 B cells \n", "2 CD4 T cells \n", "3 CD14+ Monocytes \n", "4 NK cells \n", "5 CD8 T cells " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs.read([slice(0, 5)]).concat().to_pandas()" ] }, { "cell_type": "markdown", "id": "50f00bbf-6bd6-4279-9e92-b4bf544c4702", "metadata": { "tags": [] }, "source": [ "#### Select a subset of columns only" ] }, { "cell_type": "code", "execution_count": 12, "id": "a91e68e7-5c06-4241-b4ee-42a59794e520", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " obs_id louvain\n", "0 AAACATACAACCAC-1 CD4 T cells\n", "1 AAACATTGAGCTAC-1 B cells\n", "2 AAACATTGATCAGC-1 CD4 T cells\n", "3 AAACCGTGCTTCCG-1 CD14+ Monocytes\n", "4 AAACCGTGTATGCG-1 NK cells\n", "5 AAACGCACTGGTAC-1 CD8 T cells" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs.read([slice(0, 5)], column_names=[\"obs_id\", \"louvain\"]).concat().to_pandas()" ] }, { "cell_type": "markdown", "id": "2bc1212e-ac25-4278-a14a-b0b876337ff6", "metadata": { "tags": [] }, "source": [ "### Filter data using complex queries" ] }, { "cell_type": "markdown", "id": "d60a0981-38d6-4460-a4b3-d3431ce43b40", "metadata": { "tags": [] }, "source": [ "SOMA also supports filtering data with more complex queries, passed to `.read()` through the `value_filter` parameter. For a more detailed reference, take a look at the [query condition](https://github.com/single-cell-data/TileDB-SOMA/blob/main/apis/python/src/tiledbsoma/_query_condition.py) source code.\n", "\n", "Here are a few examples:" ] }, { "cell_type": "markdown", "id": "266a6574-b1f3-41f0-94a3-1ccfec8b82af", "metadata": { "tags": [] }, "source": [ "#### Filter all cells with a Louvain categorization of \"B cells\"" ] }, { "cell_type": "code", "execution_count": 13, "id": "77feed71-7ac9-44d5-af35-c37601067092", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " soma_joinid obs_id n_genes percent_mito n_counts louvain\n", "0 1 AAACATTGAGCTAC-1 1352 0.037936 4903.0 B cells\n", "1 10 AAACTTGAAAAACG-1 1116 0.026316 3914.0 B cells\n", "2 18 AAAGGCCTGTCTAG-1 1446 0.015283 4973.0 B cells\n", "3 19 AAAGTTTGATCACG-1 446 0.034700 1268.0 B cells\n", "4 20 AAAGTTTGGGGTGA-1 1020 0.025907 3281.0 B cells\n", ".. ... ... ... ... ... ...\n", "337 2628 TTTCAGTGTCACGA-1 700 0.034314 1632.0 B cells\n", "338 2630 TTTCAGTGTGCAGT-1 637 0.018925 1321.0 B cells\n", "339 2634 TTTCTACTGAGGCA-1 1227 0.009294 3443.0 B cells\n", "340 2635 TTTCTACTTCCTCG-1 622 0.021971 1684.0 B cells\n", "341 2636 TTTGCATGAGAGGC-1 454 0.020548 1022.0 B cells\n", "\n", "[342 rows x 6 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs.read(value_filter=\"louvain == 'B cells'\").concat().to_pandas()" ] }, { "cell_type": "markdown", "id": "751027d9-2989-47db-9531-7e3e706de942", "metadata": { "tags": [] }, "source": [ "#### Filter all cells with a Louvain categorization of either \"CD4 T cells\" or \"CD8 T cells\"" ] }, { "cell_type": "code", "execution_count": 14, "id": "157d42e0-89d8-4ab2-a4c9-8c9f0f16f943", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " soma_joinid obs_id n_genes percent_mito n_counts \\\n", "0 0 AAACATACAACCAC-1 781 0.030178 2419.0 \n", "1 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 \n", "2 5 AAACGCACTGGTAC-1 782 0.016644 2163.0 \n", "3 6 AAACGCTGACCAGT-1 783 0.038161 2175.0 \n", "4 7 AAACGCTGGTTCTT-1 790 0.030973 2260.0 \n", "... ... ... ... ... ... \n", "1455 2621 TTTAGCTGATACCG-1 887 0.022876 2754.0 \n", "1456 2626 TTTCACGAGGTTCA-1 721 0.013261 2036.0 \n", "1457 2627 TTTCAGTGGAAGGC-1 692 0.015169 1780.0 \n", "1458 2631 TTTCCAGAGGTGAG-1 873 0.006859 2187.0 \n", "1459 2637 TTTGCATGCCTCAC-1 724 0.008065 1984.0 \n", "\n", " louvain \n", "0 CD4 T cells \n", "1 CD4 T cells \n", "2 CD8 T cells \n", "3 CD8 T cells \n", "4 CD8 T cells \n", "... ... \n", "1455 CD4 T cells \n", "1456 CD4 T cells \n", "1457 CD8 T cells \n", "1458 CD4 T cells \n", "1459 CD4 T cells \n", "\n", "[1460 rows x 6 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs.read(value_filter=\"(louvain == 'CD4 T cells') or (louvain == 'CD8 T cells')\").concat().to_pandas()" ] }, { "cell_type": "markdown", "id": "518a77ab-49f1-47bb-8114-e6f62f32616d", "metadata": { "tags": [] }, "source": [ "#### Filter all cells with a Louvain categorization of \"CD4 T cells\" and more than 1500 genes" ] }, { "cell_type": "code", "execution_count": 15, "id": "8bf9180b-5ebb-4a19-a602-b9495b33617f", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " soma_joinid obs_id n_genes percent_mito n_counts louvain\n", "0 26 AAATCAACCCTATT-1 1545 0.024313 5676.0 CD4 T cells\n", "1 357 ACTCTCCTGCATAC-1 1750 0.017436 5850.0 CD4 T cells\n", "2 473 AGCTGCCTTTCATC-1 1703 0.029547 5212.0 CD4 T cells\n", "3 945 CATACTTGGGTTAC-1 1938 0.023580 7167.0 CD4 T cells\n", "4 1163 CGATACGACAGGAG-1 2033 0.022166 6722.0 CD4 T cells\n", "5 1320 CTATACTGTTCGTT-1 1543 0.012395 4760.0 CD4 T cells\n", "6 1548 GAGCATACTTTGCT-1 1753 0.016739 6691.0 CD4 T cells\n", "7 1993 GTGATGACAAGTGA-1 1819 0.021172 6329.0 CD4 T cells\n", "8 2313 TCGGACCTGTACAC-1 1567 0.014288 5599.0 CD4 T cells\n", "9 2365 TGAGACACAAGGTA-1 1549 0.013242 5135.0 CD4 T cells" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs.read(value_filter=\"(louvain == 'CD4 T cells') and (n_genes > 1500)\").concat().to_pandas()" ] }, { "cell_type": "markdown", "id": "fa8abca5-0ba4-40a9-befb-e1d86f6bfd79", "metadata": { "tags": [] }, "source": [ "## Reading a SparseNDArray" ] }, { "cell_type": "markdown", "id": "8032d56c-a472-4f35-b6db-cd03ee1e7fcd", "metadata": {}, "source": [ "For `SparseNDArray`, let's consider the X matrix:" ] }, { "cell_type": "code", "execution_count": 16, "id": "f5dc0937-9022-4cb3-8ee7-73bdfe1f234d", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = experiment.ms[\"RNA\"].X[\"data\"]\n", "X" ] }, { "cell_type": "markdown", "id": "9e743e0d-25f1-4ddd-9ffa-936affce1fd8", "metadata": { "tags": [] }, "source": [ "Similarly to `DataFrame`, we need to use the `.read()` method:" ] }, { "cell_type": "code", "execution_count": 17, "id": "b580744c-0e2a-4a65-b86b-6c2f1eb9b3ae", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.read()" ] }, { "cell_type": "markdown", "id": 
"5d64afe7-6fe9-4872-864a-88329042fe72", "metadata": { "tags": [] }, "source": [ "In this case, we have two options. Let's start by converting this into an [Arrow SparseCOOTensor](https://arrow.apache.org/docs/cpp/api/tensor.html#sparse-tensors):" ] }, { "cell_type": "code", "execution_count": 18, "id": "1bbcc301-815e-482e-9558-5a1cd2e117c6", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "\n", "type: float\n", "shape: (2638, 1838)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tensor = X.read().coos().concat()\n", "tensor" ] }, { "cell_type": "markdown", "id": "3c195a65-93ee-43ab-8eec-692c70d29512", "metadata": { "tags": [] }, "source": [ "In this example, we obtain a 2-dimensional tensor (a matrix). Note that `shape` here indicates the _capacity_ of the tensor, rather than the actual size. \n", "\n", "By default, a `SparseNDArray` gets created with a much higher capacity to accommodate further writes. Since this is a read scenario, and the shape of the matrix is known, we can pass that shape to `.coos()` so that the resulting tensor is sized accordingly:" ] }, { "cell_type": "code", "execution_count": 19, "id": "cbeea976-f00c-47b6-b8d1-4ea1b31afd25", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "\n", "type: float\n", "shape: (2638, 1838)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n_obs = len(obs)\n", "n_var = len(experiment.ms[\"RNA\"].var)\n", "\n", "tensor = X.read().coos((n_obs, n_var)).concat()\n", "tensor" ] }, { "cell_type": "markdown", "id": "df7d1739-ec49-41dc-8022-add924c2767c", "metadata": {}, "source": [ "We can convert this to a `scipy.sparse.coo_matrix`:" ] }, { "cell_type": "code", "execution_count": 20, "id": "28d04c96-01a3-429e-a5a8-84ec8bca0453", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "tensor.to_scipy()" ] }, { "cell_type": "markdown", "id": "4e2210cf-35d2-49d4-a7d3-261c01469e5c", "metadata": { "tags": [] }, "source": [ "### Reading slices of data" ] }, { "cell_type": "markdown", "id": "b78cdfe0-6cfc-4183-9a74-023668b21208", "metadata": {}, "source": [ "Similarly to `DataFrame`, we can retrieve subsets of the data that can fit in memory. This is particularly important with `SparseNDArray`s since these arrays are often several gigabytes in size. \n", "\n", "Unlike `DataFrame`s, `SparseNDArray`s are always indexed using an offset (zero-based) integer on each dimension. Therefore, if the array is N-dimensional, the `.read()` method accepts an N-tuple (or list) argument that specifies how to slice the array. An empty element or `slice(None)` selects everything in that dimension.\n", "\n", "For example, here's how to fetch rows 0 through 5 of the matrix (as with `DataFrame`, both endpoints of a slice are included):" ] }, { "cell_type": "code", "execution_count": 21, "id": "6f757ce0-d9dc-44bb-99f6-3be709edff0e", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "\n", "type: float\n", "shape: (2638, 1838)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Y = X.read([slice(0, 5)]).coos().concat()\n", "Y" ] }, { "cell_type": "markdown", "id": "4f99d59e-8be4-4903-94ac-8c740caacac3", "metadata": { "tags": [] }, "source": [ "Being only a handful of rows, this slice fits in memory even for matrices much bigger than the one used in this example. 
Note that we can't simply materialize to a dense matrix since the shape is too big (running `Y.to_scipy().todense()` will raise an error), so we need to set bounding boxes:" ] }, { "cell_type": "code", "execution_count": 22, "id": "32f21b63-83a7-45e3-a18b-f44f068b1697", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "\n", "type: float\n", "shape: (2638, 1838)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Y = X.read((slice(0, 5),)).coos((n_obs, n_var)).concat()\n", "Y" ] }, { "cell_type": "markdown", "id": "7fee7237-aa55-408c-bee2-3cf4f6844831", "metadata": { "tags": [] }, "source": [ "Now we can get a dense representation of it:" ] }, { "cell_type": "code", "execution_count": 23, "id": "45874532-327a-46fa-ad27-7043bd80e8f9", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "matrix([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,\n", " -0.2090951 , -0.5312034 ],\n", " [-0.21458222, -0.37265295, -0.05480444, ..., -0.26684406,\n", " -0.31314576, -0.5966544 ],\n", " [-0.37688747, -0.2950843 , -0.0575275 , ..., -0.15865596,\n", " -0.17087643, 1.379 ],\n", " ...,\n", " [ 0. , 0. , 0. , ..., 0. ,\n", " 0. , 0. ],\n", " [ 0. , 0. , 0. , ..., 0. ,\n", " 0. , 0. ],\n", " [ 0. , 0. , 0. , ..., 0. ,\n", " 0. , 0. 
]], dtype=float32)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Y.to_scipy().todense()" ] }, { "cell_type": "markdown", "id": "72f42cb0-e7f6-45dc-a56a-b5b8c4c068b1", "metadata": {}, "source": [ "Alternatively, we can convert it to a `scipy.sparse.csr_matrix`, which allows selecting specific rows:" ] }, { "cell_type": "code", "execution_count": 24, "id": "972de6ca-19a7-49f0-8fe9-7ece199cad1d", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Z = Y.to_scipy().tocsr()\n", "Z" ] }, { "cell_type": "code", "execution_count": 25, "id": "10c5f21e-5ec2-44ee-bae8-302d5f874a41", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Z.getrow(0)" ] }, { "cell_type": "markdown", "id": "4e24d636-dd66-4378-a652-d6b5086c76b1", "metadata": {}, "source": [ "Similarly, we can slice the original `SparseNDArray` by listing individual, non-contiguous rows:" ] }, { "cell_type": "code", "execution_count": 26, "id": "10856361-4b6e-476b-9938-a52ef50e6db1", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "matrix([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,\n", " -0.2090951 , -0.5312034 ],\n", " [ 0. , 0. , 0. , ..., 0. ,\n", " 0. , 0. ],\n", " [-0.37688747, -0.2950843 , -0.0575275 , ..., -0.15865596,\n", " -0.17087643, 1.379 ],\n", " ...,\n", " [ 0. , 0. , 0. , ..., 0. ,\n", " 0. , 0. ],\n", " [ 0. , 0. , 0. , ..., 0. ,\n", " 0. , 0. ],\n", " [ 0. , 0. , 0. , ..., 0. ,\n", " 0. , 0. 
]], dtype=float32)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.read([[0,2]]).coos((n_obs, n_var)).concat().to_scipy().todense()" ] }, { "cell_type": "markdown", "id": "8c5eba56-de07-4025-8ccc-3dd4f4fedb90", "metadata": {}, "source": [ "The same approach can be used to filter across all the dimensions." ] }, { "cell_type": "code", "execution_count": 27, "id": "45f2aa69-238e-478f-8c3a-9d4c90f3e507", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ], "text/plain": [ " soma_joinid obs_id n_genes percent_mito n_counts \\\n", "0 0 AAACATACAACCAC-1 781 0.030178 2419.0 \n", "1 1 AAACATTGAGCTAC-1 1352 0.037936 4903.0 \n", "2 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 \n", "3 3 AAACCGTGCTTCCG-1 960 0.017431 2639.0 \n", "4 4 AAACCGTGTATGCG-1 522 0.012245 980.0 \n", "... ... ... ... ... ... \n", "2633 2633 TTTCGAACTCTCAT-1 1155 0.021104 3459.0 \n", "2634 2634 TTTCTACTGAGGCA-1 1227 0.009294 3443.0 \n", "2635 2635 TTTCTACTTCCTCG-1 622 0.021971 1684.0 \n", "2636 2636 TTTGCATGAGAGGC-1 454 0.020548 1022.0 \n", "2637 2637 TTTGCATGCCTCAC-1 724 0.008065 1984.0 \n", "\n", " louvain \n", "0 CD4 T cells \n", "1 B cells \n", "2 CD4 T cells \n", "3 CD14+ Monocytes \n", "4 NK cells \n", "... ... \n", "2633 CD14+ Monocytes \n", "2634 B cells \n", "2635 B cells \n", "2636 B cells \n", "2637 CD4 T cells \n", "\n", "[2638 rows x 6 columns]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.obs.read().concat().to_pandas()" ] }, { "cell_type": "code", "execution_count": 28, "id": "74fb8a3e-9139-4711-ae12-6064222eeae0", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ], "text/plain": [ " soma_joinid var_id n_cells\n", "0 0 TNFRSF4 155\n", "1 1 CPSF3L 202\n", "2 2 ATAD3C 9\n", "3 3 C1orf86 501\n", "4 4 RER1 608\n", "... ... ... ...\n", "1833 1833 ICOSLG 34\n", "1834 1834 SUMO3 570\n", "1835 1835 SLC19A1 31\n", "1836 1836 S100B 94\n", "1837 1837 PRMT2 588\n", "\n", "[1838 rows x 3 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "var = experiment.ms[\"RNA\"].var\n", "var.read().concat().to_pandas()" ] }, { "cell_type": "markdown", "id": "4fb5fbd0-f506-48a1-a0c5-717daa85e840", "metadata": { "tags": [] }, "source": [ "### Exercise: compute raw counts for a gene" ] }, { "cell_type": "markdown", "id": "0de38cb2-5131-4a05-a499-d24bf2e87c1e", "metadata": { "tags": [] }, "source": [ "In this exercise, we will compute the raw counts for a gene. We will only use slices, so at no point will the `SparseNDArray` be fully loaded into memory.\n", "\n", "Let's start by looking at a specific gene (`ATAD3C`) in the `var` dataframe:" ] }, { "cell_type": "code", "execution_count": 29, "id": "2bd9d50a-5a04-4e65-8ac5-352f0dd64065", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ], "text/plain": [ " soma_joinid var_id\n", "0 2 ATAD3C" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "var.read(column_names=[\"soma_joinid\", \"var_id\"], value_filter=\"var_id == 'ATAD3C'\").concat().to_pandas()" ] }, { "cell_type": "markdown", "id": "ba1a771d-426c-4dce-b27d-68f552cff383", "metadata": { "tags": [] }, "source": [ "To verify the raw counts, we need to move to the `raw` measurement, which can be found in the experiment's `ms` collection:" ] }, { "cell_type": "code", "execution_count": 30, "id": "879435ff-b38c-4da0-8459-1fa55e3a60ba", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.ms[\"raw\"]" ] }, { "cell_type": "markdown", "id": "febd089f-731f-4052-a423-4428a9291616", "metadata": { "tags": [] }, "source": [ "Next, let's look up the same gene in the raw `var` dataframe:" ] }, { "cell_type": "code", "execution_count": 31, "id": "f09b47e0-8501-4a1b-a734-6087197f9272", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ], "text/plain": [ " soma_joinid var_id\n", "0 30 ATAD3C" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_var = experiment[\"ms\"][\"raw\"].var\n", "raw_var.read(column_names=[\"soma_joinid\", \"var_id\"], value_filter=\"var_id == 'ATAD3C'\").concat().to_pandas()" ] }, { "cell_type": "markdown", "id": "a2e06508-610f-48cb-9abc-0ce25089ddcc", "metadata": {}, "source": [ "Note the `soma_joinid` column. It can be used to join related SOMA objects in the experiment. In this case, it indexes the second dimension of the `raw.X` matrix. Therefore, we just need to slice along that dimension, convert the result to a scipy sparse matrix, and count the nonzero entries:" ] }, { "cell_type": "code", "execution_count": 32, "id": "924f8aa7-e8b8-4fda-b513-34442d482dec", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "8" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_raw = experiment[\"ms\"][\"raw\"].X[\"data\"]\n", "X_raw.read((slice(None), [30])).coos().concat().to_scipy().nnz" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 5 }