{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d03b9481-43e7-4f16-8141-4c0ab305ec74",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Tutorial: Reading SOMA Objects"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2683d7ec-03f8-4a34-9403-4394420cd29c",
   "metadata": {},
   "source": [
    "In this notebook we'll learn how to read from various SOMA objects. We will assume familiarity with SOMA objects already, so it is recommended to go through the [Tutorial: SOMA Objects](https://github.com/single-cell-data/TileDB-SOMA/blob/main/apis/python/notebooks/tutorial_soma_objects.ipynb) before."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a66a7b99-6e25-4f46-9555-41edbc7fb3ee",
   "metadata": {
    "tags": []
   },
   "source": [
    "This implementation of SOMA relies on [TileDB](https://tiledb.com/), which is a storage format that allows working with large files without having to fully load them in memory. Files can be either read from disk or from a remote source, like an S3 bucket. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9cbeef30-f87a-4d59-bee1-6a1dc865aefb",
   "metadata": {
    "tags": []
   },
   "source": [
    "The core feature of SOMA is to allow reading _subsets_ of the data using slices: only the portion of required data is read from disk/network.\n",
    "SOMA uses [Apache Arrow](https://arrow.apache.org/) as an intermediate in-memory storage. From here, the slices can be further converted into more familiar formats, like a scipy.sparse matrix or a numpy ndarray. Consult the [Python bindings for Apache Arrow documentation](https://arrow.apache.org/docs/python/index.html) for more information."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3fef57da-e665-4990-aebd-a89596031935",
   "metadata": {
    "tags": []
   },
   "source": [
    "In this notebook, we will use the Peripheral Blood Mononuclear Cells (PBMC) dataset. We will focus on reading from its `obs` `DataFrame` and from the `X` `SparseNDArray`. This is a small dataset that can fit in memory, but we'll focus on operations that work on subsets of data that will work on larger datasets as well."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42228ff1-4660-4dd6-b627-75b54e6abcb8",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Reading a DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28401793-71c1-4d1c-ac9a-2fe255d8821d",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Introduction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f843b57e-efd1-4a27-9778-c8b2c1aaa686",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import tiledbsoma"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "18d5412e-5bae-4706-bb79-2692635190ce",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import tarfile\n",
    "import tempfile\n",
    "\n",
    "sparse_uri = tempfile.mktemp()\n",
    "with tarfile.open(\"data/pbmc3k-sparse.tgz\") as handle:\n",
    "    handle.extractall(sparse_uri)\n",
    "experiment = tiledbsoma.Experiment.open(sparse_uri)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "566b4df7-b26a-487b-8d3f-8616bd84a23c",
   "metadata": {
    "tags": []
   },
   "source": [
    "All read operations need to be performed using the `.read()` method. For a `DataFrame`, we want to then call `.concat()` to obtain a [PyArrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c9f445b5-6fee-40a7-9ba8-37c9a72efb2f",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pyarrow.Table\n",
       "soma_joinid: int64\n",
       "obs_id: large_string\n",
       "n_genes: int64\n",
       "percent_mito: float\n",
       "n_counts: float\n",
       "louvain: dictionary<values=string, indices=int32, ordered=0>\n",
       "----\n",
       "soma_joinid: [[0,1,2,3,4,...,2633,2634,2635,2636,2637]]\n",
       "obs_id: [[\"AAACATACAACCAC-1\",\"AAACATTGAGCTAC-1\",\"AAACATTGATCAGC-1\",\"AAACCGTGCTTCCG-1\",\"AAACCGTGTATGCG-1\",...,\"TTTCGAACTCTCAT-1\",\"TTTCTACTGAGGCA-1\",\"TTTCTACTTCCTCG-1\",\"TTTGCATGAGAGGC-1\",\"TTTGCATGCCTCAC-1\"]]\n",
       "n_genes: [[781,1352,1131,960,522,...,1155,1227,622,454,724]]\n",
       "percent_mito: [[0.030177759,0.037935957,0.008897362,0.017430846,0.012244898,...,0.021104366,0.00929422,0.021971496,0.020547945,0.008064516]]\n",
       "n_counts: [[2419,4903,3147,2639,980,...,3459,3443,1684,1022,1984]]\n",
       "louvain: [  -- dictionary:\n",
       "[\"CD4 T cells\",\"CD14+ Monocytes\",\"B cells\",\"CD8 T cells\",\"NK cells\",\"FCGR3A+ Monocytes\",\"Dendritic cells\",\"Megakaryocytes\"]  -- indices:\n",
       "[0,2,0,1,4,...,1,2,2,2,0]]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obs = experiment.obs\n",
    "table = obs.read().concat()\n",
    "table"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e1005cc8-289f-470d-ad60-0ee0975b3fe5",
   "metadata": {
    "tags": []
   },
   "source": [
    "From here, we can directly use any of the PyArrow Table methods, for instance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "11291dbd-3272-4c84-bed5-4e1b67f408b9",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pyarrow.Table\n",
       "soma_joinid: int64\n",
       "obs_id: large_string\n",
       "n_genes: int64\n",
       "percent_mito: float\n",
       "n_counts: float\n",
       "louvain: dictionary<values=string, indices=int32, ordered=0>\n",
       "----\n",
       "soma_joinid: [[270,1163,1891,926,277,...,2186,1522,662,1288,1840]]\n",
       "obs_id: [[\"ACGAACTGGCTATG-1\",\"CGATACGACAGGAG-1\",\"GGGCCAACCTTGGA-1\",\"CAGGTTGAGGATCT-1\",\"ACGAGGGACAGGAG-1\",...,\"TAGTCTTGGCTGTA-1\",\"GACGCTCTCTCTCG-1\",\"ATCTCAACCTCGAA-1\",\"CTAATAGAGCTATG-1\",\"GGCATATGGGGAGT-1\"]]\n",
       "n_genes: [[2455,2033,2020,2000,1997,...,270,267,246,239,212]]\n",
       "percent_mito: [[0.015774649,0.022166021,0.010576352,0.026962927,0.014631685,...,0,0.032258064,0,0.0016666667,0.012173913]]\n",
       "n_counts: [[8875,6722,8415,8011,7928,...,652,682,609,600,575]]\n",
       "louvain: [  -- dictionary:\n",
       "[\"CD4 T cells\",\"CD14+ Monocytes\",\"B cells\",\"CD8 T cells\",\"NK cells\",\"FCGR3A+ Monocytes\",\"Dendritic cells\",\"Megakaryocytes\"]  -- indices:\n",
       "[7,0,6,2,6,...,0,7,0,0,7]]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table.sort_by([(\"n_genes\", \"descending\")])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe90e126-ca37-4444-acdf-c7120fe2bea8",
   "metadata": {
    "tags": []
   },
   "source": [
    "Alternatively, we can convert the `DataFrame` to a different format, like a [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "f4073542-da95-4158-97c7-c1dd442de930",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>obs_id</th>\n",
       "      <th>n_genes</th>\n",
       "      <th>percent_mito</th>\n",
       "      <th>n_counts</th>\n",
       "      <th>louvain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>AAACATACAACCAC-1</td>\n",
       "      <td>781</td>\n",
       "      <td>0.030178</td>\n",
       "      <td>2419.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>AAACATTGAGCTAC-1</td>\n",
       "      <td>1352</td>\n",
       "      <td>0.037936</td>\n",
       "      <td>4903.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>AAACATTGATCAGC-1</td>\n",
       "      <td>1131</td>\n",
       "      <td>0.008897</td>\n",
       "      <td>3147.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>AAACCGTGCTTCCG-1</td>\n",
       "      <td>960</td>\n",
       "      <td>0.017431</td>\n",
       "      <td>2639.0</td>\n",
       "      <td>CD14+ Monocytes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>AAACCGTGTATGCG-1</td>\n",
       "      <td>522</td>\n",
       "      <td>0.012245</td>\n",
       "      <td>980.0</td>\n",
       "      <td>NK cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2633</th>\n",
       "      <td>2633</td>\n",
       "      <td>TTTCGAACTCTCAT-1</td>\n",
       "      <td>1155</td>\n",
       "      <td>0.021104</td>\n",
       "      <td>3459.0</td>\n",
       "      <td>CD14+ Monocytes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2634</th>\n",
       "      <td>2634</td>\n",
       "      <td>TTTCTACTGAGGCA-1</td>\n",
       "      <td>1227</td>\n",
       "      <td>0.009294</td>\n",
       "      <td>3443.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2635</th>\n",
       "      <td>2635</td>\n",
       "      <td>TTTCTACTTCCTCG-1</td>\n",
       "      <td>622</td>\n",
       "      <td>0.021971</td>\n",
       "      <td>1684.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2636</th>\n",
       "      <td>2636</td>\n",
       "      <td>TTTGCATGAGAGGC-1</td>\n",
       "      <td>454</td>\n",
       "      <td>0.020548</td>\n",
       "      <td>1022.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2637</th>\n",
       "      <td>2637</td>\n",
       "      <td>TTTGCATGCCTCAC-1</td>\n",
       "      <td>724</td>\n",
       "      <td>0.008065</td>\n",
       "      <td>1984.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2638 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      soma_joinid            obs_id  n_genes  percent_mito  n_counts  \\\n",
       "0               0  AAACATACAACCAC-1      781      0.030178    2419.0   \n",
       "1               1  AAACATTGAGCTAC-1     1352      0.037936    4903.0   \n",
       "2               2  AAACATTGATCAGC-1     1131      0.008897    3147.0   \n",
       "3               3  AAACCGTGCTTCCG-1      960      0.017431    2639.0   \n",
       "4               4  AAACCGTGTATGCG-1      522      0.012245     980.0   \n",
       "...           ...               ...      ...           ...       ...   \n",
       "2633         2633  TTTCGAACTCTCAT-1     1155      0.021104    3459.0   \n",
       "2634         2634  TTTCTACTGAGGCA-1     1227      0.009294    3443.0   \n",
       "2635         2635  TTTCTACTTCCTCG-1      622      0.021971    1684.0   \n",
       "2636         2636  TTTGCATGAGAGGC-1      454      0.020548    1022.0   \n",
       "2637         2637  TTTGCATGCCTCAC-1      724      0.008065    1984.0   \n",
       "\n",
       "              louvain  \n",
       "0         CD4 T cells  \n",
       "1             B cells  \n",
       "2         CD4 T cells  \n",
       "3     CD14+ Monocytes  \n",
       "4            NK cells  \n",
       "...               ...  \n",
       "2633  CD14+ Monocytes  \n",
       "2634          B cells  \n",
       "2635          B cells  \n",
       "2636          B cells  \n",
       "2637      CD4 T cells  \n",
       "\n",
       "[2638 rows x 6 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table.to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54832ad3-b548-4c62-9c1c-1edd14c002b3",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Reading slices of data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6cab4fe3-76ac-4507-94e5-bc038b35a1cd",
   "metadata": {
    "tags": []
   },
   "source": [
    "As previously mentioned, the core feature of SOMA is reading slices of the data without fetching the whole dataset in memory. To do that, the `.read()` method supports a `coords` parameter that allows data slicing. \n",
    "\n",
    "Before we do that, let's take a look at the schema of the `obs` dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "83a6965d-441a-473c-8307-fda71a68ed11",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "soma_joinid: int64 not null\n",
       "obs_id: large_string\n",
       "n_genes: int64\n",
       "percent_mito: float\n",
       "n_counts: float\n",
       "louvain: dictionary<values=string, indices=int32, ordered=0>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obs.schema"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc5acf67-36c3-40e0-8aa5-2a9cde7c69b8",
   "metadata": {
    "tags": []
   },
   "source": [
    "And also its domain:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "92fe8e3b-13ad-4922-9bbd-e9fb4600f56e",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((0, 2637),)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obs.domain"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72d5ccfc-82b7-44d5-af80-e6ac6c2a135b",
   "metadata": {},
   "source": [
    "With a SOMA DataFrame, you can only slice across an indexed column, so let's look at the indexed columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "e7e79037-9494-4b0b-a898-d9364fa1758b",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('soma_joinid',)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obs.index_column_names"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25095993-9343-4483-9c86-5ff3c4a40df6",
   "metadata": {},
   "source": [
    "In this case our index consists of just `soma_joinid`, which is an integer column that can be used to join other SOMA objects in the same experiment. \n",
    "\n",
    "\n",
    "Let's look at a few ways to slice the dataframe."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f6245b69-da9e-4d1f-971c-b86e3a2b69aa",
   "metadata": {
    "tags": []
   },
   "source": [
    "#### Select a single row"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "042da676-1dc7-4916-bccd-dbcf9e753de8",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>obs_id</th>\n",
       "      <th>n_genes</th>\n",
       "      <th>percent_mito</th>\n",
       "      <th>n_counts</th>\n",
       "      <th>louvain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>AAACATACAACCAC-1</td>\n",
       "      <td>781</td>\n",
       "      <td>0.030178</td>\n",
       "      <td>2419.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   soma_joinid            obs_id  n_genes  percent_mito  n_counts      louvain\n",
       "0            0  AAACATACAACCAC-1      781      0.030178    2419.0  CD4 T cells"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obs.read([[0]]).concat().to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c17e4c73-fca6-440f-b0a4-343e36c604f5",
   "metadata": {},
   "source": [
    "#### Select multiple, non contiguous rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "d77e6897-7940-49fe-a269-548171336fc9",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>obs_id</th>\n",
       "      <th>n_genes</th>\n",
       "      <th>percent_mito</th>\n",
       "      <th>n_counts</th>\n",
       "      <th>louvain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>AAACATTGATCAGC-1</td>\n",
       "      <td>1131</td>\n",
       "      <td>0.008897</td>\n",
       "      <td>3147.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5</td>\n",
       "      <td>AAACGCACTGGTAC-1</td>\n",
       "      <td>782</td>\n",
       "      <td>0.016644</td>\n",
       "      <td>2163.0</td>\n",
       "      <td>CD8 T cells</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   soma_joinid            obs_id  n_genes  percent_mito  n_counts      louvain\n",
       "0            2  AAACATTGATCAGC-1     1131      0.008897    3147.0  CD4 T cells\n",
       "1            5  AAACGCACTGGTAC-1      782      0.016644    2163.0  CD8 T cells"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obs.read([[2, 5]]).concat().to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33b6cf15-5005-4aef-9212-cdd34be8a9ba",
   "metadata": {
    "tags": []
   },
   "source": [
    "#### Select a slice of rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "dfba21c5-504c-4644-ad53-7cf7865ebf31",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>obs_id</th>\n",
       "      <th>n_genes</th>\n",
       "      <th>percent_mito</th>\n",
       "      <th>n_counts</th>\n",
       "      <th>louvain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>AAACATACAACCAC-1</td>\n",
       "      <td>781</td>\n",
       "      <td>0.030178</td>\n",
       "      <td>2419.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>AAACATTGAGCTAC-1</td>\n",
       "      <td>1352</td>\n",
       "      <td>0.037936</td>\n",
       "      <td>4903.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>AAACATTGATCAGC-1</td>\n",
       "      <td>1131</td>\n",
       "      <td>0.008897</td>\n",
       "      <td>3147.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>AAACCGTGCTTCCG-1</td>\n",
       "      <td>960</td>\n",
       "      <td>0.017431</td>\n",
       "      <td>2639.0</td>\n",
       "      <td>CD14+ Monocytes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>AAACCGTGTATGCG-1</td>\n",
       "      <td>522</td>\n",
       "      <td>0.012245</td>\n",
       "      <td>980.0</td>\n",
       "      <td>NK cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>AAACGCACTGGTAC-1</td>\n",
       "      <td>782</td>\n",
       "      <td>0.016644</td>\n",
       "      <td>2163.0</td>\n",
       "      <td>CD8 T cells</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   soma_joinid            obs_id  n_genes  percent_mito  n_counts  \\\n",
       "0            0  AAACATACAACCAC-1      781      0.030178    2419.0   \n",
       "1            1  AAACATTGAGCTAC-1     1352      0.037936    4903.0   \n",
       "2            2  AAACATTGATCAGC-1     1131      0.008897    3147.0   \n",
       "3            3  AAACCGTGCTTCCG-1      960      0.017431    2639.0   \n",
       "4            4  AAACCGTGTATGCG-1      522      0.012245     980.0   \n",
       "5            5  AAACGCACTGGTAC-1      782      0.016644    2163.0   \n",
       "\n",
       "           louvain  \n",
       "0      CD4 T cells  \n",
       "1          B cells  \n",
       "2      CD4 T cells  \n",
       "3  CD14+ Monocytes  \n",
       "4         NK cells  \n",
       "5      CD8 T cells  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obs.read([slice(0, 5)]).concat().to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50f00bbf-6bd6-4279-9e92-b4bf544c4702",
   "metadata": {
    "tags": []
   },
   "source": [
    "#### Select a subset of columns only"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "a91e68e7-5c06-4241-b4ee-42a59794e520",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>obs_id</th>\n",
       "      <th>louvain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AAACATACAACCAC-1</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AAACATTGAGCTAC-1</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AAACATTGATCAGC-1</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AAACCGTGCTTCCG-1</td>\n",
       "      <td>CD14+ Monocytes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AAACCGTGTATGCG-1</td>\n",
       "      <td>NK cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>AAACGCACTGGTAC-1</td>\n",
       "      <td>CD8 T cells</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             obs_id          louvain\n",
       "0  AAACATACAACCAC-1      CD4 T cells\n",
       "1  AAACATTGAGCTAC-1          B cells\n",
       "2  AAACATTGATCAGC-1      CD4 T cells\n",
       "3  AAACCGTGCTTCCG-1  CD14+ Monocytes\n",
       "4  AAACCGTGTATGCG-1         NK cells\n",
       "5  AAACGCACTGGTAC-1      CD8 T cells"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obs.read([slice(0, 5)], column_names=[\"obs_id\", \"louvain\"]).concat().to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bc1212e-ac25-4278-a14a-b0b876337ff6",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Filter data using complex queries"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d60a0981-38d6-4460-a4b3-d3431ce43b40",
   "metadata": {
    "tags": []
   },
   "source": [
    "SOMA also allows to filter data using more complex queries. For a more detailed reference, take a look at the [query condition](https://github.com/single-cell-data/TileDB-SOMA/blob/main/apis/python/src/tiledbsoma/_query_condition.py) source code.\n",
    "\n",
    "Here are a few examples:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "266a6574-b1f3-41f0-94a3-1ccfec8b82af",
   "metadata": {
    "tags": []
   },
   "source": [
    "#### Filter all cells with a Louvain categorization of \"B cells\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "77feed71-7ac9-44d5-af35-c37601067092",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>obs_id</th>\n",
       "      <th>n_genes</th>\n",
       "      <th>percent_mito</th>\n",
       "      <th>n_counts</th>\n",
       "      <th>louvain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>AAACATTGAGCTAC-1</td>\n",
       "      <td>1352</td>\n",
       "      <td>0.037936</td>\n",
       "      <td>4903.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>10</td>\n",
       "      <td>AAACTTGAAAAACG-1</td>\n",
       "      <td>1116</td>\n",
       "      <td>0.026316</td>\n",
       "      <td>3914.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>18</td>\n",
       "      <td>AAAGGCCTGTCTAG-1</td>\n",
       "      <td>1446</td>\n",
       "      <td>0.015283</td>\n",
       "      <td>4973.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>19</td>\n",
       "      <td>AAAGTTTGATCACG-1</td>\n",
       "      <td>446</td>\n",
       "      <td>0.034700</td>\n",
       "      <td>1268.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>20</td>\n",
       "      <td>AAAGTTTGGGGTGA-1</td>\n",
       "      <td>1020</td>\n",
       "      <td>0.025907</td>\n",
       "      <td>3281.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>337</th>\n",
       "      <td>2628</td>\n",
       "      <td>TTTCAGTGTCACGA-1</td>\n",
       "      <td>700</td>\n",
       "      <td>0.034314</td>\n",
       "      <td>1632.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>338</th>\n",
       "      <td>2630</td>\n",
       "      <td>TTTCAGTGTGCAGT-1</td>\n",
       "      <td>637</td>\n",
       "      <td>0.018925</td>\n",
       "      <td>1321.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>339</th>\n",
       "      <td>2634</td>\n",
       "      <td>TTTCTACTGAGGCA-1</td>\n",
       "      <td>1227</td>\n",
       "      <td>0.009294</td>\n",
       "      <td>3443.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>340</th>\n",
       "      <td>2635</td>\n",
       "      <td>TTTCTACTTCCTCG-1</td>\n",
       "      <td>622</td>\n",
       "      <td>0.021971</td>\n",
       "      <td>1684.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>341</th>\n",
       "      <td>2636</td>\n",
       "      <td>TTTGCATGAGAGGC-1</td>\n",
       "      <td>454</td>\n",
       "      <td>0.020548</td>\n",
       "      <td>1022.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>342 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     soma_joinid            obs_id  n_genes  percent_mito  n_counts  louvain\n",
       "0              1  AAACATTGAGCTAC-1     1352      0.037936    4903.0  B cells\n",
       "1             10  AAACTTGAAAAACG-1     1116      0.026316    3914.0  B cells\n",
       "2             18  AAAGGCCTGTCTAG-1     1446      0.015283    4973.0  B cells\n",
       "3             19  AAAGTTTGATCACG-1      446      0.034700    1268.0  B cells\n",
       "4             20  AAAGTTTGGGGTGA-1     1020      0.025907    3281.0  B cells\n",
       "..           ...               ...      ...           ...       ...      ...\n",
       "337         2628  TTTCAGTGTCACGA-1      700      0.034314    1632.0  B cells\n",
       "338         2630  TTTCAGTGTGCAGT-1      637      0.018925    1321.0  B cells\n",
       "339         2634  TTTCTACTGAGGCA-1     1227      0.009294    3443.0  B cells\n",
       "340         2635  TTTCTACTTCCTCG-1      622      0.021971    1684.0  B cells\n",
       "341         2636  TTTGCATGAGAGGC-1      454      0.020548    1022.0  B cells\n",
       "\n",
       "[342 rows x 6 columns]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obs.read(value_filter=\"louvain == 'B cells'\").concat().to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "751027d9-2989-47db-9531-7e3e706de942",
   "metadata": {
    "tags": []
   },
   "source": [
    "#### Filter all cells with a Louvain categorization of either \"CD4 T cells\" or \"CD8 T cells\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "157d42e0-89d8-4ab2-a4c9-8c9f0f16f943",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>obs_id</th>\n",
       "      <th>n_genes</th>\n",
       "      <th>percent_mito</th>\n",
       "      <th>n_counts</th>\n",
       "      <th>louvain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>AAACATACAACCAC-1</td>\n",
       "      <td>781</td>\n",
       "      <td>0.030178</td>\n",
       "      <td>2419.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>AAACATTGATCAGC-1</td>\n",
       "      <td>1131</td>\n",
       "      <td>0.008897</td>\n",
       "      <td>3147.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5</td>\n",
       "      <td>AAACGCACTGGTAC-1</td>\n",
       "      <td>782</td>\n",
       "      <td>0.016644</td>\n",
       "      <td>2163.0</td>\n",
       "      <td>CD8 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>6</td>\n",
       "      <td>AAACGCTGACCAGT-1</td>\n",
       "      <td>783</td>\n",
       "      <td>0.038161</td>\n",
       "      <td>2175.0</td>\n",
       "      <td>CD8 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>7</td>\n",
       "      <td>AAACGCTGGTTCTT-1</td>\n",
       "      <td>790</td>\n",
       "      <td>0.030973</td>\n",
       "      <td>2260.0</td>\n",
       "      <td>CD8 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1455</th>\n",
       "      <td>2621</td>\n",
       "      <td>TTTAGCTGATACCG-1</td>\n",
       "      <td>887</td>\n",
       "      <td>0.022876</td>\n",
       "      <td>2754.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1456</th>\n",
       "      <td>2626</td>\n",
       "      <td>TTTCACGAGGTTCA-1</td>\n",
       "      <td>721</td>\n",
       "      <td>0.013261</td>\n",
       "      <td>2036.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1457</th>\n",
       "      <td>2627</td>\n",
       "      <td>TTTCAGTGGAAGGC-1</td>\n",
       "      <td>692</td>\n",
       "      <td>0.015169</td>\n",
       "      <td>1780.0</td>\n",
       "      <td>CD8 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1458</th>\n",
       "      <td>2631</td>\n",
       "      <td>TTTCCAGAGGTGAG-1</td>\n",
       "      <td>873</td>\n",
       "      <td>0.006859</td>\n",
       "      <td>2187.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1459</th>\n",
       "      <td>2637</td>\n",
       "      <td>TTTGCATGCCTCAC-1</td>\n",
       "      <td>724</td>\n",
       "      <td>0.008065</td>\n",
       "      <td>1984.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1460 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      soma_joinid            obs_id  n_genes  percent_mito  n_counts  \\\n",
       "0               0  AAACATACAACCAC-1      781      0.030178    2419.0   \n",
       "1               2  AAACATTGATCAGC-1     1131      0.008897    3147.0   \n",
       "2               5  AAACGCACTGGTAC-1      782      0.016644    2163.0   \n",
       "3               6  AAACGCTGACCAGT-1      783      0.038161    2175.0   \n",
       "4               7  AAACGCTGGTTCTT-1      790      0.030973    2260.0   \n",
       "...           ...               ...      ...           ...       ...   \n",
       "1455         2621  TTTAGCTGATACCG-1      887      0.022876    2754.0   \n",
       "1456         2626  TTTCACGAGGTTCA-1      721      0.013261    2036.0   \n",
       "1457         2627  TTTCAGTGGAAGGC-1      692      0.015169    1780.0   \n",
       "1458         2631  TTTCCAGAGGTGAG-1      873      0.006859    2187.0   \n",
       "1459         2637  TTTGCATGCCTCAC-1      724      0.008065    1984.0   \n",
       "\n",
       "          louvain  \n",
       "0     CD4 T cells  \n",
       "1     CD4 T cells  \n",
       "2     CD8 T cells  \n",
       "3     CD8 T cells  \n",
       "4     CD8 T cells  \n",
       "...           ...  \n",
       "1455  CD4 T cells  \n",
       "1456  CD4 T cells  \n",
       "1457  CD8 T cells  \n",
       "1458  CD4 T cells  \n",
       "1459  CD4 T cells  \n",
       "\n",
       "[1460 rows x 6 columns]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obs.read(value_filter=\"(louvain == 'CD4 T cells') or (louvain == 'CD8 T cells')\").concat().to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "518a77ab-49f1-47bb-8114-e6f62f32616d",
   "metadata": {
    "tags": []
   },
   "source": [
    "#### Filter all cells with a Louvain categorization of \"CD4 T cells\" and more than 1500 genes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "8bf9180b-5ebb-4a19-a602-b9495b33617f",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>obs_id</th>\n",
       "      <th>n_genes</th>\n",
       "      <th>percent_mito</th>\n",
       "      <th>n_counts</th>\n",
       "      <th>louvain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>26</td>\n",
       "      <td>AAATCAACCCTATT-1</td>\n",
       "      <td>1545</td>\n",
       "      <td>0.024313</td>\n",
       "      <td>5676.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>357</td>\n",
       "      <td>ACTCTCCTGCATAC-1</td>\n",
       "      <td>1750</td>\n",
       "      <td>0.017436</td>\n",
       "      <td>5850.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>473</td>\n",
       "      <td>AGCTGCCTTTCATC-1</td>\n",
       "      <td>1703</td>\n",
       "      <td>0.029547</td>\n",
       "      <td>5212.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>945</td>\n",
       "      <td>CATACTTGGGTTAC-1</td>\n",
       "      <td>1938</td>\n",
       "      <td>0.023580</td>\n",
       "      <td>7167.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1163</td>\n",
       "      <td>CGATACGACAGGAG-1</td>\n",
       "      <td>2033</td>\n",
       "      <td>0.022166</td>\n",
       "      <td>6722.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1320</td>\n",
       "      <td>CTATACTGTTCGTT-1</td>\n",
       "      <td>1543</td>\n",
       "      <td>0.012395</td>\n",
       "      <td>4760.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>1548</td>\n",
       "      <td>GAGCATACTTTGCT-1</td>\n",
       "      <td>1753</td>\n",
       "      <td>0.016739</td>\n",
       "      <td>6691.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>1993</td>\n",
       "      <td>GTGATGACAAGTGA-1</td>\n",
       "      <td>1819</td>\n",
       "      <td>0.021172</td>\n",
       "      <td>6329.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2313</td>\n",
       "      <td>TCGGACCTGTACAC-1</td>\n",
       "      <td>1567</td>\n",
       "      <td>0.014288</td>\n",
       "      <td>5599.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2365</td>\n",
       "      <td>TGAGACACAAGGTA-1</td>\n",
       "      <td>1549</td>\n",
       "      <td>0.013242</td>\n",
       "      <td>5135.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   soma_joinid            obs_id  n_genes  percent_mito  n_counts      louvain\n",
       "0           26  AAATCAACCCTATT-1     1545      0.024313    5676.0  CD4 T cells\n",
       "1          357  ACTCTCCTGCATAC-1     1750      0.017436    5850.0  CD4 T cells\n",
       "2          473  AGCTGCCTTTCATC-1     1703      0.029547    5212.0  CD4 T cells\n",
       "3          945  CATACTTGGGTTAC-1     1938      0.023580    7167.0  CD4 T cells\n",
       "4         1163  CGATACGACAGGAG-1     2033      0.022166    6722.0  CD4 T cells\n",
       "5         1320  CTATACTGTTCGTT-1     1543      0.012395    4760.0  CD4 T cells\n",
       "6         1548  GAGCATACTTTGCT-1     1753      0.016739    6691.0  CD4 T cells\n",
       "7         1993  GTGATGACAAGTGA-1     1819      0.021172    6329.0  CD4 T cells\n",
       "8         2313  TCGGACCTGTACAC-1     1567      0.014288    5599.0  CD4 T cells\n",
       "9         2365  TGAGACACAAGGTA-1     1549      0.013242    5135.0  CD4 T cells"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obs.read(value_filter=\"(louvain == 'CD4 T cells') and (n_genes > 1500)\").concat().to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa8abca5-0ba4-40a9-befb-e1d86f6bfd79",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Reading a SparseNDArray"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8032d56c-a472-4f35-b6db-cd03ee1e7fcd",
   "metadata": {},
   "source": [
    "For `SparseNDArray`, let's consider the X matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "f5dc0937-9022-4cb3-8ee7-73bdfe1f234d",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<SparseNDArray 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmpfe5a_4au/ms/RNA/X/data' (open for 'r')>"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X = experiment.ms[\"RNA\"].X[\"data\"]\n",
    "X"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e743e0d-25f1-4ddd-9ffa-936affce1fd8",
   "metadata": {
    "tags": []
   },
   "source": [
    "Similarly to `DataFrame`, we need to use the `.read()` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "b580744c-0e2a-4a65-b86b-6c2f1eb9b3ae",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tiledbsoma._sparse_nd_array.SparseNDArrayRead at 0x119ba4ad0>"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.read()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d64afe7-6fe9-4872-864a-88329042fe72",
   "metadata": {
    "tags": []
   },
   "source": [
    "In this case, we have two options. Let's start by converting this into an [Arrow SparseCOOTensor](https://arrow.apache.org/docs/cpp/api/tensor.html#sparse-tensors):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "1bbcc301-815e-482e-9558-5a1cd2e117c6",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<pyarrow.SparseCOOTensor>\n",
       "type: float\n",
       "shape: (2638, 1838)"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor = X.read().coos().concat()\n",
    "tensor"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c195a65-93ee-43ab-8eec-692c70d29512",
   "metadata": {
    "tags": []
   },
   "source": [
    "In this example, we obtain a 2-dimensional tensor (a matrix). Note that `shape` here indicates the _capacity_ of the tensor, rather than the actual size. \n",
    "\n",
    "By default, a `SparseNDArray` gets created with a much higher capacity to accommodate further writes. Since this is a read scenario, and the shape of the matrix is known, we can call `.coos()` with a parameter so that the `SparseNDArray` is resized accordingly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "cbeea976-f00c-47b6-b8d1-4ea1b31afd25",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<pyarrow.SparseCOOTensor>\n",
       "type: float\n",
       "shape: (2638, 1838)"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "n_obs = len(obs)\n",
    "n_var = len(experiment.ms[\"RNA\"].var)\n",
    "\n",
    "tensor = X.read().coos((n_obs, n_var)).concat()\n",
    "tensor"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df7d1739-ec49-41dc-8022-add924c2767c",
   "metadata": {},
   "source": [
    "We can convert this to a `scipy.sparse.coo_matrix`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "28d04c96-01a3-429e-a5a8-84ec8bca0453",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<COOrdinate sparse matrix of dtype 'float32'\n",
       "\twith 4848644 stored elements and shape (2638, 1838)>"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tensor.to_scipy()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e2210cf-35d2-49d4-a7d3-261c01469e5c",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Reading slices of data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b78cdfe0-6cfc-4183-9a74-023668b21208",
   "metadata": {},
   "source": [
    "Similarly to `DataFrame`, we can retrieve subsets of the data that can fit in memory. This is particularly important with `SparseNDArray`s since often those are several gigabytes. \n",
    "\n",
    "Unlike `DataFrame`s, `SparseNDArray`s are always indexed using an offset (zero-based) integer on each dimension. Therefore, if the array is N-dimensional, the `.read()` method can accept a n-tuple (or list) argument that specifies how to slice the array. An empty element or `slice(None)` means select all in that dimension.\n",
    "\n",
    "For example, here's how to fetch the first 5 rows of the matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "6f757ce0-d9dc-44bb-99f6-3be709edff0e",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<pyarrow.SparseCOOTensor>\n",
       "type: float\n",
       "shape: (2638, 1838)"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Y = X.read([slice(0, 5)]).coos().concat()\n",
    "Y "
   ]
  },
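  {
   "cell_type": "markdown",
   "id": "6b1e0d52-3c47-4e8a-9f10-2d7a5c8e41b3",
   "metadata": {},
   "source": [
    "One detail worth noting about the slice above: SOMA slices include both endpoints, so `slice(0, 5)` covers rows 0 through 5. This is why the CSR matrix built later in this section reports 11028 = 6 × 1838 stored elements. A minimal sketch of the difference from Python's half-open slices, in plain Python with no SOMA calls (`inclusive_indices` is a hypothetical helper written for this illustration):\n",
    "\n",
    "```python\n",
    "def inclusive_indices(s, dim_size):\n",
    "    '''Expand a doubly-inclusive slice into explicit indices.\n",
    "\n",
    "    Hypothetical helper for illustration only; it mimics a slice whose\n",
    "    stop value is included and is not part of any library.\n",
    "    '''\n",
    "    start = 0 if s.start is None else s.start\n",
    "    stop = dim_size - 1 if s.stop is None else s.stop\n",
    "    return list(range(start, stop + 1))\n",
    "\n",
    "\n",
    "n_obs, n_var = 2638, 1838  # dimensions of X in this dataset\n",
    "\n",
    "rows_inclusive = inclusive_indices(slice(0, 5), dim_size=n_obs)\n",
    "rows_python = list(range(n_obs))[slice(0, 5)]\n",
    "\n",
    "print(len(rows_inclusive), len(rows_python))  # inclusive vs. half-open\n",
    "print(len(rows_inclusive) * n_var)  # values in the inclusive selection\n",
    "```\n",
    "\n",
    "To read exactly five rows, the slice would therefore be `slice(0, 4)`."
   ]
  },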
  {
   "cell_type": "markdown",
   "id": "4f99d59e-8be4-4903-94ac-8c740caacac3",
   "metadata": {
    "tags": []
   },
   "source": [
    "Being only 5 rows, this slice can fit in memory even for bigger matrices than the one used in the example. Note that we can't simply materialize to a dense matrix since the shape is too big (running `Y.to_scipy().todense()` will raise an error), so we need to set bounding boxes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "32f21b63-83a7-45e3-a18b-f44f068b1697",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<pyarrow.SparseCOOTensor>\n",
       "type: float\n",
       "shape: (2638, 1838)"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Y = X.read((slice(0, 5),)).coos((n_obs, n_var)).concat()\n",
    "Y"
   ]
  },
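  {
   "cell_type": "markdown",
   "id": "c3a9f7e1-8d24-4b56-a07c-5e1f9b2d6c48",
   "metadata": {},
   "source": [
    "Before materializing a dense matrix, it can help to estimate how much memory it needs: the cost depends on the tensor's shape, not on how many rows were read. A rough sketch in plain Python, assuming 4 bytes per value (`float32`, as in the outputs in this notebook); `dense_nbytes` is a hypothetical helper, not a library function:\n",
    "\n",
    "```python\n",
    "from math import prod\n",
    "\n",
    "\n",
    "def dense_nbytes(shape, itemsize=4):\n",
    "    '''Bytes needed to hold a dense array of the given shape.\n",
    "\n",
    "    itemsize=4 assumes float32 values.\n",
    "    '''\n",
    "    return prod(shape) * itemsize\n",
    "\n",
    "\n",
    "full = dense_nbytes((2638, 1838))  # the shape reported for Y above\n",
    "six_rows = dense_nbytes((6, 1838))  # rows 0..5 from slice(0, 5), endpoints included\n",
    "\n",
    "print(f'{full / 2**20:.1f} MiB for the full shape')\n",
    "print(f'{six_rows / 2**10:.1f} KiB for the six rows that hold data')\n",
    "```\n",
    "\n",
    "For this dataset the full shape is small, but the same arithmetic shows why a tensor with a much larger shape cannot be converted to a dense matrix."
   ]
  },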
  {
   "cell_type": "markdown",
   "id": "7fee7237-aa55-408c-bee2-3cf4f6844831",
   "metadata": {
    "tags": []
   },
   "source": [
    "Now we can get a dense representation of it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "45874532-327a-46fa-ad27-7043bd80e8f9",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "matrix([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,\n",
       "         -0.2090951 , -0.5312034 ],\n",
       "        [-0.21458222, -0.37265295, -0.05480444, ..., -0.26684406,\n",
       "         -0.31314576, -0.5966544 ],\n",
       "        [-0.37688747, -0.2950843 , -0.0575275 , ..., -0.15865596,\n",
       "         -0.17087643,  1.379     ],\n",
       "        ...,\n",
       "        [ 0.        ,  0.        ,  0.        , ...,  0.        ,\n",
       "          0.        ,  0.        ],\n",
       "        [ 0.        ,  0.        ,  0.        , ...,  0.        ,\n",
       "          0.        ,  0.        ],\n",
       "        [ 0.        ,  0.        ,  0.        , ...,  0.        ,\n",
       "          0.        ,  0.        ]], dtype=float32)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Y.to_scipy().todense()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72f42cb0-e7f6-45dc-a56a-b5b8c4c068b1",
   "metadata": {},
   "source": [
    "Alternatively, we can convert it to a `scipy.sparse.csr_matrix` which allows to select specific rows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "972de6ca-19a7-49f0-8fe9-7ece199cad1d",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Compressed Sparse Row sparse matrix of dtype 'float32'\n",
       "\twith 11028 stored elements and shape (2638, 1838)>"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Z = Y.to_scipy().tocsr()\n",
    "Z"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "10c5f21e-5ec2-44ee-bae8-302d5f874a41",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Compressed Sparse Row sparse matrix of dtype 'float32'\n",
       "\twith 1838 stored elements and shape (1, 1838)>"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Z.getrow(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e24d636-dd66-4378-a652-d6b5086c76b1",
   "metadata": {},
   "source": [
    "Similarly, we can slice the original `SparseNDArray` using single rows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "10856361-4b6e-476b-9938-a52ef50e6db1",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "matrix([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,\n",
       "         -0.2090951 , -0.5312034 ],\n",
       "        [ 0.        ,  0.        ,  0.        , ...,  0.        ,\n",
       "          0.        ,  0.        ],\n",
       "        [-0.37688747, -0.2950843 , -0.0575275 , ..., -0.15865596,\n",
       "         -0.17087643,  1.379     ],\n",
       "        ...,\n",
       "        [ 0.        ,  0.        ,  0.        , ...,  0.        ,\n",
       "          0.        ,  0.        ],\n",
       "        [ 0.        ,  0.        ,  0.        , ...,  0.        ,\n",
       "          0.        ,  0.        ],\n",
       "        [ 0.        ,  0.        ,  0.        , ...,  0.        ,\n",
       "          0.        ,  0.        ]], dtype=float32)"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.read([[0,2]]).coos((n_obs, n_var)).concat().to_scipy().todense()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c5eba56-de07-4025-8ccc-3dd4f4fedb90",
   "metadata": {},
   "source": [
    "The same approach can be used to filter across all the dimensions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "45f2aa69-238e-478f-8c3a-9d4c90f3e507",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>obs_id</th>\n",
       "      <th>n_genes</th>\n",
       "      <th>percent_mito</th>\n",
       "      <th>n_counts</th>\n",
       "      <th>louvain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>AAACATACAACCAC-1</td>\n",
       "      <td>781</td>\n",
       "      <td>0.030178</td>\n",
       "      <td>2419.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>AAACATTGAGCTAC-1</td>\n",
       "      <td>1352</td>\n",
       "      <td>0.037936</td>\n",
       "      <td>4903.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>AAACATTGATCAGC-1</td>\n",
       "      <td>1131</td>\n",
       "      <td>0.008897</td>\n",
       "      <td>3147.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>AAACCGTGCTTCCG-1</td>\n",
       "      <td>960</td>\n",
       "      <td>0.017431</td>\n",
       "      <td>2639.0</td>\n",
       "      <td>CD14+ Monocytes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>AAACCGTGTATGCG-1</td>\n",
       "      <td>522</td>\n",
       "      <td>0.012245</td>\n",
       "      <td>980.0</td>\n",
       "      <td>NK cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2633</th>\n",
       "      <td>2633</td>\n",
       "      <td>TTTCGAACTCTCAT-1</td>\n",
       "      <td>1155</td>\n",
       "      <td>0.021104</td>\n",
       "      <td>3459.0</td>\n",
       "      <td>CD14+ Monocytes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2634</th>\n",
       "      <td>2634</td>\n",
       "      <td>TTTCTACTGAGGCA-1</td>\n",
       "      <td>1227</td>\n",
       "      <td>0.009294</td>\n",
       "      <td>3443.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2635</th>\n",
       "      <td>2635</td>\n",
       "      <td>TTTCTACTTCCTCG-1</td>\n",
       "      <td>622</td>\n",
       "      <td>0.021971</td>\n",
       "      <td>1684.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2636</th>\n",
       "      <td>2636</td>\n",
       "      <td>TTTGCATGAGAGGC-1</td>\n",
       "      <td>454</td>\n",
       "      <td>0.020548</td>\n",
       "      <td>1022.0</td>\n",
       "      <td>B cells</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2637</th>\n",
       "      <td>2637</td>\n",
       "      <td>TTTGCATGCCTCAC-1</td>\n",
       "      <td>724</td>\n",
       "      <td>0.008065</td>\n",
       "      <td>1984.0</td>\n",
       "      <td>CD4 T cells</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2638 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      soma_joinid            obs_id  n_genes  percent_mito  n_counts  \\\n",
       "0               0  AAACATACAACCAC-1      781      0.030178    2419.0   \n",
       "1               1  AAACATTGAGCTAC-1     1352      0.037936    4903.0   \n",
       "2               2  AAACATTGATCAGC-1     1131      0.008897    3147.0   \n",
       "3               3  AAACCGTGCTTCCG-1      960      0.017431    2639.0   \n",
       "4               4  AAACCGTGTATGCG-1      522      0.012245     980.0   \n",
       "...           ...               ...      ...           ...       ...   \n",
       "2633         2633  TTTCGAACTCTCAT-1     1155      0.021104    3459.0   \n",
       "2634         2634  TTTCTACTGAGGCA-1     1227      0.009294    3443.0   \n",
       "2635         2635  TTTCTACTTCCTCG-1      622      0.021971    1684.0   \n",
       "2636         2636  TTTGCATGAGAGGC-1      454      0.020548    1022.0   \n",
       "2637         2637  TTTGCATGCCTCAC-1      724      0.008065    1984.0   \n",
       "\n",
       "              louvain  \n",
       "0         CD4 T cells  \n",
       "1             B cells  \n",
       "2         CD4 T cells  \n",
       "3     CD14+ Monocytes  \n",
       "4            NK cells  \n",
       "...               ...  \n",
       "2633  CD14+ Monocytes  \n",
       "2634          B cells  \n",
       "2635          B cells  \n",
       "2636          B cells  \n",
       "2637      CD4 T cells  \n",
       "\n",
       "[2638 rows x 6 columns]"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "experiment.obs.read().concat().to_pandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "74fb8a3e-9139-4711-ae12-6064222eeae0",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>var_id</th>\n",
       "      <th>n_cells</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>TNFRSF4</td>\n",
       "      <td>155</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>CPSF3L</td>\n",
       "      <td>202</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>ATAD3C</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>C1orf86</td>\n",
       "      <td>501</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>RER1</td>\n",
       "      <td>608</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1833</th>\n",
       "      <td>1833</td>\n",
       "      <td>ICOSLG</td>\n",
       "      <td>34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1834</th>\n",
       "      <td>1834</td>\n",
       "      <td>SUMO3</td>\n",
       "      <td>570</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1835</th>\n",
       "      <td>1835</td>\n",
       "      <td>SLC19A1</td>\n",
       "      <td>31</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1836</th>\n",
       "      <td>1836</td>\n",
       "      <td>S100B</td>\n",
       "      <td>94</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1837</th>\n",
       "      <td>1837</td>\n",
       "      <td>PRMT2</td>\n",
       "      <td>588</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1838 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      soma_joinid   var_id  n_cells\n",
       "0               0  TNFRSF4      155\n",
       "1               1   CPSF3L      202\n",
       "2               2   ATAD3C        9\n",
       "3               3  C1orf86      501\n",
       "4               4     RER1      608\n",
       "...           ...      ...      ...\n",
       "1833         1833   ICOSLG       34\n",
       "1834         1834    SUMO3      570\n",
       "1835         1835  SLC19A1       31\n",
       "1836         1836    S100B       94\n",
       "1837         1837    PRMT2      588\n",
       "\n",
       "[1838 rows x 3 columns]"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "var = experiment.ms[\"RNA\"].var\n",
    "var.read().concat().to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4fb5fbd0-f506-48a1-a0c5-717daa85e840",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Exercise: compute raw counts for a gene"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0de38cb2-5131-4a05-a499-d24bf2e87c1e",
   "metadata": {
    "tags": []
   },
   "source": [
    "In this exercise, we will compute the raw counts for a gene. We will only use slices, so at no point the `SparseNDArray` will be fully in memory.\n",
    "\n",
    "Let's start by looking at a specific gene (`ATAD3C`) in the `var` dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "2bd9d50a-5a04-4e65-8ac5-352f0dd64065",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>var_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>ATAD3C</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   soma_joinid  var_id\n",
       "0            2  ATAD3C"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "var.read(column_names=[\"soma_joinid\", \"var_id\"], value_filter=\"var_id == 'ATAD3C'\").concat().to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba1a771d-426c-4dce-b27d-68f552cff383",
   "metadata": {
    "tags": []
   },
   "source": [
    "In order to verify the raw counts, we need to move to the `raw` layer, which can be found in the experiment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "879435ff-b38c-4da0-8459-1fa55e3a60ba",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Measurement 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmpfe5a_4au/ms/raw' (open for 'r') (2 items)\n",
       "    'X': 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmpfe5a_4au/ms/raw/X' (unopened)\n",
       "    'var': 'file:///var/folders/7l/_wsjyk5d4p3dz3kbz7wxn7t00000gn/T/tmpfe5a_4au/ms/raw/var' (unopened)>"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "experiment.ms[\"raw\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "febd089f-731f-4052-a423-4428a9291616",
   "metadata": {
    "tags": []
   },
   "source": [
    "Let's start by looking up the same gene in the raw `var` dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "f09b47e0-8501-4a1b-a734-6087197f9272",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>var_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>30</td>\n",
       "      <td>ATAD3C</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   soma_joinid  var_id\n",
       "0           30  ATAD3C"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw_var = experiment[\"ms\"][\"raw\"].var\n",
    "raw_var.read(column_names=[\"soma_joinid\", \"var_id\"], value_filter=\"var_id == 'ATAD3C'\").concat().to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2e06508-610f-48cb-9abc-0ce25089ddcc",
   "metadata": {},
   "source": [
    "Note the `soma_joinid` column. This is a column that can be used to join related SOMA objects in the experiment. In this case, it can be used to index the `raw.X` matrix second dimension. Therefore, we just need to slice across that dimension, convert the matrix and count the nonzero entries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "924f8aa7-e8b8-4fda-b513-34442d482dec",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "8"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_raw = experiment[\"ms\"][\"raw\"].X[\"data\"]\n",
    "X_raw.read((slice(None), [30])).coos().concat().to_scipy().nnz"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}