{ "cells": [ { "cell_type": "markdown", "id": "2cf7c05c-f723-489d-8c39-3e2841f655b0", "metadata": {}, "source": [ "# Tutorial: SOMA Objects\n", "\n", "In this notebook, we'll go through the various objects available as part of the SOMA API. The dataset used is from Peripheral Blood Mononuclear Cells (PBMC), which is freely available from 10X Genomics. " ] }, { "cell_type": "markdown", "id": "167dba53-7da6-4984-bbe7-a5416e60325d", "metadata": {}, "source": [ "We'll start by importing `tiledbsoma`." ] }, { "cell_type": "code", "execution_count": 1, "id": "ab458224-5353-4e15-baa9-46689729e071", "metadata": { "tags": [] }, "outputs": [], "source": [ "import tiledbsoma" ] }, { "cell_type": "markdown", "id": "ca9f7272-09e0-4eda-a569-8796a14bf776", "metadata": { "tags": [] }, "source": [ "## Experiment" ] }, { "cell_type": "markdown", "id": "41358011-b835-4c3a-a75e-79a80f4cc3a1", "metadata": {}, "source": [ "An `Experiment` is a class that represents a single-cell experiment. It always contains two objects:\n", "1. `obs`: A `DataFrame` with primary annotations on the observation axis.\n", "2. `ms`: A `Collection` of measurements." 
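, "\n", "\n",
"A minimal sketch of how these two members could be checked on an experiment that has already been opened. The helper name `describe_experiment` is hypothetical (it is not part of the SOMA API), and it assumes only that `obs.schema` is an Arrow schema and that `ms` behaves like a mapping:\n",
"\n",
"```python\n",
"def describe_experiment(experiment):\n",
"    # Hypothetical helper, not part of tiledbsoma.\n",
"    # obs.schema is assumed to be an Arrow schema exposing `.names`.\n",
"    obs_columns = list(experiment.obs.schema.names)\n",
"    # ms is assumed to behave like a mapping of measurement name -> object.\n",
"    measurements = sorted(experiment.ms.keys())\n",
"    return {'obs_columns': obs_columns, 'measurements': measurements}\n",
"```\n",
"\n",
"With a real experiment this might be called as `describe_experiment(experiment)` on the object returned by `tiledbsoma.Experiment.open(...)`."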
] }, { "cell_type": "markdown", "id": "988b4245-0cbb-452b-bdc0-6422b03116ef", "metadata": {}, "source": [ "Let's unpack and open the experiment:" ] }, { "cell_type": "code", "execution_count": 2, "id": "2ee55b5c-94a3-4499-85bd-0fa167494aa5", "metadata": {}, "outputs": [], "source": [ "import tarfile\n", "import tempfile\n", "\n", "dense_uri = tempfile.mktemp()\n", "with tarfile.open(\"data/pbmc3k-dense.tgz\") as handle:\n", "    handle.extractall(dense_uri)\n", "experiment = tiledbsoma.Experiment.open(dense_uri)" ] }, { "cell_type": "markdown", "id": "18ed8201-084b-4c09-bd45-9ad769318d3c", "metadata": { "tags": [] }, "source": [ "Each object within the experiment can be opened like this:" ] }, { "cell_type": "code", "execution_count": 3, "id": "228e9411-434e-4c55-8fb4-fef3216dca08", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.ms" ] }, { "cell_type": "code", "execution_count": 4, "id": "5d92e331-5c6c-4971-b956-442996d5efa9", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.obs" ] }, { "cell_type": "markdown", "id": "ebbd3605-7601-4142-8979-4748966d91d7", "metadata": { "tags": [] }, "source": [ "Note that by default an experiment is opened lazily, i.e. only the objects that are actually requested are opened. \n", "\n", "Also, opening an object doesn't mean that it is fetched entirely into memory. It only returns a pointer to the object on disk." ] }, { "cell_type": "markdown", "id": "ccb496d3-09c7-4627-9ed1-eca5a87dc4b4", "metadata": { "tags": [] }, "source": [ "## DataFrame" ] }, { "cell_type": "markdown", "id": "71bb70de-5665-4e25-97bf-32d28d383f66", "metadata": { "tags": [] }, "source": [ "A `DataFrame` is a multi-column table with a user-defined schema. 
The schema is expressed as an Arrow Schema, and defines the column names and value types." ] }, { "cell_type": "markdown", "id": "7cfb6c9f-9101-4083-908a-c61c6b088110", "metadata": { "tags": [] }, "source": [ "As an example, let's take a look at `obs`, which is represented as a SOMA DataFrame.\n", "\n", "We can inspect the schema using `.schema`:" ] }, { "cell_type": "code", "execution_count": 5, "id": "2824acd0-1185-49d8-a61b-2c8e6c9ea261", "metadata": { "tags": [] }, "outputs": [], "source": [ "obs = experiment.obs" ] }, { "cell_type": "markdown", "id": "3a32a17e-5efa-4cc6-9487-c904c6e0d519", "metadata": { "tags": [] }, "source": [ "The `domain` is the bounds within which data can be read or written -- currently, `soma_joinids` from 0 to 2637, inclusive. This can be resized later, as we'll see in the notebook that explains TileDB-SOMA's append mode." ] }, { "cell_type": "code", "execution_count": 8, "id": "edfe04d5-5efc-4ac0-adb9-26a7b21d6da1", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "((0, 2637),)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs.domain" ] }, { "cell_type": "code", "execution_count": 9, "id": "2ac36ad7-0bfd-48fd-9a04-c183af227bae", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "soma_joinid: int64 not null\n", "obs_id: large_string\n", "n_genes: int64\n", "percent_mito: float\n", "n_counts: float\n", "louvain: dictionary" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs.schema" ] }, { "cell_type": "markdown", "id": "dfcccee0-6d0a-4a8c-bf5a-897ed38c1749", "metadata": { "tags": [] }, "source": [ "Note that `soma_joinid` is a field that exists in each `DataFrame` and acts as a join key for other objects, such as `SparseNDArray` (more on this later)." 
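, "\n", "\n",
"The join-key idea can be sketched in plain Python, without any SOMA calls. The two `obs_id` values are taken from the `obs` table shown later in this notebook; the matrix entries and the helper name `label_entries` are made up for illustration:\n",
"\n",
"```python\n",
"# soma_joinid -> obs_id, as in the first rows of obs.\n",
"obs_ids = {0: 'AAACATACAACCAC-1', 1: 'AAACATTGAGCTAC-1'}\n",
"\n",
"# Made-up (soma_dim_0, soma_dim_1, value) entries of a 2-D array.\n",
"# soma_dim_0 is assumed to hold the soma_joinid of the obs row.\n",
"entries = [(0, 5, 1.5), (1, 5, -0.25), (1, 7, 2.0)]\n",
"\n",
"def label_entries(entries, obs_ids):\n",
"    # Replace the row coordinate of each entry with the matching obs_id.\n",
"    return [(obs_ids[row], col, value) for row, col, value in entries]\n",
"\n",
"labelled = label_entries(entries, obs_ids)\n",
"```\n",
"\n",
"Each entry of `labelled` now carries the cell barcode instead of the integer row coordinate."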
] }, { "cell_type": "markdown", "id": "a25cb83c-0e1a-4a2a-be21-8235ff63a647", "metadata": { "tags": [] }, "source": [ "When a `DataFrame` is accessed, only metadata is retrieved, not actual data. This is important since a DataFrame can be very large and might not fit in memory.\n", "\n", "To materialize the dataframe (or a subset) in memory, we call `df.read()`. \n", "\n", "If the dataframe is small, we can convert it to an in-memory Pandas object like this:" ] }, { "cell_type": "code", "execution_count": 10, "id": "26676c4f-dfb8-4f48-9bc5-1a66ee085f9e", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr><th></th><th>soma_joinid</th><th>obs_id</th><th>n_genes</th><th>percent_mito</th><th>n_counts</th><th>louvain</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>0</td><td>AAACATACAACCAC-1</td><td>781</td><td>0.030178</td><td>2419.0</td><td>CD4 T cells</td></tr>\n",
" <tr><th>1</th><td>1</td><td>AAACATTGAGCTAC-1</td><td>1352</td><td>0.037936</td><td>4903.0</td><td>B cells</td></tr>\n",
" <tr><th>2</th><td>2</td><td>AAACATTGATCAGC-1</td><td>1131</td><td>0.008897</td><td>3147.0</td><td>CD4 T cells</td></tr>\n",
" <tr><th>3</th><td>3</td><td>AAACCGTGCTTCCG-1</td><td>960</td><td>0.017431</td><td>2639.0</td><td>CD14+ Monocytes</td></tr>\n",
" <tr><th>4</th><td>4</td><td>AAACCGTGTATGCG-1</td><td>522</td><td>0.012245</td><td>980.0</td><td>NK cells</td></tr>\n",
" <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>2633</th><td>2633</td><td>TTTCGAACTCTCAT-1</td><td>1155</td><td>0.021104</td><td>3459.0</td><td>CD14+ Monocytes</td></tr>\n",
" <tr><th>2634</th><td>2634</td><td>TTTCTACTGAGGCA-1</td><td>1227</td><td>0.009294</td><td>3443.0</td><td>B cells</td></tr>\n",
" <tr><th>2635</th><td>2635</td><td>TTTCTACTTCCTCG-1</td><td>622</td><td>0.021971</td><td>1684.0</td><td>B cells</td></tr>\n",
" <tr><th>2636</th><td>2636</td><td>TTTGCATGAGAGGC-1</td><td>454</td><td>0.020548</td><td>1022.0</td><td>B cells</td></tr>\n",
" <tr><th>2637</th><td>2637</td><td>TTTGCATGCCTCAC-1</td><td>724</td><td>0.008065</td><td>1984.0</td><td>CD4 T cells</td></tr>\n",
" </tbody>\n", "</table>\n", "<p>2638 rows × 6 columns</p>\n", "</div>
" ], "text/plain": [ " soma_joinid obs_id n_genes percent_mito n_counts \\\n", "0 0 AAACATACAACCAC-1 781 0.030178 2419.0 \n", "1 1 AAACATTGAGCTAC-1 1352 0.037936 4903.0 \n", "2 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 \n", "3 3 AAACCGTGCTTCCG-1 960 0.017431 2639.0 \n", "4 4 AAACCGTGTATGCG-1 522 0.012245 980.0 \n", "... ... ... ... ... ... \n", "2633 2633 TTTCGAACTCTCAT-1 1155 0.021104 3459.0 \n", "2634 2634 TTTCTACTGAGGCA-1 1227 0.009294 3443.0 \n", "2635 2635 TTTCTACTTCCTCG-1 622 0.021971 1684.0 \n", "2636 2636 TTTGCATGAGAGGC-1 454 0.020548 1022.0 \n", "2637 2637 TTTGCATGCCTCAC-1 724 0.008065 1984.0 \n", "\n", " louvain \n", "0 CD4 T cells \n", "1 B cells \n", "2 CD4 T cells \n", "3 CD14+ Monocytes \n", "4 NK cells \n", "... ... \n", "2633 CD14+ Monocytes \n", "2634 B cells \n", "2635 B cells \n", "2636 B cells \n", "2637 CD4 T cells \n", "\n", "[2638 rows x 6 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs.read().concat().to_pandas()" ] }, { "cell_type": "markdown", "id": "630e8cdd-7ff2-4fb9-8c2a-650e16b3b43b", "metadata": { "tags": [] }, "source": [ "Here, `read()` returns an iterator, `concat()` materializes all rows to memory and `to_pandas()` returns a Pandas view of the dataframe." ] }, { "cell_type": "markdown", "id": "c5a1bbc5-742d-483d-b572-fef0c5caa4c4", "metadata": { "tags": [] }, "source": [ "If the dataframe is bigger, we can only select a subset of it before materializing. This will only retrieve the required subset from disk to memory, so very large dataframes can be queried this way. In this example, we will only select the first 10 rows:" ] }, { "cell_type": "code", "execution_count": 11, "id": "32bfed6c-b0b7-41ed-986c-df7d462498c4", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr><th></th><th>soma_joinid</th><th>obs_id</th><th>n_genes</th><th>percent_mito</th><th>n_counts</th><th>louvain</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>0</td><td>AAACATACAACCAC-1</td><td>781</td><td>0.030178</td><td>2419.0</td><td>CD4 T cells</td></tr>\n",
" <tr><th>1</th><td>1</td><td>AAACATTGAGCTAC-1</td><td>1352</td><td>0.037936</td><td>4903.0</td><td>B cells</td></tr>\n",
" <tr><th>2</th><td>2</td><td>AAACATTGATCAGC-1</td><td>1131</td><td>0.008897</td><td>3147.0</td><td>CD4 T cells</td></tr>\n",
" <tr><th>3</th><td>3</td><td>AAACCGTGCTTCCG-1</td><td>960</td><td>0.017431</td><td>2639.0</td><td>CD14+ Monocytes</td></tr>\n",
" <tr><th>4</th><td>4</td><td>AAACCGTGTATGCG-1</td><td>522</td><td>0.012245</td><td>980.0</td><td>NK cells</td></tr>\n",
" <tr><th>5</th><td>5</td><td>AAACGCACTGGTAC-1</td><td>782</td><td>0.016644</td><td>2163.0</td><td>CD8 T cells</td></tr>\n",
" <tr><th>6</th><td>6</td><td>AAACGCTGACCAGT-1</td><td>783</td><td>0.038161</td><td>2175.0</td><td>CD8 T cells</td></tr>\n",
" <tr><th>7</th><td>7</td><td>AAACGCTGGTTCTT-1</td><td>790</td><td>0.030973</td><td>2260.0</td><td>CD8 T cells</td></tr>\n",
" <tr><th>8</th><td>8</td><td>AAACGCTGTAGCCA-1</td><td>533</td><td>0.011765</td><td>1275.0</td><td>CD4 T cells</td></tr>\n",
" <tr><th>9</th><td>9</td><td>AAACGCTGTTTCTG-1</td><td>550</td><td>0.029012</td><td>1103.0</td><td>FCGR3A+ Monocytes</td></tr>\n",
" <tr><th>10</th><td>10</td><td>AAACTTGAAAAACG-1</td><td>1116</td><td>0.026316</td><td>3914.0</td><td>B cells</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " soma_joinid obs_id n_genes percent_mito n_counts \\\n", "0 0 AAACATACAACCAC-1 781 0.030178 2419.0 \n", "1 1 AAACATTGAGCTAC-1 1352 0.037936 4903.0 \n", "2 2 AAACATTGATCAGC-1 1131 0.008897 3147.0 \n", "3 3 AAACCGTGCTTCCG-1 960 0.017431 2639.0 \n", "4 4 AAACCGTGTATGCG-1 522 0.012245 980.0 \n", "5 5 AAACGCACTGGTAC-1 782 0.016644 2163.0 \n", "6 6 AAACGCTGACCAGT-1 783 0.038161 2175.0 \n", "7 7 AAACGCTGGTTCTT-1 790 0.030973 2260.0 \n", "8 8 AAACGCTGTAGCCA-1 533 0.011765 1275.0 \n", "9 9 AAACGCTGTTTCTG-1 550 0.029012 1103.0 \n", "10 10 AAACTTGAAAAACG-1 1116 0.026316 3914.0 \n", "\n", " louvain \n", "0 CD4 T cells \n", "1 B cells \n", "2 CD4 T cells \n", "3 CD14+ Monocytes \n", "4 NK cells \n", "5 CD8 T cells \n", "6 CD8 T cells \n", "7 CD8 T cells \n", "8 CD4 T cells \n", "9 FCGR3A+ Monocytes \n", "10 B cells " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs.read((slice(0,10),)).concat().to_pandas()" ] }, { "cell_type": "markdown", "id": "be2bfd77-d54c-4181-a962-6c00610c122a", "metadata": { "tags": [] }, "source": [ "We can also select a subset of the columns:" ] }, { "cell_type": "code", "execution_count": 12, "id": "703fe8ad-7123-4311-a58b-b00a27c7a483", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr><th></th><th>obs_id</th><th>n_genes</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>AAACATACAACCAC-1</td><td>781</td></tr>\n",
" <tr><th>1</th><td>AAACATTGAGCTAC-1</td><td>1352</td></tr>\n",
" <tr><th>2</th><td>AAACATTGATCAGC-1</td><td>1131</td></tr>\n",
" <tr><th>3</th><td>AAACCGTGCTTCCG-1</td><td>960</td></tr>\n",
" <tr><th>4</th><td>AAACCGTGTATGCG-1</td><td>522</td></tr>\n",
" <tr><th>5</th><td>AAACGCACTGGTAC-1</td><td>782</td></tr>\n",
" <tr><th>6</th><td>AAACGCTGACCAGT-1</td><td>783</td></tr>\n",
" <tr><th>7</th><td>AAACGCTGGTTCTT-1</td><td>790</td></tr>\n",
" <tr><th>8</th><td>AAACGCTGTAGCCA-1</td><td>533</td></tr>\n",
" <tr><th>9</th><td>AAACGCTGTTTCTG-1</td><td>550</td></tr>\n",
" <tr><th>10</th><td>AAACTTGAAAAACG-1</td><td>1116</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " obs_id n_genes\n", "0 AAACATACAACCAC-1 781\n", "1 AAACATTGAGCTAC-1 1352\n", "2 AAACATTGATCAGC-1 1131\n", "3 AAACCGTGCTTCCG-1 960\n", "4 AAACCGTGTATGCG-1 522\n", "5 AAACGCACTGGTAC-1 782\n", "6 AAACGCTGACCAGT-1 783\n", "7 AAACGCTGGTTCTT-1 790\n", "8 AAACGCTGTAGCCA-1 533\n", "9 AAACGCTGTTTCTG-1 550\n", "10 AAACTTGAAAAACG-1 1116" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs.read((slice(0, 10),), column_names=[\"obs_id\", \"n_genes\"]).concat().to_pandas()" ] }, { "cell_type": "markdown", "id": "67e2b4f4-ba5b-4a9c-a847-9da954b4c467", "metadata": { "tags": [] }, "source": [ "Finally, we can use `value_filter` to retrieve a filtered subset of rows that match a certain condition." ] }, { "cell_type": "code", "execution_count": 13, "id": "a5ef3a97-abc3-4d80-ab48-1898fa64d566", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr><th></th><th>soma_joinid</th><th>obs_id</th><th>n_genes</th><th>percent_mito</th><th>n_counts</th><th>louvain</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>26</td><td>AAATCAACCCTATT-1</td><td>1545</td><td>0.024313</td><td>5676.0</td><td>CD4 T cells</td></tr>\n",
" <tr><th>1</th><td>59</td><td>AACCTACTGTGAGG-1</td><td>1652</td><td>0.015839</td><td>5682.0</td><td>CD14+ Monocytes</td></tr>\n",
" <tr><th>2</th><td>107</td><td>AAGCACTGGTTCTT-1</td><td>1717</td><td>0.023566</td><td>6153.0</td><td>B cells</td></tr>\n",
" <tr><th>3</th><td>109</td><td>AAGCCATGAACTGC-1</td><td>1877</td><td>0.014015</td><td>7064.0</td><td>Dendritic cells</td></tr>\n",
" <tr><th>4</th><td>247</td><td>ACCCAGCTGTTAGC-1</td><td>1547</td><td>0.020600</td><td>5534.0</td><td>CD14+ Monocytes</td></tr>\n",
" <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>70</th><td>2508</td><td>TTACTCGACGCAAT-1</td><td>1603</td><td>0.024851</td><td>5030.0</td><td>Dendritic cells</td></tr>\n",
" <tr><th>71</th><td>2530</td><td>TTATGGCTTATGGC-1</td><td>1783</td><td>0.022064</td><td>6164.0</td><td>Dendritic cells</td></tr>\n",
" <tr><th>72</th><td>2597</td><td>TTGAGGACTACGCA-1</td><td>1794</td><td>0.024440</td><td>6342.0</td><td>Dendritic cells</td></tr>\n",
" <tr><th>73</th><td>2623</td><td>TTTAGCTGTACTCT-1</td><td>1567</td><td>0.021160</td><td>5671.0</td><td>Dendritic cells</td></tr>\n",
" <tr><th>74</th><td>2632</td><td>TTTCGAACACCTGA-1</td><td>1544</td><td>0.013019</td><td>4455.0</td><td>Dendritic cells</td></tr>\n",
" </tbody>\n", "</table>\n", "<p>75 rows × 6 columns</p>\n", "</div>
" ], "text/plain": [ " soma_joinid obs_id n_genes percent_mito n_counts \\\n", "0 26 AAATCAACCCTATT-1 1545 0.024313 5676.0 \n", "1 59 AACCTACTGTGAGG-1 1652 0.015839 5682.0 \n", "2 107 AAGCACTGGTTCTT-1 1717 0.023566 6153.0 \n", "3 109 AAGCCATGAACTGC-1 1877 0.014015 7064.0 \n", "4 247 ACCCAGCTGTTAGC-1 1547 0.020600 5534.0 \n", ".. ... ... ... ... ... \n", "70 2508 TTACTCGACGCAAT-1 1603 0.024851 5030.0 \n", "71 2530 TTATGGCTTATGGC-1 1783 0.022064 6164.0 \n", "72 2597 TTGAGGACTACGCA-1 1794 0.024440 6342.0 \n", "73 2623 TTTAGCTGTACTCT-1 1567 0.021160 5671.0 \n", "74 2632 TTTCGAACACCTGA-1 1544 0.013019 4455.0 \n", "\n", " louvain \n", "0 CD4 T cells \n", "1 CD14+ Monocytes \n", "2 B cells \n", "3 Dendritic cells \n", "4 CD14+ Monocytes \n", ".. ... \n", "70 Dendritic cells \n", "71 Dendritic cells \n", "72 Dendritic cells \n", "73 Dendritic cells \n", "74 Dendritic cells \n", "\n", "[75 rows x 6 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs.read((slice(None),), value_filter=\"n_genes > 1500\").concat().to_pandas()" ] }, { "cell_type": "markdown", "id": "6b60c290-7ce8-4324-9694-2b76b802dd9a", "metadata": { "tags": [] }, "source": [ "## Collection" ] }, { "cell_type": "markdown", "id": "e961b2f8-5e77-4c40-be87-4283fe9da010", "metadata": {}, "source": [ "A `Collection` is a persistent container of named SOMA objects, stored as a mapping of string keys and SOMA object values.\n", "\n", "The `ms` member in an Experiment is implemented as a Collection. 
Let's take a look:" ] }, { "cell_type": "code", "execution_count": 14, "id": "d437b606-8338-4220-966d-59c4bf48fd13", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.ms" ] }, { "cell_type": "markdown", "id": "c02367a7-425d-4135-993c-1dd880b394c5", "metadata": { "tags": [] }, "source": [ "In this case, we have two members: `raw` and `RNA`. They can be accessed as if they were dict members:" ] }, { "cell_type": "code", "execution_count": 15, "id": "0574abf8-5f72-4a05-a90f-608fdda2db07", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.ms[\"raw\"]" ] }, { "cell_type": "markdown", "id": "3d87abc9-08d2-4a15-b9f7-eb9ed4f9791e", "metadata": { "tags": [] }, "source": [ "## DenseNDArray" ] }, { "cell_type": "markdown", "id": "4d8e320c-1484-4334-948f-9852d8e23f47", "metadata": {}, "source": [ "A `DenseNDArray` is a dense, N-dimensional array, with offset (zero-based) integer indexing on each dimension. \n", "\n", "`DenseNDArray` has a user-defined schema, which includes:\n", "- the element type, expressed as an Arrow type, indicating the type of data contained within the array, and\n", "- the shape of the array, i.e., the number of dimensions and the length of each dimension" ] }, { "cell_type": "markdown", "id": "f9d92b04-6556-4ec2-8efc-4906a632fdea", "metadata": {}, "source": [ "In a SOMA single-cell experiment, the cell-by-gene matrix X is typically represented either by `DenseNDArray` or `SparseNDArray`. 
Let's take a look at our example:" ] }, { "cell_type": "code", "execution_count": 16, "id": "c8c2aa17-52d7-4bd5-a5f3-b58c18fdcb11", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = experiment[\"ms\"][\"RNA\"].X\n", "X" ] }, { "cell_type": "markdown", "id": "8f67613f-6075-4bb2-9e47-885dcfeb313e", "metadata": {}, "source": [ "Within the experiment, `X` is a `Collection` and the data can be accessed using `[\"data\"]`:" ] }, { "cell_type": "code", "execution_count": 17, "id": "5ff89028-e44b-47d9-9dac-d5fe14f92d18", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = X[\"data\"]\n", "X" ] }, { "cell_type": "markdown", "id": "e2b161fe-96f9-4c07-843d-211e63bba818", "metadata": { "tags": [] }, "source": [ "We can inspect the `DenseNDArray` and get useful information by using `.schema`:" ] }, { "cell_type": "code", "execution_count": 18, "id": "7ef469db-fede-48a3-97ee-622a44e19970", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "soma_dim_0: int64 not null\n", "soma_dim_1: int64 not null\n", "soma_data: float not null" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.schema" ] }, { "cell_type": "markdown", "id": "1dea46bf-69b4-4df0-9168-c01548d488e4", "metadata": { "tags": [] }, "source": [ "In this case, we see there are two dimensions and the data is of type `float`." ] }, { "cell_type": "markdown", "id": "9d7b9e04-abed-492c-8ff2-958f0c922ca0", "metadata": {}, "source": [ "As with the domain for `obs`, this array has a shape: the boundary within which data can be read or written. This may be resized later, as we'll see in the notebook on TileDB-SOMA's append mode." 
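, "\n", "\n",
"As a rough sketch of what this boundary means for reads, the plain-Python helpers below check whether an inclusive `(start, stop)` range fits inside one dimension of a given shape. The names `fits_in_dim` and `rows_selected` are hypothetical, and the sketch assumes that both endpoints are included, which matches the row counts returned by the slices used in this notebook:\n",
"\n",
"```python\n",
"def fits_in_dim(start, stop, dim_length):\n",
"    # Valid coordinates run from 0 to dim_length - 1, both included.\n",
"    return 0 <= start <= stop <= dim_length - 1\n",
"\n",
"def rows_selected(start, stop):\n",
"    # An inclusive range covers stop - start + 1 coordinates.\n",
"    return stop - start + 1\n",
"\n",
"shape = (2638, 1838)  # the shape reported for X in this notebook\n",
"inside = fits_in_dim(0, 2637, shape[0])\n",
"outside = fits_in_dim(0, 2638, shape[0])\n",
"count = rows_selected(0, 9)\n",
"```\n",
"\n",
"Here `inside` is `True`, `outside` is `False`, and `count` is 10."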
] }, { "cell_type": "code", "execution_count": 19, "id": "d4bebe41-5fb2-43fe-85e2-7f4680edfa35", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "(2638, 1838)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.shape" ] }, { "cell_type": "markdown", "id": "f6feae7b-bb6e-4029-85a8-a5375bc53b86", "metadata": { "tags": [] }, "source": [ "Similarly to `DataFrame`, when opening a `DenseNDArray` only metadata is fetched, and the array isn't fetched into memory. \n", "\n", "We can convert the matrix into a `pyarrow.Tensor` using `.read()`:" ] }, { "cell_type": "code", "execution_count": 21, "id": "2c586c5c-055b-4bc7-9995-851dd802d961", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "\n", "type: float\n", "shape: (2638, 1838)\n", "strides: (7352, 4)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.read()" ] }, { "cell_type": "markdown", "id": "209d9b42-d8d3-4d67-b3a0-b45daaef1ee3", "metadata": { "tags": [] }, "source": [ "From here, we can convert it further to a `numpy.ndarray`:" ] }, { "cell_type": "code", "execution_count": 22, "id": "6d90e592-7a67-4a41-af08-05aa3807167a", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,\n", " -0.2090951 , -0.5312034 ],\n", " [-0.21458222, -0.37265295, -0.05480444, ..., -0.26684406,\n", " -0.31314576, -0.5966544 ],\n", " [-0.37688747, -0.2950843 , -0.0575275 , ..., -0.15865596,\n", " -0.17087643, 1.379 ],\n", " ...,\n", " [-0.2070895 , -0.250464 , -0.046397 , ..., -0.05114426,\n", " -0.16106427, 2.0414972 ],\n", " [-0.19032837, -0.2263336 , -0.04399938, ..., -0.00591773,\n", " -0.13521303, -0.48211113],\n", " [-0.33378917, -0.2535875 , -0.05271563, ..., -0.07842438,\n", " -0.13032717, -0.4713379 ]], dtype=float32)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"X.read().to_numpy()" ] }, { "cell_type": "markdown", "id": "63e894c8-65fc-486d-b2b7-9b989f8e1cea", "metadata": { "tags": [] }, "source": [ "This will only work on small matrices, since a `numpy` array needs to be in memory. \n", "\n", "We can retrieve a subset of the matrix passing coordinates to `.read()`. Here we're only retrieving the first 10 rows of the matrix:" ] }, { "cell_type": "code", "execution_count": 23, "id": "4a3b1f45-017d-4b92-9f2e-c88e8e3aa234", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array([[-0.17146951, -0.28081203, -0.04667679, ..., -0.09826884,\n", " -0.2090951 , -0.5312034 ],\n", " [-0.21458222, -0.37265295, -0.05480444, ..., -0.26684406,\n", " -0.31314576, -0.5966544 ],\n", " [-0.37688747, -0.2950843 , -0.0575275 , ..., -0.15865596,\n", " -0.17087643, 1.379 ],\n", " ...,\n", " [-0.15813293, -0.27562705, -0.04569191, ..., -0.08687588,\n", " -0.2062048 , 1.6869122 ],\n", " [ 4.861763 , -0.23054866, -0.04826924, ..., -0.02755091,\n", " -0.11788268, -0.4664504 ],\n", " [-0.12453113, -0.23373608, -0.04131226, ..., -0.00758654,\n", " -0.16255915, -0.50339466]], dtype=float32)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sliced_X = X.read((slice(0,9),)).to_numpy()\n", "sliced_X" ] }, { "cell_type": "code", "execution_count": 24, "id": "007f1e15-61cd-40b8-bb23-4102662ab3af", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "(10, 1838)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sliced_X.shape" ] }, { "cell_type": "markdown", "id": "84b630da-669f-4503-b0e0-6316d265608f", "metadata": {}, "source": [ "Note that `DenseNDArray` is always indexed, on each dimension, using zero-based integers. 
If this dimension matches any other object in the experiment, the `soma_joinid` column can be used to retrieve the correct slice.\n", "\n", "In the following example, we will get the values of X for the gene tagged as `ICOSLG`. This involves reading the `var` DataFrame using a `value_filter`, retrieving the `soma_joinid` for the gene and passing it as coordinate to `X.read`:\n" ] }, { "cell_type": "code", "execution_count": 25, "id": "d6d39b44-33b3-4cb7-8a34-d30b94899ad1", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array([[-0.12167774],\n", " [-0.05866209],\n", " [-0.07043106],\n", " ...,\n", " [-0.1320983 ],\n", " [-0.14978862],\n", " [-0.10383061]], dtype=float32)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "var = experiment.ms[\"RNA\"].var\n", "idx = var.read(value_filter=\"var_id == 'ICOSLG'\").concat()[\"soma_joinid\"].to_numpy()\n", "\n", "X.read((None, int(idx[0]))).to_numpy()" ] }, { "cell_type": "markdown", "id": "3c33cc1d-25b9-4973-b076-f78dce246cdd", "metadata": {}, "source": [ "## SparseNDArray" ] }, { "cell_type": "markdown", "id": "91416b83-72a9-47a8-824c-67eb8987d937", "metadata": { "tags": [] }, "source": [ "A `SparseNDArray` is a sparse, N-dimensional array, with offset (zero-based) integer indexing on each dimension. `SparseNDArray` has a user-defined schema, which includes:\n", "- the element type, expressed as an Arrow type, indicating the type of data\n", " contained within the array, and\n", "- the shape of the array, i.e., the number of dimensions and the length of\n", " each dimension" ] }, { "cell_type": "markdown", "id": "cf955720-db72-4536-b568-134e308d17e0", "metadata": {}, "source": [ "A `SparseNDArray` is functionally similar to a `DenseNDArray`, except that only elements that have a nonzero value are actually stored. Elements that are not explicitly stored are assumed to be zeros." 
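, "\n", "\n",
"The storage model can be illustrated with a small plain-Python sketch. It does not use the SOMA API; the class name `TinySparse` and its values are made up for illustration:\n",
"\n",
"```python\n",
"class TinySparse:\n",
"    # Toy 2-D sparse array: only nonzero values are kept, keyed by (row, col).\n",
"    def __init__(self, shape):\n",
"        self.shape = shape\n",
"        self._stored = {}\n",
"\n",
"    def set(self, row, col, value):\n",
"        if value != 0:\n",
"            self._stored[(row, col)] = value\n",
"        else:\n",
"            # Writing a zero simply drops any stored entry.\n",
"            self._stored.pop((row, col), None)\n",
"\n",
"    def get(self, row, col):\n",
"        # Coordinates that are not stored read back as zero.\n",
"        return self._stored.get((row, col), 0.0)\n",
"\n",
"    @property\n",
"    def nnz(self):\n",
"        return len(self._stored)\n",
"\n",
"m = TinySparse((3, 4))\n",
"m.set(0, 1, 2.5)\n",
"m.set(2, 3, -1.0)\n",
"m.set(1, 1, 0.0)\n",
"```\n",
"\n",
"After these calls `m.nnz` is 2, and `m.get(1, 1)` returns `0.0` even though nothing was stored at that position."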
] }, { "cell_type": "markdown", "id": "836a3a97-901e-4428-8ee3-d93a49a1ee26", "metadata": { "tags": [] }, "source": [ "As an example, we will load a version of pbmc3k that has been generated using a `SparseNDArray`:" ] }, { "cell_type": "code", "execution_count": 26, "id": "71f8fed4-4ffd-4f30-a5b0-4e3a4a3730f3", "metadata": { "tags": [] }, "outputs": [], "source": [ "import tarfile\n", "import tempfile\n", "\n", "sparse_uri = tempfile.mktemp()\n", "with tarfile.open(\"data/pbmc3k-sparse.tgz\") as handle:\n", " handle.extractall(sparse_uri)\n", "experiment = tiledbsoma.Experiment.open(sparse_uri)" ] }, { "cell_type": "markdown", "id": "76935223-eda5-48c0-89f5-26e5bfdf3628", "metadata": {}, "source": [ "Let's take a look at the schema:" ] }, { "cell_type": "code", "execution_count": 27, "id": "41897a5c-2225-49f9-b9f2-3a68a6ad8079", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "soma_dim_0: int64 not null\n", "soma_dim_1: int64 not null\n", "soma_data: float not null" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = experiment.ms[\"RNA\"].X[\"data\"]\n", "X.schema" ] }, { "cell_type": "markdown", "id": "ff9fcfd3-b456-430c-9a98-6cd46d2dd9d2", "metadata": {}, "source": [ "This is the same as the `DenseNDArray` version, which makes sense since it's still a 2-dimensional matrix with `float` data.\n", "\n", "Let's look at the shape:" ] }, { "cell_type": "code", "execution_count": 28, "id": "42c1d852-6492-4a5e-b1fe-bc9af3f83639", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "(2638, 1838)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.shape" ] }, { "cell_type": "markdown", "id": "5f91e73f-aa79-4fa2-a763-ad792e5641a8", "metadata": {}, "source": [ "This too has a shape: the boundary within which data can be read or written. This can be resized later as we'll see in the notebook on TileDB-SOMA's append mode." 
] }, { "cell_type": "markdown", "id": "c3aa3b74-4c59-421c-96fe-215989103a41", "metadata": { "tags": [] }, "source": [ "We can get the number of nonzero elements by calling `.nnz`:" ] }, { "cell_type": "code", "execution_count": 29, "id": "2862f737-4f08-4886-9496-fe7771b4a581", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "4848644" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.nnz" ] }, { "cell_type": "markdown", "id": "6b4f394f-8d6c-4ac5-9d45-997958b319a5", "metadata": { "tags": [] }, "source": [ "In order to work with a `SparseNDArray`, we call `.read()`:" ] }, { "cell_type": "code", "execution_count": 30, "id": "eaa0f9aa-8167-4f26-a52f-4d9636dde37b", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.read()" ] }, { "cell_type": "markdown", "id": "bd75223e-6378-4159-8478-0b970ee2d5a4", "metadata": {}, "source": [ "This returns a `SparseNDArrayRead` that can be used for getting iterators. For instance, we can do:" ] }, { "cell_type": "code", "execution_count": 31, "id": "00a7899f-2d28-4f07-b438-ab4d4d6bcfe5", "metadata": { "tags": [] }, "outputs": [], "source": [ "tensor = X.read().coos().concat()" ] }, { "cell_type": "markdown", "id": "127721ae-4c0b-42ac-ac9e-20dd9e62a682", "metadata": { "tags": [] }, "source": [ "This returns a sparse [Arrow Tensor](https://arrow.apache.org/docs/cpp/api/tensor.html) (a `pyarrow.SparseCOOTensor`) that can be used to access the array, or convert it further to different formats. 
For instance:" ] }, { "cell_type": "code", "execution_count": 32, "id": "f62472c6-e67c-44f0-8ed2-df9bec3ae3e8", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tensor.to_scipy()" ] }, { "cell_type": "markdown", "id": "a4f47d52-2883-4515-afc2-d3f9d9d4ad31", "metadata": { "tags": [] }, "source": [ "can be used to transform it to a [SciPy coo_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html). " ] }, { "cell_type": "markdown", "id": "a064c7b3-5ac5-4b0f-855a-0ffee47a709c", "metadata": { "tags": [] }, "source": [ "Similarly to `DenseNDArray`s, we can call `.read()` with a slice to only obtain a subset of the matrix. As an example:" ] }, { "cell_type": "code", "execution_count": 33, "id": "d5d9ca87-58cc-44bf-ba48-7e2bf3b6c5a7", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sliced_X = X.read((slice(0,9),)).coos().concat().to_scipy()\n", "sliced_X" ] }, { "cell_type": "markdown", "id": "c4dbf334-e525-45a2-b1cf-8531d064a89c", "metadata": { "tags": [] }, "source": [ "Let's verify that the slice is correct. 
To do that, we can call `nonzero()` on the `scipy.sparse.coo_matrix` to obtain the coordinates of the nonzero items, and look at the coordinates over the first dimension:" ] }, { "cell_type": "code", "execution_count": 34, "id": "d542f63e-7ca8-4e68-8933-cb15f17bc8cb", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 0, ..., 9, 9, 9], dtype=int32)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sliced_X.nonzero()[0]" ] }, { "cell_type": "code", "execution_count": null, "id": "1f812d75-bd42-417c-8337-0111cd648a85", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 5 }