Spatial Tx: ingest 10x Visium sample data

WARNING Spatial support is experimental and under active development. There will likely be breaking changes to both the storage format and API.

[1]:
import shlex
from functools import partial
from os import makedirs, remove
from os.path import exists, join
from shutil import rmtree
from subprocess import check_call
from sys import stderr

import matplotlib.patches as mplp
import matplotlib.pyplot as plt
import numpy as np
import scanpy as sc
from matplotlib.collections import PatchCollection

import tiledbsoma
from tiledbsoma import Experiment
from tiledbsoma.io.spatial import from_visium

err = partial(print, file=stderr)

/home/jules/.pyenv/versions/3.12.7/envs/soma/lib/python3.12/site-packages/dask/dataframe/__init__.py:31: FutureWarning: The legacy Dask DataFrame implementation is deprecated and will be removed in a future version. Set the configuration option `dataframe.query-planning` to `True` or None to enable the new Dask Dataframe implementation and silence this warning.
  warnings.warn(

Papermill parameters:

[2]:
# 10x visium sample data paths to download; will be skipped if already present locally, unless `overwrite`
url_base = 'https://cf.10xgenomics.com/samples/spatial-exp/2.1.0'
dataset_name = 'CytAssist_FFPE_Protein_Expression_Human_Glioblastoma'
filtered_h5 = 'CytAssist_FFPE_Protein_Expression_Human_Glioblastoma_filtered_feature_bc_matrix.h5'
spatial_tar = 'CytAssist_FFPE_Protein_Expression_Human_Glioblastoma_spatial.tar.gz'

# Local download/ingestion paths/configs:
data_root = 'data'     # Sync 10x Visium spatial data into f"{data_root}/{dataset_name}/10x". By default, spatial SOMA data will be ingested into f"{data_root}/{dataset_name}/soma" as well.
exp_uri = None         # Ingest spatial SOMA data here; defaults to f"{data_root}/{dataset_name}/soma"
scene_name = "scene1"  # Scene name to write, in ingested spatial SOMA
overwrite_10x = False  # If  already exists, remove and re-ingest it
overwrite_exp = False  # If `exp_uri` already exists, remove and re-ingest it
[3]:
# Set default paths
dataset_root = join(data_root, dataset_name)
data_dir_10x = join(dataset_root, '10x')
if exp_uri is None:
    exp_uri = join(dataset_root, 'soma')
exp_uri
[3]:
'data/CytAssist_FFPE_Protein_Expression_Human_Glioblastoma/soma'
[4]:
def sh(*cmd):
    err(f"Running: {shlex.join(cmd)}")
    check_call(cmd)

Download sample data from 10x

This section will download data from 10x and use that data to generate the TileDB-SOMA Experiment with spatial data.

[5]:
if exists(exp_uri):
    if overwrite_exp:
        err(f"Removing {exp_uri}")
        rmtree(exp_uri)
[6]:
paths = {
    filtered_h5: 'filtered_feature_bc_matrix.h5',
    spatial_tar: spatial_tar
}
if exists(exp_uri):
    err(f"{exp_uri} exists; skipping 10x data download")
else:
    for src_name, dst_name in paths.items():
        src = f'{url_base}/{dataset_name}/{src_name}'
        dst = join(data_dir_10x, dst_name)
        if exists(dst):
            if overwrite_10x:
                err(f"{dst} exists, removing")
                remove(dst)
        if not exists(dst):
            makedirs(data_dir_10x, exist_ok=True)
            sh('wget', '-qO', dst, src)
            if dst.endswith('.tar.gz'):
                sh('tar', '-C', data_dir_10x, '-xvf', dst)
data/CytAssist_FFPE_Protein_Expression_Human_Glioblastoma/soma exists; skipping 10x data download
[7]:
if not exists(exp_uri):
    err(f"Ingesting {data_dir_10x} to {exp_uri}")
    from_visium(
        exp_uri,
        input_path=data_dir_10x,
        measurement_name="RNA",
        scene_name=scene_name,
        use_raw_counts=False,
    )

Data access

Spatial SOMA experiments can be acess and queries using any of tiledbsoma’s exsiting APIs.

[8]:
exp = Experiment.open(exp_uri)
exp.spatial
[8]:
<Collection 'file:///home/jules/Software/TileDB-Inc/TileDB-SOMA/apis/python/notebooks/data/CytAssist_FFPE_Protein_Expression_Human_Glioblastoma/soma/spatial' (open for 'r') (1 item)
    'scene1': 'file:///home/jules/Software/TileDB-Inc/TileDB-SOMA/apis/python/notebooks/data/CytAssist_FFPE_Protein_Expression_Human_Glioblastoma/soma/spatial/scene1' (unopened)>

Here, we’re loading the non-spatial elements into memory as a standard AnnData object.

[9]:
adata = tiledbsoma.io.to_anndata(
    experiment=exp,
    measurement_name="RNA",
    X_layer_name="data",
)

adata
/home/jules/.pyenv/versions/3.12.7/envs/soma/lib/python3.12/site-packages/anndata/_core/anndata.py:401: FutureWarning: The dtype argument is deprecated and will be removed in late 2024.
  warnings.warn(
[9]:
AnnData object with n_obs × n_vars = 5756 × 18085
    var: 'gene_ids', 'feature_types', 'genome', 'isotype_control', 'normalized', 'pattern', 'read', 'secondary_name', 'sequence'

Let’s visualize the most highly expressed genes in the dataset.

[10]:
sc.pl.highest_expr_genes(adata)
../_images/notebooks_tutorial_spatial_16_0.png

Spatial components

The new SOMA Experiment contains spatial data for this experiment stored in the spatial property of the Experiment. If we view the spatial collection we there is only one Scene named scene0.

[11]:
scene = exp.spatial[scene_name]
scene
[11]:
<Scene 'file:///home/jules/Software/TileDB-Inc/TileDB-SOMA/apis/python/notebooks/data/CytAssist_FFPE_Protein_Expression_Human_Glioblastoma/soma/spatial/scene1' (open for 'r') (3 items)
    'img': 'file:///home/jules/Software/TileDB-Inc/TileDB-SOMA/apis/python/notebooks/data/CytAssist_FFPE_Protein_Expression_Human_Glioblastoma/soma/spatial/scene1/img' (unopened)
    'obsl': 'file:///home/jules/Software/TileDB-Inc/TileDB-SOMA/apis/python/notebooks/data/CytAssist_FFPE_Protein_Expression_Human_Glioblastoma/soma/spatial/scene1/obsl' (unopened)
    'varl': 'file:///home/jules/Software/TileDB-Inc/TileDB-SOMA/apis/python/notebooks/data/CytAssist_FFPE_Protein_Expression_Human_Glioblastoma/soma/spatial/scene1/varl' (unopened)>

The scene contains three folders:

img - A SOMA Collection that stores imagery data. In this example it contains an image pyramid for the high- and low-resolution slide.
obsl - A SOMA Collection storing location data in the form of SOMADataFrames defined on the obs somajoinid (or obs_id). In this example, this contains a single dataframe with basic information about the Visium spot locations and sizes.
varl - A collection that stores location data in form form of dataframes defined on the var somajoinid (or var_id). This collection is nested with the first layer mapping from measurement name to a collection storing dataframes. There is no feature spatial data in the Visium example, so this collection is empty.

Inside the img collection, there is a Image2DCollection storing the slide images. Here we view basic information about the slide zoom levels and read the data in the hires image.

The scene is defined on a pixel coordinate space.

[12]:
scene.coordinate_space
[12]:
CoordinateSpace(axes=(Axis(name='x', unit='pixels'), Axis(name='y', unit='pixels')))

Images

Inside the img collection, there is a MultiscaleImage storing the slide images. Here we view basic information about the slide zoom levels and read the data in the hires image.

[13]:
tissue_image = scene.img["tissue"]
tissue_image
[13]:
<MultiscaleImage 'file:///home/jules/Software/TileDB-Inc/TileDB-SOMA/apis/python/notebooks/data/CytAssist_FFPE_Protein_Expression_Human_Glioblastoma/soma/spatial/scene1/img/tissue' (open for 'r')>
[14]:
tissue_image.coordinate_space
[14]:
CoordinateSpace(axes=(Axis(name='x', unit='pixels'), Axis(name='y', unit='pixels')))

Use the level_count property to get the number of levels in the MultiscaleImage.

[15]:
tissue_image.level_count
[15]:
2

Examine the metadata for each image level in the “tissue” collection.

[16]:
for level in range(tissue_image.level_count):
    print(f"Level {level} shape: {tissue_image.level_shape(level)}")
Level 0 shape: (3, 2000, 1744)
Level 1 shape: (3, 600, 523)

Accessing individual elements of the MultiscaleImage collection returns a DenseNDArray that can be read using the standard API.

[17]:
lowres = tissue_image["lowres"]
lowres
[17]:
<DenseNDArray 'file:///home/jules/Software/TileDB-Inc/TileDB-SOMA/apis/python/notebooks/data/CytAssist_FFPE_Protein_Expression_Human_Glioblastoma/soma/spatial/scene1/img/tissue/lowres' (open for 'r')>
[18]:
im = lowres.read().to_numpy()
[19]:
im = np.moveaxis(im, 0, -1).astype(np.uint8)  # Move channel axis to final axis for viewing.
plt.imshow(im)
[19]:
<matplotlib.image.AxesImage at 0xff6b402b8560>
../_images/notebooks_tutorial_spatial_32_1.png

Spatial Dataframe

The obsl collection stores location data in the loc dataframe. This dataframes stores the spot locations of the Visium dataset with soma_joinid matching those used in the obs dataframe in the root Experiment.

[20]:
spots_point_cloud = scene.obsl["loc"]
[21]:
spots_point_cloud.coordinate_space
[21]:
CoordinateSpace(axes=(Axis(name='x', unit='pixels'), Axis(name='y', unit='pixels')))
[22]:
spots = spots_point_cloud.read().concat().to_pandas()
spots
[22]:
x y soma_joinid in_tissue array_row array_col spot_diameter_fullres
0 -87 3078 5436 1 26 182 184.593855
1 17 680 1735 1 17 181 184.593855
2 28 1212 1410 1 19 181 184.593855
3 39 1745 1337 1 21 181 184.593855
4 165 411 166 1 16 180 184.593855
... ... ... ... ... ... ... ...
5751 22616 22038 4115 1 99 37 184.593855
5752 22913 21500 5220 1 97 35 184.593855
5753 23231 22026 5016 1 99 33 184.593855
5754 22638 23103 2766 1 103 37 184.593855
5755 22627 22570 4910 1 101 37 184.593855

5756 rows × 7 columns

[23]:
obs_df = exp.obs.read().concat().to_pandas()
obs_df
[23]:
soma_joinid obs_id
0 0 AACAATGGAACCACAT-1
1 1 AACAATGTGCTCCGAG-1
2 2 AACACCAGCCTACTCG-1
3 3 AACACCATTCGCATAC-1
4 4 AACACCGAATGTCTCA-1
... ... ...
5751 5751 TGTTGGCCTGTAGCGG-1
5752 5752 TGTTGGTGCGCACGAG-1
5753 5753 TGTTGGTGCGCTTCGC-1
5754 5754 TGTTGGTGCGGAATCA-1
5755 5755 TGTTGGTGGACTCAGG-1

5756 rows × 2 columns

[24]:
joinid_counts = spots.soma_joinid.value_counts().value_counts()
assert len(joinid_counts) == 1
joinid_counts
[24]:
count
1    5756
Name: count, dtype: int64

We take the data from the spot dataframe and create a plot showing the regions in the tissue.

[25]:
radius = scene.obsl["loc"].metadata["soma_geometry"]
spot_patches = PatchCollection([
    mplp.Circle((row["x"], row["y"]), radius=radius, color='b')
    for _, row in spots.iterrows()
])
[26]:
fig, ax = plt.subplots()
ax.set_xlim([0, 24592])
ax.set_ylim([0, 28202])
ax.invert_yaxis()
ax.add_collection(spot_patches)
plt.show()
../_images/notebooks_tutorial_spatial_41_0.png

Reading a region using coordinate transforms.

Using Coordinate Spaces and Transforms

Coordinate spaces and coordinate transforms can be used to project both the Visium spots and the same image.

The scene allows us to check the coordinate spaces of individual elements.

[27]:
scene.get_transform_to_multiscale_image("tissue")
[27]:
<ScaleTransform
  input axes: ('x', 'y')
  output axes: ('x', 'y')
  scales: [0.07091737 0.07091696]>
[28]:
scene.get_transform_to_multiscale_image("tissue", level=1)
[28]:
<ScaleTransform
  input axes: ('x', 'y')
  output axes: ('x', 'y')
  scales: [0.02126708 0.02127509]>
[29]:
scene.get_transform_to_point_cloud_dataframe("loc")
[29]:
<IdentityTransform
  input axes: ('x', 'y')
  output axes: ('x', 'y')>

We want to take a piece of the region that both the multiscale image and point cloud are defined on to do the following:

  • Read the region from the highest resolution level of the multiscale image.

  • Read the region from the point cloud storing the spot locations.

  • Use the transformation from the multiscale image to adjust and output of the point cloud and plot the two items together.

[30]:
x_min, x_max = (0, 12000)
y_min, y_max = (14000, 24592)
fullres_region = [x_min, y_min, x_max, y_max]
fullres_to_image = scene.get_transform_to_multiscale_image("tissue")
[31]:
hires_read = tissue_image.read_spatial_region(0,  fullres_region, region_transform=fullres_to_image)
hires_read
[31]:
SpatialRead(data=<pyarrow.Tensor>
type: uint8
shape: (3, 753, 853)
strides: (642309, 853, 1), data_coordinate_space=CoordinateSpace(axes=(Axis(name='x', unit='pixels'), Axis(name='y', unit='pixels'))), output_coordinate_space=CoordinateSpace(axes=(Axis(name='x', unit=None), Axis(name='y', unit=None))), coordinate_transform=<AffineTransform
  input axes: ('x', 'y')
  output axes: ('x', 'y')
  augmented matrix:
    [[1.41009174e+01 0.00000000e+00 0.00000000e+00]
     [0.00000000e+00 1.41010000e+01 1.39881920e+04]
     [0.00000000e+00 0.00000000e+00 1.00000000e+00]]>)
[32]:
hires_read.coordinate_transform.augmented_matrix
[32]:
array([[1.41009174e+01, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 1.41010000e+01, 1.39881920e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00]])

We take the data from the spot dataframe, convert it to the hires resolution, and plot the two images together.

[33]:
point_reader = scene.obsl["loc"].read_spatial_region(fullres_region)
point_reader
[33]:
SpatialRead(data=<tiledbsoma._read_iters.TableReadIter object at 0xff6afc7015e0>, data_coordinate_space=CoordinateSpace(axes=(Axis(name='x', unit='pixels'), Axis(name='y', unit='pixels'))), output_coordinate_space=CoordinateSpace(axes=(Axis(name='x', unit='pixels'), Axis(name='y', unit='pixels'))), coordinate_transform=<IdentityTransform
  input axes: ('x', 'y')
  output axes: ('x', 'y')>)
[34]:
spots = point_reader.data.concat().to_pandas()
spots
[34]:
x y soma_joinid in_tissue array_row array_col spot_diameter_fullres
0 1683 14221 1284 1 68 172 184.593855
1 1375 14227 3904 1 68 174 184.593855
2 1694 14753 1442 1 70 172 184.593855
3 1842 14484 3328 1 69 171 184.593855
4 2001 14747 3760 1 70 170 184.593855
... ... ... ... ... ... ... ...
923 11406 23069 5511 1 102 110 184.593855
924 11576 23864 4769 1 105 109 184.593855
925 11713 23063 5611 1 102 108 184.593855
926 11735 24127 4327 1 106 108 184.593855
927 11883 23858 5166 1 105 107 184.593855

928 rows × 7 columns

[35]:
spot_to_hires_matrix = hires_read.coordinate_transform.inverse_transform().augmented_matrix
spot_to_hires_matrix
[35]:
array([[ 7.09173715e-02,  0.00000000e+00,  0.00000000e+00],
       [ 0.00000000e+00,  7.09169562e-02, -9.92000000e+02],
       [ 0.00000000e+00,  0.00000000e+00,  1.00000000e+00]])
[36]:
scale_x = spot_to_hires_matrix[0, 0]
scale_y = spot_to_hires_matrix[1, 1]
offset_x = spot_to_hires_matrix[0, 2]
offset_y = spot_to_hires_matrix[1, 2]
[37]:
radius = scene.obsl["loc"].metadata["soma_geometry"]
spot_patches = PatchCollection([
    mplp.Ellipse(
        (scale_x * row["x"] + offset_x, scale_y * row["y"] + offset_y),
        width = radius * scale_x,
        height = radius * scale_y,
        fill=False,
        alpha=0.8,
    )
    for _, row in spots.iterrows()
])
[38]:
fig, ax = plt.subplots()
ax.imshow(np.moveaxis(hires_read.data, 0, -1).astype(np.uint8))
ax.add_collection(spot_patches)

plt.show()
../_images/notebooks_tutorial_spatial_57_0.png
[ ]: