Automated Annotation of Spatial Transcriptomics#
This notebook uses an automated pipeline to label spatial transcriptomic datasets. Here’s how it works:
Create adata from spatial transcriptomic data via binning.
Use UCE (Universal Cell Embedding) to embed both single cell and spatial transcriptomic data.
Train a classifier (on UCE of single cell data) to learn cell type labels.
Use the classifier to predict the cell type of each spatial bin.
This tutorial will use mouse single cell and spatial data. Merscope, Visium, VisiumHD, and Xenium are all supported.
The tutorial cell_type_annotation_with_a_label_transfer_model is also a good reference for the label transfer step.
To start, we need the UCE embeddings of the single cell data. Cellxgene census has the UCE embedding of the single cell data already.
Get Single Cell Data#
import anndict as adt
import cellxgene_census
census = cellxgene_census.open_soma(census_version="2023-12-15")
adata = cellxgene_census.get_anndata(
census,
organism = "mus_musculus",
measurement_name = "RNA",
obs_value_filter = "(tissue_general == 'heart') | (tissue_general == 'liver')",
obs_embeddings = ["uce"]
)
Next, we break the single cell adata into a per-tissue adata_dict (we’ll eventually train a separate classifier on each tissue).
#build dict
adata_dict = adt.build_adata_dict(adata, strata_keys=['tissue'], desired_strata=[('heart',), ('liver',)])
#Downsample dict and remove celltypes with a small number of cells
#This helps speed up classifier training.
adata_dict = adt.wrappers.sample_and_drop_adata_dict(adata_dict, strata_keys=['cell_type'], min_num_cells=50, n_obs=1000)
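The two steps above amount to grouping cells by tissue, discarding rare cell types, and capping how many cells are kept. The following is a minimal pure-Python sketch of that logic on toy records. The helper names `split_by` and `drop_rare_and_sample` and the `cells` records are hypothetical, and the sketch assumes the `n_obs` cap applies per cell type, which may differ from what `sample_and_drop_adata_dict` actually does.

```python
import random
from collections import defaultdict

def split_by(records, key):
    """Group records into a dict keyed by a 1-tuple, mirroring keys like ('heart',)."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec[key],)].append(rec)
    return dict(groups)

def drop_rare_and_sample(records, label_key, min_num_cells, n_obs, seed=0):
    """Drop labels with fewer than min_num_cells records, then keep at most n_obs per label.

    Assumption: the cap is applied per label; this is an illustration only.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)
    kept = []
    for label, group in by_label.items():
        if len(group) < min_num_cells:
            continue  # too few cells to train on reliably
        if len(group) > n_obs:
            group = rng.sample(group, n_obs)  # random downsample without replacement
        kept.extend(group)
    return kept

# Toy data: 5 hepatocytes and 1 rare cell in liver, 3 cardiomyocytes in heart
cells = (
    [{"tissue": "liver", "cell_type": "hepatocyte"}] * 5
    + [{"tissue": "liver", "cell_type": "rare"}]
    + [{"tissue": "heart", "cell_type": "cardiomyocyte"}] * 3
)
toy_dict = split_by(cells, "tissue")
toy_dict = {
    key: drop_rare_and_sample(group, "cell_type", min_num_cells=2, n_obs=4)
    for key, group in toy_dict.items()
}
```

Dropping rare cell types before training avoids fitting a classifier on classes with too few examples, and the cap keeps training time manageable.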
Load Spatial Data#
Now, load the spatial data.
The first step here is to create anndata from raw spatial data (i.e. transcript coordinates and identity stored in a file called transcripts.csv or detected_transcripts.csv)
AnnDictionary offers two ways to build adata from raw spatial data:
build_adata_from_transcript_positions(): each cell in this adata will contain all the transcripts from a box of a user-defined size.
build_adata_from_visium(): same thing, but the box size is already defined.
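To make the binning idea concrete, here is a minimal sketch that assigns each transcript to a square box by floor-dividing its coordinates, then counts genes per box. It covers only the non-overlapping case, where the step equals the box size. The function name `bin_transcripts`, the tuple layout of `transcripts`, and the toy coordinates are assumptions for illustration, not AnnDictionary's implementation.

```python
from collections import Counter, defaultdict

def bin_transcripts(transcripts, box_size):
    """Count transcripts per gene within non-overlapping square boxes.

    transcripts: iterable of (x, y, gene) tuples (hypothetical layout).
    Returns {(col, row): Counter({gene: count})}.
    """
    bins = defaultdict(Counter)
    for x, y, gene in transcripts:
        # Floor division maps a coordinate to the index of the box containing it
        key = (int(x // box_size), int(y // box_size))
        bins[key][gene] += 1
    return dict(bins)

# Toy transcript table: (x, y, gene)
toy_transcripts = [
    (1.0, 2.0, "Alb"),
    (15.9, 3.0, "Alb"),
    (16.0, 3.0, "Myh6"),
    (40.0, 20.0, "Alb"),
]
counts = bin_transcripts(toy_transcripts, box_size=16)
```

In this picture, each box key would correspond to one observation in the resulting adata, with the per-gene counts forming its expression vector.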
For this tutorial, we use build_adata_from_transcript_positions() because we’re dealing with Merscope data, but the syntax is similar for build_adata_from_visium().
#This dictionary should be {input_path: output_path}, where input_path is a csv file path, and output_path is where the anndata will be written
#Note, input paths can be .csv or .parquet!
paths_dict = {
'~/dat/detected_transcripts_liver.csv': '~/dat/liver_st_merscope.h5ad',
'~/dat/detected_transcripts_heart.csv': '~/dat/heart_st_merscope.h5ad'
}
#This function should be used to generate adata from merscope or xenium output. For Visium you can use adt.build_adata_from_visium(paths_dict, hd=False) (see docs, set hd=True for Visium HD)
adt.build_adata_from_transcript_positions(paths_dict, box_size=16, step_size=16, platform="Merscope")
#Commented-out example for Visium HD
# paths_dict = {
# '~/visium_hd_runs/liver/16_micron_binsize': '~/dat/liver_visium_hd.h5ad',
# '~/visium_hd_runs/heart/16_micron_binsize': '~/dat/heart_visium_hd.h5ad'
# }
#Generate adata from visium
# adt.build_adata_from_visium(paths_dict, hd=True)
Next, we need to calculate the UCE embedding of the Merscope data. anndict has a function for that. Note: the function below will work as written, but it is included for demonstration purposes only. It should be run with access to a GPU to reduce computation time.
adt.UCE_adata(['~/dat/liver_st_merscope.h5ad',
'~/dat/heart_st_merscope.h5ad'])
#Load the UCE embeddings of the spatial data as an adata_dict
#Note: it's important that the keys of spatial_dict match the keys of adata_dict
spatial_dict = adt.read_adata_dict_from_h5ad(['~/UCE/uce_wd/heart_st_merscope_uce_adata.h5ad', '~/UCE/uce_wd/liver_st_merscope_uce_adata.h5ad'], keys=[('heart',), ('liver',)])
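Because each tissue's single cell data is later paired with the spatial data stored under the same key, a quick check that both dictionaries share identical keys can catch mismatches early. The sketch below uses plain dicts as stand-ins and assumes the adata dictionaries behave like ordinary dictionaries. `check_matching_keys` is a hypothetical helper, not an anndict function.

```python
def check_matching_keys(reference, target):
    """Raise ValueError if the two dicts do not have identical key sets."""
    missing = set(reference) - set(target)
    extra = set(target) - set(reference)
    if missing or extra:
        raise ValueError(
            f"missing in target: {sorted(missing)}; unexpected in target: {sorted(extra)}"
        )
    return True

# Plain dicts standing in for adata_dict and spatial_dict (values are placeholders)
reference_toy = {("heart",): "sc_heart", ("liver",): "sc_liver"}
target_toy = {("liver",): "st_liver", ("heart",): "st_heart"}
keys_ok = check_matching_keys(reference_toy, target_toy)
```

Key order does not matter here, only that every tuple key such as `('heart',)` appears in both dictionaries.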
Predict Cell Types#
#Transfer the labels
from sklearn.linear_model import LogisticRegression
# Note, if SLURM_CPUS_PER_TASK and/or SLURM_NTASKS environment variables are set, the function will automatically determine number of cores and multithread using adt.get_slurm_cores()
adt.adata_dict_fapply(
adata_dict,
adt.transfer_labels_using_classifier,
destination_adata=spatial_dict,
origin_label_key="cell_type",
feature_key="uce", #the key in origin_adata.obsm that contains the features you want to use for label transfer
classifier=LogisticRegression,
new_column_name="predicted_cell_type",
random_state=42 #for reproducibility
)
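Conceptually, the label transfer for one tissue is a standard fit-then-predict step: fit a classifier on the labeled single cell embeddings, then predict a label for each spatial bin's embedding. The sketch below illustrates this with scikit-learn on synthetic data. The arrays `origin_features`, `origin_labels`, and `destination_features` are made-up stand-ins (with real data they would presumably come from the "uce" entry in `.obsm` and the "cell_type" column), and this is not a description of how `transfer_labels_using_classifier` is implemented internally.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic stand-ins for embeddings: two well-separated clusters in 8 dimensions
origin_features = np.vstack([
    rng.normal(loc=-3.0, scale=0.5, size=(50, 8)),
    rng.normal(loc=3.0, scale=0.5, size=(50, 8)),
])
origin_labels = np.array(["hepatocyte"] * 50 + ["endothelial cell"] * 50)

# Stand-in for the spatial bins' embeddings: one point at each cluster center
destination_features = np.vstack([
    np.full((1, 8), -3.0),
    np.full((1, 8), 3.0),
])

# Fit on the labeled single cell embeddings, then predict labels for the spatial bins
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(origin_features, origin_labels)
predicted = clf.predict(destination_features)
```

This only works when both datasets are embedded in the same feature space, which is why the single cell and spatial data are both embedded with UCE before this step.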
Plot#
#Plot the results
adt.wrappers.plot_spatial_adata_dict(spatial_dict, ['predicted_cell_type'])
Save#
#save the labeled data
adt.write_adata_dict(spatial_dict, filename="path/to/your/labeled/spatial_dict")