StereoAnalysis

Stereo-seq Spatial Transcriptomics Pipeline

A modular Python pipeline for analyzing Stereo-seq spatial transcriptomics data from STOmics chips. Supports cell-bin and square-bin analysis modes, multi-resolution clustering, SingleR cell type annotation, LLM-based annotation, and HTML report generation.

Project Structure

stereo_analysis/
├── src/python/
│   ├── stereo_analysis.py          # Main analysis pipeline
│   ├── stereo_recluster.py         # Reclustering on subsets
│   ├── singleR_anno.py             # Standalone SingleR annotation
│   ├── parse_singleR_results.py    # Parse and summarize SingleR outputs
│   ├── plot_llm_annotations_spatial.py  # Spatial plots for LLM annotations
│   └── generate_stomics_report.py  # HTML report generation
├── results/                        # Output directory (auto-created)
│   ├── cellbin/
│   │   ├── figures/
│   │   ├── tables/
│   │   └── *.h5ad
│   ├── bin20/
│   └── bin50/
├── environment.yml
├── requirements.txt
└── README.md

Analysis Modes

Mode	Input	Unit	Use case
`cellbin`	`label.cellbin.gef`	Single cell	High-resolution, biology-driven
`bin20`	`square_bin.gef`	20×20 DNB bin	Fine-grained exploratory
`bin50`	`square_bin.gef`	50×50 DNB bin	Coarser regional structure
`all`	Both GEFs	All of above	Run all three in one go

Pipeline Steps

Input GEF
    │
    ├─ [0] Raw QC from lasso h5ad          (cellbin only)
    ├─ [1] Load GEF
    ├─ [2] QC metrics + plots
    ├─ [3] Filter → Normalize → Log1p → HVG → Scale → PCA
    ├─ [4] Neighbors → Spatial neighbors → UMAP
    ├─ [5] Multi-resolution Leiden clustering
    ├─ [6] Base Leiden clustering + per-cluster spatial plots
    ├─ [7] Marker gene detection
    ├─ [8] SingleR cell type annotation     (optional, requires --ref)
    ├─ [9] LLM-based cell type annotation   (optional, requires vLLM server)
    └─ [10] Export AnnData h5ad

After the main pipeline, downstream scripts handle:

Reclustering of specific clusters (stereo_recluster.py)
LLM-based annotation of clusters using vLLM-hosted models (plot_llm_annotations_spatial.py)
HTML report generation (generate_stomics_report.py)

LLM-Based Cell Type Annotation

In addition to reference-based annotation via SingleR, the pipeline supports LLM-driven cell type annotation using locally hosted open-source models. This approach uses marker genes identified per cluster as input prompts and returns structured cell type predictions without requiring a curated reference dataset.

Models Supported

Model	Size	Notes
LLaMA 3.1	8B	Meta; strong general biomedical reasoning
Mistral	7B	Efficient; good marker gene interpretation
Qwen 2.5	7B	Alibaba; competitive annotation quality

All three models are served via vLLM on Mayo HPC (rcfgpu12, 4× A100-SXM4-40GB GPUs) and queried through a local OpenAI-compatible API endpoint.

How It Works

Top marker genes per cluster (from step [7]) are passed as a structured prompt to the model
The model returns a predicted cell type and confidence rationale in JSON format
Predictions are parsed, validated, and mapped back onto spatial coordinates
Results are visualized as spatial scatter plots and summarized in the HTML report

Benchmarking

All three models (LLaMA, Mistral, Qwen) were benchmarked across bin20, bin50, and cellbin modes on gut tissue samples. Results were manually reviewed to assess annotation consistency, cluster coverage, and biological plausibility. Model outputs are saved separately to allow side-by-side comparison.

Usage

Single mode

# Cell-bin
python src/python/stereo_analysis.py \
    --gef  data/label.cellbin.gef \
    --h5ad data/lasso.cellbin.h5ad \
    --yid  Y40144M9 --block left \
    --mode cellbin

# Bin50
python src/python/stereo_analysis.py \
    --gef  data/square_bin.gef \
    --yid  Y40144M9 --block left \
    --mode bin50

All modes at once

python src/python/stereo_analysis.py \
    --gef      data/square_bin.gef \
    --gef-cell data/label.cellbin.gef \
    --h5ad     data/lasso.cellbin.h5ad \
    --ref      data/reference.h5ad \
    --yid      Y40144M9 --block left \
    --mode     all

With cell type annotation

python src/python/stereo_analysis.py \
    --gef  data/label.cellbin.gef \
    --h5ad data/lasso.cellbin.h5ad \
    --ref  data/reference.h5ad \
    --ref-col ClusterName \
    --annot-method cluster \
    --yid  Y40144M9 --block left

Key Parameters

Parameter	Default	Description
`--mode`	`cellbin`	Analysis mode: cellbin / bin20 / bin50 / all
`--resolutions`	`0.3 0.5 0.8 1.0 1.5`	Leiden resolution sweep
`--leiden-base`	`1.0`	Resolution for base clustering + markers
`--n-top-genes`	`2000`	Highly variable genes
`--n-pcs`	`30`	PCA components
`--min-counts`	`20`	Min UMI per cell (cellbin)
`--min-genes`	`5`	Min genes per cell (cellbin)
`--pct-mt`	`20`	Max mitochondrial % (cellbin)
`--bin-min-counts`	`50`	Min UMI per bin (bin modes)
`--ref`	`None`	Reference H5AD for SingleR annotation
`--annot-method`	`cluster`	SingleR granularity: cluster or cell
`--singler-method`	`cpu`	SingleR backend: cpu or rapids (GPU)

Outputs

For each mode, outputs are written to results/{mode}/:

Figures (figures/)

QC histograms, UMI vs genes scatter, violin plots
PCA elbow, HVG scatter
UMAP per resolution, spatial cluster scatter plots
Annotation spatial plots and composition bar charts
LLM annotation spatial plots per model (LLaMA, Mistral, Qwen)

Tables (tables/)

Per-cell/bin QC metrics
HVG gene lists
Cluster sizes per resolution
Top marker genes (filtered + full)
Cell type annotation per cell and summary
LLM annotation results per model with parsed predictions

H5AD — processed AnnData object for downstream use

Samples

Sample ID	Block	Notes
Y40144M9	left	Gut tissue
Y40144N7	left	Gut tissue
Y40144P6	left	Gut tissue

Environment

# Create environment
mamba env create -f environment.yml
conda activate stereopy

# Verify
python -c "import stereo as st; print(st.__version__)"

Requires Python 3.8. See SETUP.md for full instructions including SLURM usage and SingleR patch preservation.

Notes

Gene naming: pipeline auto-detects Ensembl IDs vs gene symbols and skips ribo/MT filtering accordingly
SingleR patch: stereo/algorithm/single_r/single_r.py has been patched — back up before upgrading stereopy
GPU: rapids backend for SingleR and vLLM inference both require a CUDA-enabled node (rcfgpu12, A100)
LLM server: vLLM must be running and accessible before invoking LLM annotation steps; see SETUP.md for startup instructions
EDM correction: cell boundary expansion is applied upstream during GEF generation by the STOmics Cell Bin pipeline, prior to this analysis

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

StereoAnalysis

Stereo-seq Spatial Transcriptomics Pipeline

Project Structure

Analysis Modes

Pipeline Steps

LLM-Based Cell Type Annotation

Models Supported

How It Works

Benchmarking

Usage

Single mode

All modes at once

With cell type annotation

Key Parameters

Outputs

Samples

Environment

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
config		config
src		src
README.md		README.md
environment.yml		environment.yml
readme.md		readme.md
requirements.txt		requirements.txt
setup.md		setup.md

Folders and files

Latest commit

History

Repository files navigation

StereoAnalysis

Stereo-seq Spatial Transcriptomics Pipeline

Project Structure

Analysis Modes

Pipeline Steps

LLM-Based Cell Type Annotation

Models Supported

How It Works

Benchmarking

Usage

Single mode

All modes at once

With cell type annotation

Key Parameters

Outputs

Samples

Environment

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages