Skip to content

dimi-lab/StereoAnalysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

StereoAnalysis

Stereo-seq Spatial Transcriptomics Pipeline

A modular Python pipeline for analyzing Stereo-seq spatial transcriptomics data from STOmics chips. Supports cell-bin and square-bin analysis modes, multi-resolution clustering, SingleR cell type annotation, LLM-based annotation, and HTML report generation.


Project Structure

stereo_analysis/
├── src/python/
│   ├── stereo_analysis.py          # Main analysis pipeline
│   ├── stereo_recluster.py         # Reclustering on subsets
│   ├── singleR_anno.py             # Standalone SingleR annotation
│   ├── parse_singleR_results.py    # Parse and summarize SingleR outputs
│   ├── plot_llm_annotations_spatial.py  # Spatial plots for LLM annotations
│   └── generate_stomics_report.py  # HTML report generation
├── results/                        # Output directory (auto-created)
│   ├── cellbin/
│   │   ├── figures/
│   │   ├── tables/
│   │   └── *.h5ad
│   ├── bin20/
│   └── bin50/
├── environment.yml
├── requirements.txt
└── README.md

Analysis Modes

Mode Input Unit Use case
cellbin label.cellbin.gef Single cell High-resolution, biology-driven
bin20 square_bin.gef 20×20 DNB bin Fine-grained exploratory
bin50 square_bin.gef 50×50 DNB bin Coarser regional structure
all Both GEFs All of above Run all three in one go

Pipeline Steps

Input GEF
    │
    ├─ [0] Raw QC from lasso h5ad          (cellbin only)
    ├─ [1] Load GEF
    ├─ [2] QC metrics + plots
    ├─ [3] Filter → Normalize → Log1p → HVG → Scale → PCA
    ├─ [4] Neighbors → Spatial neighbors → UMAP
    ├─ [5] Multi-resolution Leiden clustering
    ├─ [6] Base Leiden clustering + per-cluster spatial plots
    ├─ [7] Marker gene detection
    ├─ [8] SingleR cell type annotation     (optional, requires --ref)
    ├─ [9] LLM-based cell type annotation   (optional, requires vLLM server)
    └─ [10] Export AnnData h5ad

After the main pipeline, downstream scripts handle:

  • Reclustering of specific clusters (stereo_recluster.py)
  • LLM-based annotation of clusters using vLLM-hosted models (plot_llm_annotations_spatial.py)
  • HTML report generation (generate_stomics_report.py)

LLM-Based Cell Type Annotation

In addition to reference-based annotation via SingleR, the pipeline supports LLM-driven cell type annotation using locally hosted open-source models. This approach uses marker genes identified per cluster as input prompts and returns structured cell type predictions without requiring a curated reference dataset.

Models Supported

Model Size Notes
LLaMA 3.1 8B Meta; strong general biomedical reasoning
Mistral 7B Efficient; good marker gene interpretation
Qwen 2.5 7B Alibaba; competitive annotation quality

All three models are served via vLLM on Mayo HPC (rcfgpu12, 4× A100-SXM4-40GB GPUs) and queried through a local OpenAI-compatible API endpoint.

How It Works

  1. Top marker genes per cluster (from step [7]) are passed as a structured prompt to the model
  2. The model returns a predicted cell type and confidence rationale in JSON format
  3. Predictions are parsed, validated, and mapped back onto spatial coordinates
  4. Results are visualized as spatial scatter plots and summarized in the HTML report

Benchmarking

All three models (LLaMA, Mistral, Qwen) were benchmarked across bin20, bin50, and cellbin modes on gut tissue samples. Results were manually reviewed to assess annotation consistency, cluster coverage, and biological plausibility. Model outputs are saved separately to allow side-by-side comparison.


Usage

Single mode

# Cell-bin
python src/python/stereo_analysis.py \
    --gef  data/label.cellbin.gef \
    --h5ad data/lasso.cellbin.h5ad \
    --yid  Y40144M9 --block left \
    --mode cellbin

# Bin50
python src/python/stereo_analysis.py \
    --gef  data/square_bin.gef \
    --yid  Y40144M9 --block left \
    --mode bin50

All modes at once

python src/python/stereo_analysis.py \
    --gef      data/square_bin.gef \
    --gef-cell data/label.cellbin.gef \
    --h5ad     data/lasso.cellbin.h5ad \
    --ref      data/reference.h5ad \
    --yid      Y40144M9 --block left \
    --mode     all

With cell type annotation

python src/python/stereo_analysis.py \
    --gef  data/label.cellbin.gef \
    --h5ad data/lasso.cellbin.h5ad \
    --ref  data/reference.h5ad \
    --ref-col ClusterName \
    --annot-method cluster \
    --yid  Y40144M9 --block left

Key Parameters

Parameter Default Description
--mode cellbin Analysis mode: cellbin / bin20 / bin50 / all
--resolutions 0.3 0.5 0.8 1.0 1.5 Leiden resolution sweep
--leiden-base 1.0 Resolution for base clustering + markers
--n-top-genes 2000 Highly variable genes
--n-pcs 30 PCA components
--min-counts 20 Min UMI per cell (cellbin)
--min-genes 5 Min genes per cell (cellbin)
--pct-mt 20 Max mitochondrial % (cellbin)
--bin-min-counts 50 Min UMI per bin (bin modes)
--ref None Reference H5AD for SingleR annotation
--annot-method cluster SingleR granularity: cluster or cell
--singler-method cpu SingleR backend: cpu or rapids (GPU)

Outputs

For each mode, outputs are written to results/{mode}/:

Figures (figures/)

  • QC histograms, UMI vs genes scatter, violin plots
  • PCA elbow, HVG scatter
  • UMAP per resolution, spatial cluster scatter plots
  • Annotation spatial plots and composition bar charts
  • LLM annotation spatial plots per model (LLaMA, Mistral, Qwen)

Tables (tables/)

  • Per-cell/bin QC metrics
  • HVG gene lists
  • Cluster sizes per resolution
  • Top marker genes (filtered + full)
  • Cell type annotation per cell and summary
  • LLM annotation results per model with parsed predictions

H5AD — processed AnnData object for downstream use


Samples

Sample ID Block Notes
Y40144M9 left Gut tissue
Y40144N7 left Gut tissue
Y40144P6 left Gut tissue

Environment

# Create environment
mamba env create -f environment.yml
conda activate stereopy

# Verify
python -c "import stereo as st; print(st.__version__)"

Requires Python 3.8. See SETUP.md for full instructions including SLURM usage and SingleR patch preservation.


Notes

  • Gene naming: pipeline auto-detects Ensembl IDs vs gene symbols and skips ribo/MT filtering accordingly
  • SingleR patch: stereo/algorithm/single_r/single_r.py has been patched — back up before upgrading stereopy
  • GPU: rapids backend for SingleR and vLLM inference both require a CUDA-enabled node (rcfgpu12, A100)
  • LLM server: vLLM must be running and accessible before invoking LLM annotation steps; see SETUP.md for startup instructions
  • EDM correction: cell boundary expansion is applied upstream during GEF generation by the STOmics Cell Bin pipeline, prior to this analysis

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors