A modular Python pipeline for analyzing Stereo-seq spatial transcriptomics data from STOmics chips. Supports cell-bin and square-bin analysis modes, multi-resolution clustering, SingleR cell type annotation, LLM-based annotation, and HTML report generation.
stereo_analysis/
├── src/python/
│ ├── stereo_analysis.py # Main analysis pipeline
│ ├── stereo_recluster.py # Reclustering on subsets
│ ├── singleR_anno.py # Standalone SingleR annotation
│ ├── parse_singleR_results.py # Parse and summarize SingleR outputs
│ ├── plot_llm_annotations_spatial.py # Spatial plots for LLM annotations
│ └── generate_stomics_report.py # HTML report generation
├── results/ # Output directory (auto-created)
│ ├── cellbin/
│ │ ├── figures/
│ │ ├── tables/
│ │ └── *.h5ad
│ ├── bin20/
│ └── bin50/
├── environment.yml
├── requirements.txt
└── README.md
| Mode | Input | Unit | Use case |
|---|---|---|---|
cellbin |
label.cellbin.gef |
Single cell | High-resolution, biology-driven |
bin20 |
square_bin.gef |
20×20 DNB bin | Fine-grained exploratory |
bin50 |
square_bin.gef |
50×50 DNB bin | Coarser regional structure |
all |
Both GEFs | All of above | Run all three in one go |
Input GEF
│
├─ [0] Raw QC from lasso h5ad (cellbin only)
├─ [1] Load GEF
├─ [2] QC metrics + plots
├─ [3] Filter → Normalize → Log1p → HVG → Scale → PCA
├─ [4] Neighbors → Spatial neighbors → UMAP
├─ [5] Multi-resolution Leiden clustering
├─ [6] Base Leiden clustering + per-cluster spatial plots
├─ [7] Marker gene detection
├─ [8] SingleR cell type annotation (optional, requires --ref)
├─ [9] LLM-based cell type annotation (optional, requires vLLM server)
└─ [10] Export AnnData h5ad
After the main pipeline, downstream scripts handle:
- Reclustering of specific clusters (
stereo_recluster.py) - LLM-based annotation of clusters using vLLM-hosted models (
plot_llm_annotations_spatial.py) - HTML report generation (
generate_stomics_report.py)
In addition to reference-based annotation via SingleR, the pipeline supports LLM-driven cell type annotation using locally hosted open-source models. This approach uses marker genes identified per cluster as input prompts and returns structured cell type predictions without requiring a curated reference dataset.
| Model | Size | Notes |
|---|---|---|
| LLaMA 3.1 | 8B | Meta; strong general biomedical reasoning |
| Mistral | 7B | Efficient; good marker gene interpretation |
| Qwen 2.5 | 7B | Alibaba; competitive annotation quality |
All three models are served via vLLM on Mayo HPC (rcfgpu12, 4× A100-SXM4-40GB GPUs) and queried through a local OpenAI-compatible API endpoint.
- Top marker genes per cluster (from step [7]) are passed as a structured prompt to the model
- The model returns a predicted cell type and confidence rationale in JSON format
- Predictions are parsed, validated, and mapped back onto spatial coordinates
- Results are visualized as spatial scatter plots and summarized in the HTML report
All three models (LLaMA, Mistral, Qwen) were benchmarked across bin20, bin50, and cellbin modes on gut tissue samples. Results were manually reviewed to assess annotation consistency, cluster coverage, and biological plausibility. Model outputs are saved separately to allow side-by-side comparison.
# Cell-bin
python src/python/stereo_analysis.py \
--gef data/label.cellbin.gef \
--h5ad data/lasso.cellbin.h5ad \
--yid Y40144M9 --block left \
--mode cellbin
# Bin50
python src/python/stereo_analysis.py \
--gef data/square_bin.gef \
--yid Y40144M9 --block left \
--mode bin50python src/python/stereo_analysis.py \
--gef data/square_bin.gef \
--gef-cell data/label.cellbin.gef \
--h5ad data/lasso.cellbin.h5ad \
--ref data/reference.h5ad \
--yid Y40144M9 --block left \
--mode allpython src/python/stereo_analysis.py \
--gef data/label.cellbin.gef \
--h5ad data/lasso.cellbin.h5ad \
--ref data/reference.h5ad \
--ref-col ClusterName \
--annot-method cluster \
--yid Y40144M9 --block left| Parameter | Default | Description |
|---|---|---|
--mode |
cellbin |
Analysis mode: cellbin / bin20 / bin50 / all |
--resolutions |
0.3 0.5 0.8 1.0 1.5 |
Leiden resolution sweep |
--leiden-base |
1.0 |
Resolution for base clustering + markers |
--n-top-genes |
2000 |
Highly variable genes |
--n-pcs |
30 |
PCA components |
--min-counts |
20 |
Min UMI per cell (cellbin) |
--min-genes |
5 |
Min genes per cell (cellbin) |
--pct-mt |
20 |
Max mitochondrial % (cellbin) |
--bin-min-counts |
50 |
Min UMI per bin (bin modes) |
--ref |
None |
Reference H5AD for SingleR annotation |
--annot-method |
cluster |
SingleR granularity: cluster or cell |
--singler-method |
cpu |
SingleR backend: cpu or rapids (GPU) |
For each mode, outputs are written to results/{mode}/:
Figures (figures/)
- QC histograms, UMI vs genes scatter, violin plots
- PCA elbow, HVG scatter
- UMAP per resolution, spatial cluster scatter plots
- Annotation spatial plots and composition bar charts
- LLM annotation spatial plots per model (LLaMA, Mistral, Qwen)
Tables (tables/)
- Per-cell/bin QC metrics
- HVG gene lists
- Cluster sizes per resolution
- Top marker genes (filtered + full)
- Cell type annotation per cell and summary
- LLM annotation results per model with parsed predictions
H5AD — processed AnnData object for downstream use
| Sample ID | Block | Notes |
|---|---|---|
| Y40144M9 | left | Gut tissue |
| Y40144N7 | left | Gut tissue |
| Y40144P6 | left | Gut tissue |
# Create environment
mamba env create -f environment.yml
conda activate stereopy
# Verify
python -c "import stereo as st; print(st.__version__)"Requires Python 3.8. See SETUP.md for full instructions including SLURM usage and SingleR patch preservation.
- Gene naming: pipeline auto-detects Ensembl IDs vs gene symbols and skips ribo/MT filtering accordingly
- SingleR patch:
stereo/algorithm/single_r/single_r.pyhas been patched — back up before upgrading stereopy - GPU: rapids backend for SingleR and vLLM inference both require a CUDA-enabled node (rcfgpu12, A100)
- LLM server: vLLM must be running and accessible before invoking LLM annotation steps; see
SETUP.mdfor startup instructions - EDM correction: cell boundary expansion is applied upstream during GEF generation by the STOmics Cell Bin pipeline, prior to this analysis