Spatial and single-cell Transcriptomics Integration Tool for CHaracterization (STITCH)

Introduction

This Nextflow pipeline provides a comprehensive framework for tertiary analysis of spatial transcriptomics and single-cell RNA-sequencing (scRNA-seq) data. Supported data types are Visium, VisiumHD, Seeker, Trekker, and scRNA-seq. The pipeline includes the following modules:

  1. Quality Control (QC): Evaluate the quality of the input data and identify thresholds for QC metrics (adaptive or hard thresholds).
  2. Normalization: Apply data normalization to ensure comparability between samples (SCT, LogNormalize, scran, SpaNorm, or TFIDF).
  3. Spatially Variable Gene (SVG) Identification: Identify spatially variable genes (moransi, SparkX).
  4. Data Integration and Merging: Combine multiple samples using integration-based analysis (for strong batch effects; cca, rpca, harmony, fastmnn, scvi) or merge-based analysis (for minimal batch effects or differing sample-to-sample compositions).
  5. Clustering: Identify clusters of cells or regions based on gene expression (Louvain, Leiden) and/or spatial context (Banksy).
  6. Cell Type Annotation: Assign cell type identities to clusters via deconvolution (RCTD) or reference-based mapping (Seurat).
  7. Differential Expression Analysis: Detect differentially expressed genes across conditions or clusters (wilcox, MAST, DESeq2, etc.).
  8. Gene Set Enrichment Analysis (GSEA): Perform pathway and gene set enrichment analysis on marker and differential expression results (fgsea).
  9. Co-localization Analysis: Evaluate spatial co-localization and separation of cell types across spatial scales (spatial data only, CRAWDAD).
  10. Reporting: Generate summary reports for QC and a final analysis report.

This pipeline is designed to provide reproducible and efficient analysis workflows, generating both intermediate and final outputs. It is largely built on the Seurat framework.


Setup Instructions

We recommend running the pipeline via Singularity or Docker. The Docker container can be downloaded from Docker Hub. Alternatively, you can set up a local environment by following steps 1 and 2 below.

1. Clone the Repository

Clone the pipeline repository to your local machine using the following command:

git clone https://github.com/Liuy12/STITCH.git
cd STITCH
## optional, specify .cache directory for renv
mkdir .cache/
export RENV_PATHS_CACHE="$PWD/.cache/"

2. Install R and Python Dependencies

Ensure you have R version 4.4.1 installed and loaded.

## optional
## e.g. load required R version via module
module load r/4.4.1
## load pandoc
module load pandoc
## load nextflow
module load nextflow

Open a new R session, then use renv to restore the required R packages:

renv::restore()

This will install all the necessary R packages specified in the repository, and might take a while.

Create a conda environment from the environment.yml file:

conda env create -f environment.yml

3. Prepare the Sample Information Sheet

Create a sample information sheet in tab-delimited format with at least three columns. The first three column names must be "sampleid", "condition", and "secondary_output". All fields must be tab-separated; otherwise, the pipeline will fail.

  • sampleid: Unique identifier for each sample. Sample ids must not contain spaces or special characters.
  • condition: Experimental condition for the sample, e.g. Control or Case.
  • secondary_output: Path to the secondary output directory from Cell Ranger (scRNA-seq) or Space Ranger (Visium). DO NOT set this to the 'filtered_feature_bc_matrix' folder. Set it to the 'outs' folder that includes all output from Cell Ranger/Space Ranger.
  • cellsel: Optional fourth column. Path to a file listing the cells/barcodes to include for each sample. For Seeker data, this could list the beads that are on tissue. This field is also useful when you want to restrict the analysis to a defined subset of cells/barcodes.

An example samplesheet.tsv (columns must be separated by tabs). Note that secondary_output is set to the 'outs' folder:

sampleid    condition   secondary_output
sample1 control /path/to/sample1/outs
sample2 treatment   /path/to/sample2/outs
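
Because a malformed sheet makes the pipeline fail, it can help to check the file before launching. The following is a minimal shell sketch and is not part of the pipeline: the function name check_samplesheet is hypothetical, and the allowed character set for sample ids (letters, digits, underscore) is an assumption based on the rule above.

```shell
# Sketch only: pre-flight checks for a STITCH sample sheet.
# check_samplesheet is a hypothetical helper, not shipped with the pipeline.
check_samplesheet() {
  awk -F'\t' '
    NR == 1 {
      # The first three column names are fixed by the pipeline.
      if ($1 != "sampleid" || $2 != "condition" || $3 != "secondary_output") {
        print "line 1: header must begin with sampleid, condition, secondary_output (tab-separated)"
        bad = 1
      }
      next
    }
    /^[ \t]*$/ { next }   # skip blank lines
    NF < 3 {
      print "line " NR ": expected at least 3 tab-separated fields, found " NF
      bad = 1
      next
    }
    # Assumption: "no special characters" means letters, digits, underscore only.
    $1 !~ /^[A-Za-z0-9_]+$/ {
      print "line " NR ": sampleid \"" $1 "\" contains spaces or special characters"
      bad = 1
    }
    # Sample ids must be unique.
    seen[$1]++ {
      print "line " NR ": duplicate sampleid \"" $1 "\""
      bad = 1
    }
    # secondary_output should be the outs folder, not the matrix folder inside it.
    $3 ~ /filtered_feature_bc_matrix\/?$/ {
      print "line " NR ": secondary_output points at filtered_feature_bc_matrix; use the outs folder"
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

Call it as `check_samplesheet samplesheet.tsv`. It prints one message per problem and returns a non-zero status if any check fails, so it can gate the nextflow command in a launch script.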

4. Modify the Configuration File

Adjust the provided configuration file (e.g., nextflow.config.scRNAseq.human or nextflow.config.spatial.human) to suit your analysis. Some key parameters to examine/modify include:

  • generic_data_type: one of Visium, VisiumHD, Seeker, Trekker, scRNAseq
  • generic_feature_list: Path to a file listing genes of interest, one gene per line. The final report will include visualizations of expression levels for these genes. Set to 'NA' to disable.
  • generic_identity_file: A file containing two columns (cellid and identity). This is useful when you want to perform subcluster analysis on selected cells after the sample integration/merging step, or want to provide a refined identity for downstream analysis, e.g. DEA or co-localization. Set to "NA" to disable.
  • generic_output_dir: Path to pipeline output directory.
  • generic_geneinfo: Path to the gene-level annotation file, used to add feature-level metadata. Make sure to check the version of the reference used.
  • qc_qc_only: Whether to stop the pipeline after QC. This helps identify cutoffs for the various QC metrics. The recommended sequence is: set qc_qc_only to true, evaluate the QC report, update the QC cutoffs if needed, then set qc_qc_only to false and resume the pipeline.
  • qc_adaptive_cutoff_flag: Whether to apply adaptive cutoff identification (based on IQR). Rather than applying the same cutoffs to all samples, this derives sample-specific cutoffs from the distribution of QC metrics within each sample. It is useful if you expect metric distributions to differ across samples. The recommended sequence is: always turn on qc_adaptive_cutoff_flag during the QC step, evaluate the QC report, then decide whether to keep it enabled or disable it.
  • norm_norm_dimreduc: Normalization method for dimension reduction/differential testing.
  • norm_norm_diff: Normalization method for differential testing.
  • norm_cellcycle_correction_flag: Whether to estimate and correct for cell-cycle effect for clustering.
  • combine_merge_analysis/combine_integration_analysis: Whether to perform merge-based/integration-based analysis.
  • combine_merge_only/combine_integration_only: Whether to stop after merge-based/integration-based analysis. We recommend always enabling this so you can examine the results before proceeding to DE analysis; at that point you can also supply a cell identity file via generic_identity_file.
  • combine_integration_method: Integration strategy.
  • combine_sketch_flag: Whether to use the sketch-based workflow. This can help with large datasets. However, enabling on-disk operations with BPCells via combine_bpcells_flag is typically sufficient for large datasets.
  • combine_bpcells_flag: Whether to perform on-disk operations via BPCells. Recommended for large datasets to reduce memory usage.
  • cluster_resolution: Resolution parameter used to identify number of clusters.
  • cluster_embed_method: Method for spatial embedding.
  • diff_idents: One of "cluster" (unsupervised clusters), "decon_cell_type" (if deconvolution analysis is enabled), or "map_cell_type" (if mapping-based analysis is enabled).
  • diff_control_var/diff_case_var: Control/case group for differential expression analysis.
  • diff_covariate_list: Covariates to adjust for when performing differential analysis between conditions.
  • diff_test: Statistical test.
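
To give a feel for what qc_adaptive_cutoff_flag does, the sketch below derives sample-specific cutoffs from the IQR of a single QC metric. It is an illustration only. The exact rule the pipeline applies is not described here, so the 1.5 x IQR fences (Tukey's rule), the function name iqr_cutoffs, and the one-value-per-line input format are all assumptions.

```shell
# Sketch only: IQR-based lower/upper cutoffs for one QC metric of one sample.
# Input: a file with one plain decimal value per line (hypothetical format).
# Usage: iqr_cutoffs FILE [MULTIPLIER]   (multiplier defaults to 1.5, an assumption)
iqr_cutoffs() {
  sort -n "$1" | awk -v k="${2:-1.5}" '
    # Quantile by linear interpolation between neighbouring sorted values.
    function quantile(p,    h, lo) {
      h = (NR - 1) * p + 1
      lo = int(h)
      if (lo >= NR) return v[NR]
      return v[lo] + (h - lo) * (v[lo + 1] - v[lo])
    }
    { v[NR] = $1 }   # values arrive already sorted
    END {
      if (NR == 0) exit 1
      q1 = quantile(0.25); q3 = quantile(0.75); iqr = q3 - q1
      printf "lower=%g upper=%g\n", q1 - k * iqr, q3 + k * iqr
    }
  '
}
```

Running this once per sample gives each sample its own pair of thresholds, whereas hard thresholds apply the same values to every sample.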

Example config files are provided for human and mouse:

  • nextflow.config.scRNAseq.human: human scRNAseq.
  • nextflow.config.scRNAseq.mouse: mouse scRNAseq.
  • nextflow.config.spatial.human: human spatial (Visium/VisiumHD/Seeker/Trekker).
  • nextflow.config.spatial.mouse: mouse spatial (Visium/VisiumHD/Seeker/Trekker).

5. Run the Pipeline

Execute the pipeline with the following command:

nextflow run main.nf --samplesheet samplesheet.tsv -c nextflow.config.spatial.human -work-dir ./work
  • --samplesheet: Path to the prepared sample sheet.
  • -c: Specifies the configuration file.
  • -work-dir: Specifies the working directory where Nextflow stores intermediate files.

The pipeline currently supports the following profiles: local (default, -profile local), slurm (-profile slurm), local_apptainer (-profile local_apptainer), slurm_apptainer (-profile slurm_apptainer), and batch (-profile batch). You can modify the config file to suit your own needs. Add the -resume option to the command to resume a previous run.
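
The QC-first sequence recommended for qc_qc_only can be driven from the command line. The sketch below assumes that qc_qc_only is an ordinary Nextflow parameter, so a `--qc_qc_only` option on the command line overrides the value in the config file. stitch_run is a hypothetical wrapper, and the slurm profile and file names are placeholders taken from the examples above.

```shell
# Sketch only: wrap the run command so the fixed options are written once.
# stitch_run is a hypothetical helper; swap in your own config and profile.
# Set DRY_RUN=1 to print the command instead of executing it.
stitch_run() {
  # Prepend the fixed part of the command to any extra options passed in.
  set -- nextflow run main.nf \
    --samplesheet samplesheet.tsv \
    -c nextflow.config.spatial.human \
    -work-dir ./work \
    -profile slurm \
    "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Pass 1: stop after QC, then inspect the QC report.
#   stitch_run --qc_qc_only true
# Pass 2: after updating cutoffs in the config if needed, continue and
# reuse the cached results from pass 1.
#   stitch_run --qc_qc_only false -resume
```

Printing the command first with DRY_RUN=1 is a cheap way to confirm which options will be passed before submitting a long run.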


Parameter naming

This pipeline uses a small, explicit prefixing convention for Nextflow params to group related settings and make configuration files easier to scan and maintain. Prefixes are lower-case to follow common Nextflow/Unix conventions and to avoid case-sensitivity issues across platforms. Current prefixes and their meanings:

  • generic_ : general workflow settings (samplesheet, workflow path, author, etc.)
  • qc_ : quality-control related settings (flags and cutoffs)
  • norm_ : normalization-related settings
  • cluster_ : clustering and visualization settings
  • decon_ : cell type deconvolution settings
  • map_ : reference-based mapping settings
  • combine_ : merge/integration settings
  • diff_ : differential expression settings
  • coloc_ : co-localization analysis settings (spatial only)
  • gsea_ : gene-set enrichment analysis settings

Notes

  • Ensure all required dependencies (Nextflow, R, and other tools) are installed and configured.
  • Customize the pipeline to suit your specific data and experimental design.
  • For further assistance, consult the documentation or open an issue in the repository.
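
The first note can be checked quickly from the shell. This sketch only tests whether each tool is on the PATH; it does not verify versions (for example R 4.4.1). The function name check_tools and the tool list in the usage comment are assumptions for a local, non-container setup.

```shell
# Sketch only: report which command-line tools are missing from PATH.
# check_tools is a hypothetical helper; adjust the tool list to your setup.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  # Non-zero status if at least one tool was not found.
  return "$missing"
}

# Example call for a local setup:
#   check_tools nextflow R conda pandoc
```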
