EVscope: A Comprehensive Bioinformatics Pipeline for Accurate and Robust Total RNA Sequencing Analysis of Extracellular Vesicles
EVscope is an open-source, modular bioinformatics pipeline designed for the analysis of extracellular vesicle (EV) RNA sequencing data. Tailored to address the challenges of EV RNA-seqβlow RNA yield, fragmented transcripts, diverse RNA biotypes, and contaminationβEVscope processes paired-end or single-end FASTQ files through a robust, end-to-end workflow. It includes quality control, UMI-based deduplication, two-pass STAR alignment, circular RNA detection, expression quantification, contamination screening, tissue deconvolution, and comprehensive reporting. Optimized for the SMARTer Stranded Total RNA-Seq Kit v3 (Pico Input), EVscope introduces a novel expectation-maximization (EM) algorithm for multi-mapping read assignment and a unique read-through detection method for Read1 trimming.
- Key Features
- Motivation
- Directory Structure
- Requirements
- Installation
- Usage
- Input Data Format
- Pipeline Steps
- Output Structure
- Troubleshooting
- FAQ
- Contributing
- Feedback
- Citation
- Credits
- License
- Contact
- Novel Read-Through Detection: Trims UMI-derived adapter sequences from Read1 using reverse-complemented Read2 UMIs (
bin/Step_03_UMIAdapterTrimR1.py). - EM-Based Multi-Mapping Resolution: Assigns multi-mapped reads at single-base resolution using a genome-wide expectation-maximization algorithm (
bin/Step_25_EMapper.py). - Comprehensive RNA Annotation: Supports 3,659,642 RNAs across 20 biotypes (e.g., protein-coding, lncRNAs, miRNAs, piRNAs, retrotransposons) from GENCODE v45, piRBase v3.0, and RepeatMasker.
- Dual circRNA Detection: Integrates CIRCexplorer2 and CIRI2 for robust circular RNA identification, with merged results for enhanced sensitivity (
bin/Step_10_circRNA_merge.py). - Tissue Deconvolution: Infers EV RNA cellular origins using GTEx v10 and Human Brain Cell Atlas v1.0 references (
bin/Step_22_run_RNA_deconvolution_ARIC.py). - Contamination Screening: Filters bacterial (BBSplit) and microbial (Kraken2) contamination, with optional genomic DNA correction via strand-specific subtraction.
- Extensive Quality Control: Validates raw and trimmed FASTQs (FastQC), UMI motifs, and alignment metrics (
bin/Step_24_generate_QC_matrix.py). - Expression Quantification: Produces TPM/CPM matrices using featureCounts and RSEM, with RNA distribution visualizations (
bin/Step_15_plot_RNA_distribution_*.py). - Interactive Reporting: Generates bigWig tracks, density plots, and a comprehensive HTML report via R Markdown (
bin/Step_27_html_report.Rmd). - Reproducibility: Single-command Bash script with Conda environments, containerization support, and detailed logging.
Extracellular vesicles (EVs) are critical mediators of intercellular communication, carrying diverse RNAs that serve as potential biomarkers for diseases like cancer and neurodegeneration. However, EV RNA sequencing faces unique challenges: low RNA abundance, fragmented transcripts, contamination from genomic DNA or bacterial RNA, and the presence of non-polyadenylated RNAs (e.g., miRNAs, lncRNAs). Standard RNA-seq pipelines, designed for cellular RNA, often fail to address these issues, leading to unreliable results due to multi-mapping reads, incomplete RNA annotations, or unfiltered contaminants.
EVscope provides a specialized, end-to-end pipeline optimized for EV total RNA-seq. By integrating innovative algorithms, comprehensive RNA annotations, and robust quality control, EVscope delivers accurate, reproducible results, enabling biomarker discovery and advancing EV research.
The EVscope repository is organized as follows:
EVscope/
βββ EVscope.conf # Configuration file for tool and reference paths
βββ EVscope.sh # Main pipeline script (v1.0.0)
βββ README.md # This documentation
βββ bin/ # Custom scripts for pipeline steps
β βββ Step_02_calculate_ACC_motif_fraction.py # Calculates ACC motif fractions
β βββ Step_02_plot_fastq2UMI_motif.py # Visualizes UMI motif distributions
β βββ Step_03_plot_fastq_read_length_dist.py # Plots read length distributions
β βββ Step_03_UMIAdapterTrimR1.py # Trims UMI-derived adapters
β βββ Step_07_bam2strand.py # Determines library strandedness
β βββ Step_08_convert_CIRCexplorer2CPM.py # Normalizes CIRCexplorer2 circRNA output
β βββ Step_09_convert_CIRI2CPM.py # Normalizes CIRI2 circRNA output
β βββ Step_10_circRNA_merge.py # Merges circRNA results
β βββ Step_13_gDNA_corrected_featureCounts.py # Generates gDNA-corrected counts
β βββ Step_15_combine_total_RNA_expr_matrix.py # Combines RNA expression matrices
β βββ Step_15_featureCounts2TPM.py # Converts featureCounts to TPM
β βββ Step_15_plot_RNA_distribution_1subplot.py # RNA distribution plots (1 subplot)
β βββ Step_15_plot_RNA_distribution_2subplots.py # RNA distribution plots (2 subplots)
β βββ Step_15_plot_RNA_distribution_20subplots.py # RNA distribution plots (20 subplots)
β βββ Step_15_plot_top_expressed_genes.py # Plots top expressed genes
β βββ Step_17_RSEM2expr_matrix.py # Converts RSEM to expression matrix
β βββ Step_18_plot_reads_mapping_stats.py # Visualizes genomic region mapping
β βββ Step_22_run_RNA_deconvolution_ARIC.py # Performs tissue deconvolution
β βββ Step_24_generate_QC_matrix.py # Compiles QC metrics
β βββ Step_25_bigWig2Expression.py # Converts bigWig to CPM/TPM
β βββ Step_25_EMapper.py # EM-based read coverage estimation
β βββ Step_26_density_plot_over_meta_gene.sh # Density plots for meta-gene regions
β βββ Step_26_density_plot_over_RNA_types.sh # Density plots for RNA types
β βββ Step_27_html_report.Rmd # Generates HTML report
βββ figures/ # Pipeline visualization
β βββ EVscope_pipeline.png # Pipeline overview image
βββ references/ # Reference genomes, annotations, and indices
β βββ annotations_HG38/ # Human genome annotations
β βββ deconvolution_HG38/ # Deconvolution reference matrices
β βββ genome/ # Reference genomes
β βββ index/ # Aligner indices
βββ soft/ # Bundled external tools
βββ bbmap # BBMap tools
βββ CIRI_v2.0.6 # CIRI2 for circRNA detection
βββ kraken2 # Kraken2 for taxonomic classification
βββ KrakenTools # Kraken2 helper scripts
βββ RSEM_v1.3.3 # RSEM for quantification
- Operating System: Linux (e.g., Ubuntu 20.04+) or macOS.
- Bash: Version 4.0 or higher.
- Conda: Miniconda or Anaconda for environment management.
- Core Tools:
- FastQC (v0.12.1), umi_tools (v1.1.5), cutadapt (v4.9)
- STAR (v2.7.11b), samtools (v1.21), featureCounts (v2.0.6)
- CIRCexplorer2 (v2.3.8), CIRI2 (v2.0.6), RSEM (v1.3.3)
- BBMap (v39.15), Kraken2, ribodetector (v0.3.1)
- seqtk (v1.4), BWA (v0.7.18), Picard (v3.3.0), deepTools (v3.5.5)
- R (v4.3.1) with rmarkdown, DT, kableExtra, bookdown, ggplot2, dplyr
- Python (v3.10.0) with pandas, numpy, matplotlib, biopython, numba, pyBigWig, pysam
- CPU: 20+ threads recommended for optimal performance.
- RAM: Minimum 64 GB; 250 GB recommended for Picard tools.
- Storage: 500 GB+ for input data, references, and outputs.
-
Clone the Repository:
git clone https://github.com/TheDongLab/EVscope.git cd EVscope -
Install Conda (if not already installed):
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh bash Miniconda3-latest-Linux-x86_64.sh source ~/.bashrc
-
Create Conda Environments:
conda env create -f environments/evscope_env.yml conda env create -f environments/picard_env.yml conda env create -f environments/kraken2_env.yml
-
Install CIRCexplorer2:
conda activate evscope_env pip install CIRCexplorer2==2.3.8
-
Download Reference Files: Reference annotation files (HG38, GENCODE v45) are available on Zenodo. Download
EVscope_annotations_HG38.zipand extract toreferences/annotations_HG38/. -
Test Installation:
conda activate evscope_env fastqc --version STAR --version samtools --version python --version
bash EVscope.sh --sample_name <name> --input_fastqs <files> [options]Required Arguments:
--sample_name <name>: Unique sample identifier (used for output files).--input_fastqs <files>: Comma-separated FASTQ file paths (e.g.,R1.fq.gz,R2.fq.gzfor paired-end).
Optional Arguments:
| Option | Description | Default |
|---|---|---|
--threads <int> |
Number of CPU threads | 1 |
--run_steps <list> |
Steps to run (e.g., 1,3,5-8, all) |
all |
--skip_steps <list> |
Steps to skip (e.g., 2,4) |
None |
--circ_tool <tool> |
circRNA detection tool (CIRCexplorer2, CIRI2, both) |
both |
--gDNA_correction <yes|no> |
Apply genomic DNA correction | no |
--strandedness <strand> |
Library strandedness (forward, reverse, unstrand) |
reverse |
--config <path> |
Custom configuration file | EVscope.conf |
-V, --verbosity <level> |
Logging level (1=DEBUG, 2=INFO, 3=WARN, 4=ERROR) | 2 |
-h, --help |
Display help message | - |
-v, --version |
Show pipeline version | - |
bash EVscope.sh --sample_name Example_Data \
--input_fastqs R1.fq.gz,R2.fq.gz \
--threads 20 \
--run_steps all \
--gDNA_correction yes \
--strandedness reverse \
--verbosity 2- FASTQ Files: Gzipped, paired-end (
R1.fastq.gz,R2.fastq.gz) or single-end. - Sequencing Protocol: Optimized for SMARTer Stranded Total RNA-Seq Kit v3 (Pico Input) with 14-bp UMIs in Read2.
- Quality: High-quality reads suitable for EV RNA-seq.
| Step | Description |
|---|---|
| 1 | Raw FASTQ quality control using FastQC |
| 2 | UMI motif analysis and ACC motif fraction calculation |
| 3 | UMI extraction, adapter trimming, and read-through UMI removal |
| 4 | Quality control of trimmed FASTQs |
| 5 | Bacterial contamination screening (E. coli, Mycoplasma) using BBSplit |
| 6 | Two-pass STAR alignment with UMI deduplication |
| 7 | Library strandedness detection; splice/kb DNA contamination metric (primary source: STAR Log.final.out from Step 6) |
| 8 | CIRCexplorer2-based circular RNA detection |
| 9 | CIRI2-based circular RNA detection using BWA alignments |
| 10 | Merging of CIRCexplorer2 and CIRI2 circRNA results |
| 11 | RNA-seq metrics collection using Picard |
| 12 | featureCounts quantification (unique-mapping mode) |
| 13 | Genomic DNA-corrected featureCounts quantification |
| 14 | RSEM quantification (multi-mapping mode) |
| 15 | featureCounts-based expression matrix and RNA distribution plots |
| 16 | gDNA-corrected expression matrix |
| 17 | RSEM-based expression matrix |
| 18 | Genomic region read mapping analysis |
| 19 | Taxonomic classification using Kraken2 |
| 20-22 | Tissue deconvolution (featureCounts, gDNA-corrected, RSEM) |
| 23 | rRNA detection using ribodetector |
| 24 | Comprehensive quality control summary generation |
| 25 | Coverage analysis and bigWig generation using EMapper |
| 26 | Coverage density plots for RNA types and meta-gene regions |
| 27 | Final interactive HTML report generation |
Each sample generates an output directory with the following structure:
<sample_name>_EVscope_output/
βββ Step_01_Raw_QC/ # FastQC reports for raw reads
βββ Step_02_UMI_Analysis/ # UMI motif analysis and ACC fraction
βββ Step_03_Trimming/ # Trimmed FASTQ files
βββ Step_04_Trimmed_QC/ # FastQC reports for trimmed reads
βββ Step_05_Contamination_Filter/ # BBSplit contamination screening
βββ Step_06_Alignment_Refined/ # STAR alignment and UMI-deduplicated BAM
βββ Step_07_Strand_Detection/ # Strandedness and splice/kb metrics
βββ Step_08_CIRCexplorer2/ # CIRCexplorer2 circRNA results
βββ Step_09_CIRI2/ # CIRI2 circRNA results
βββ Step_10_circRNA_Merged/ # Merged circRNA results (CPM-normalized)
βββ Step_11_RNA_Metrics/ # Picard RNA-seq metrics
βββ Step_12_featureCounts_Quant/ # featureCounts gene quantification
βββ Step_13_gDNA_Corrected/ # gDNA-corrected quantification
βββ Step_14_RSEM_Quant/ # RSEM quantification
βββ Step_15_featureCounts_Expression/ # Expression matrices (TPM/CPM)
βββ Step_16_gDNA_Expression/ # gDNA-corrected expression matrices
βββ Step_17_RSEM_Expression/ # RSEM expression matrices
βββ Step_18_Genomic_Regions/ # Meta-gene region mapping stats
βββ Step_19_Taxonomy/ # Kraken2 taxonomic classification
βββ Step_20-22_Deconvolution/ # Tissue deconvolution results
βββ Step_23_rRNA_Detection/ # ribodetector rRNA detection
βββ Step_24_MultiQC_Summary/ # QC summary matrix
βββ Step_25_EMapper_BigWig_Quantification/ # EMapper coverage and bigWig
βββ Step_26_BigWig_Density_Plot/ # RNA type and meta-gene density plots
βββ Step_27_HTML_Report/ # Interactive HTML report
βββ EVscope_pipeline.log # Pipeline execution log
- Dependency Not Found: Verify Conda environments with
conda list -n evscope_env. - Reference File Missing: Check
EVscope.confpaths and file existence. - Memory Issues: Picard requires up to 250 GB RAM. Reduce
--threadsor use a high-memory server. - Step Failure: Review logs in
<output_dir>/EVscope_pipeline.log.
Q: Can EVscope process non-SMARTer-seq data?
A: Yes, modify UMI parameters in bin/Step_02_*.py and bin/Step_03_UMIAdapterTrimR1.py.
Q: How do I run specific pipeline steps?
A: Use --run_steps, e.g., --run_steps 1,3,5-8.
Q: How do I view the final report?
A: Open Step_27_HTML_Report/<sample_name>_final_report.html in a web browser.
We welcome contributions! To contribute:
- Fork the repository: https://github.com/TheDongLab/EVscope.
- Create a feature branch:
git checkout -b feature/YourFeature. - Submit a pull request.
Please report bugs via GitHub Issues.
If you use EVscope in your research, please cite:
Zhao, Yiyong, et al. "EVscope: A Comprehensive Bioinformatics Pipeline for Accurate and Robust Analysis of Total RNA Sequencing from Extracellular Vesicles." bioRxiv (2025). Zenodo: https://doi.org/10.5281/zenodo.15577788
Authors:
- Yiyong Zhao: Data curation, Formal analysis, Software, Visualization
- Himanshu Chintalapudi: Visualization
- Ziqian Xu: Resources
- Weiqiang Liu: Data curation
- Yuxuan Hu: Validation
- Ewa Grassin: Resources [supporting]
- Xianjun Dong: Conceptualization, Methodology, Funding, Supervision
Affiliations:
- Stephen & Denise Adams Center for Parkinson's Disease Research of Yale School of Medicine, New Haven, CT 06510, USA
- Department of Neurology, Yale School of Medicine, Yale University, New Haven, CT 06510, USA
- Aligning Science Across Parkinson's (ASAP) Collaborative Research Network, Chevy Chase, MD 20815, USA
- Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Harvard University, Boston, MA, USA
Funding:
- NIH Grants: 1R01NS124916, 1R24NS132738
- Aligning Science Across Parkinson's: ASAP-000301, ASAP-000529 (via Michael J. Fox Foundation)
Data Availability: Source code: https://github.com/TheDongLab/EVscope and Zenodo (https://doi.org/10.5281/zenodo.15577788) under MIT license. Raw sequencing data: NCBI SRA (accession: SRR31350808βSRR31350811).
Corresponding Author: Xianjun Dong (xianjun.dong@yale.edu)
EVscope is licensed under the MIT License.
- Xianjun Dong: xianjun.dong@yale.edu
- GitHub: https://github.com/TheDongLab/EVscope
