Documentation of pipeline parameters is generated automatically from the pipeline schema and can no longer be found in markdown files.
- Introduction
- How to run the pipeline
- Samplesheet input
- Available genomes
- Proposed run modes
- Definition of structural parameters
- Additional customizable parameters
- Custom mutation calls
The typical command for running the pipeline is as follows:
nextflow run bbglab/deepCSA --outdir <OUTDIR> -profile <DESIRED PROFILE> --input samplesheet.csv
For more information on how to run Nextflow pipelines, see the more detailed explanation later in this document or check the Nextflow or nf-core community documentation.
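
If the pipeline is launched from a script, the same command can be assembled programmatically. The sketch below only builds and prints the command shown above; `build_command` is a hypothetical helper, and the `results` output directory and `docker` profile are placeholder values, not defaults confirmed by the pipeline.

```python
import shlex


def build_command(outdir, profile, samplesheet):
    """Assemble the typical deepCSA command as an argument list."""
    return [
        "nextflow", "run", "bbglab/deepCSA",
        "--outdir", outdir,
        "-profile", profile,      # Nextflow option, single dash
        "--input", samplesheet,   # pipeline parameter, double dash
    ]


# Placeholder values: replace them with your own output directory and profile.
command = build_command("results", "docker", "samplesheet.csv")
print(shlex.join(command))
```

Keeping the command as a list makes it easy to pass to `subprocess.run` without worrying about shell quoting.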
You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row as shown in the examples below.
--input '[path to samplesheet file]'
Example:
sample,vcf,bam
sample1,sample1.high.filtered.vcf,sample1.sorted.bam
sample2,sample2.high.filtered.vcf,sample2.sorted.bam

| Column | Description |
|---|---|
| sample | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Sample names cannot contain dots (.). Ideally the sample name should look like a text string; it should not be a single number. |
| vcf | Full path to the VCF file containing all the mutations called in your sample. It should be uncompressed, and its FORMAT field must comply with the expected format. See custom mutation calls below in case the input does not come from deepUMIcaller. |
| bam | Full path to the BAM file containing the duplex aligned reads that were used for the variant calling. |
An example samplesheet has been provided with the pipeline.
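
The rules above can be checked before launching a run. The following is a minimal sketch, not part of deepCSA: `validate_samplesheet` is a hypothetical helper that only encodes the constraints stated in the table (three columns, the `sample,vcf,bam` header, no dots in sample names, no purely numeric names, uncompressed VCF). The pipeline itself may apply additional checks.

```python
import csv
import io
import re

EXPECTED_HEADER = ["sample", "vcf", "bam"]


def validate_samplesheet(handle):
    """Return a list of human-readable problems found in a samplesheet."""
    problems = []
    reader = csv.reader(handle)
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        problems.append(f"header should be {EXPECTED_HEADER}, found {header}")
        return problems
    for line_number, row in enumerate(reader, start=2):
        if len(row) != 3:
            problems.append(f"line {line_number}: expected 3 columns, found {len(row)}")
            continue
        sample, vcf, bam = (field.strip() for field in row)
        if "." in sample:
            problems.append(f"line {line_number}: sample name '{sample}' contains a dot")
        if re.fullmatch(r"\d+", sample):
            problems.append(f"line {line_number}: sample name '{sample}' is a bare number")
        if vcf.endswith(".gz"):
            problems.append(f"line {line_number}: VCF '{vcf}' looks compressed")
        if not vcf or not bam:
            problems.append(f"line {line_number}: vcf and bam paths must both be set")
    return problems


# The second data row deliberately breaks the "no dots" rule.
example = io.StringIO(
    "sample,vcf,bam\n"
    "sample1,sample1.high.filtered.vcf,sample1.sorted.bam\n"
    "sample.2,sample2.high.filtered.vcf,sample2.sorted.bam\n"
)
for problem in validate_samplesheet(example):
    print(problem)
```

For a real file, pass an open handle instead of the in-memory example, e.g. `validate_samplesheet(open("samplesheet.csv", newline=""))`.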
The deepCSA pipeline relies heavily on the bgreference and bgdata tools, so its use is limited to the genomes available in these packages. In particular, the default containers already have the hg38 and mm39 genomes cached. If you want to use any other genome, open an issue and we will address it as soon as we can.
There are 4 different ways of running the pipeline, each serving a specific purpose. A list of expected outputs is provided in each run mode section.
It will provide:
- Definition of regions to analyze
- Depth per sample and/or per gene
- Somatic mutations
- Needle plots
- Mutational profile
- Mutational signatures
params {
plot_depths = true
signatures = true
profileall = true
}
It will provide:
- All the previously described outputs plus...
- Mutation density
- Positive selection per gene (multiple positive selection metrics)
  - Per gene, all samples together
  - Per gene, per group of samples
  - Per gene, per sample
params {
mutationdensity = true
profileall = true
oncodrivefml = true
oncodriveclustl = true
oncodrive3d = true
o3d_raw_vep = true
o3d_plot = true
omega = true
omega_multi = true
omega_globalloc = true
omega_mutabilities = true
site_comparison_grouping = 'all'
omega_plot = true
omega_withingene = true
omega_autodomains = true
omega_autoexons = true
mutated_cells_vaf = true
mutepi_genes_to_recode = null
indels = true
signatures = true
}
Mutational processes in alternative genomic regions. Partial run with a focus on mutational processes/signatures.
It will provide:
- Same outputs as the initial run (although these can be ignored)
- Mutational profile and mutational signatures based on:
  - All genomic regions
  - Only exonic regions
  - Only non-protein-affecting regions (synonymous mutations, intronic and intergenic regions)
  - Intronic and intergenic regions
params {
mutationdensity = true
profileall = true
profilenonprot = true
profileexons = true
profileintrons = true
signatures = true
}
Interindividual variability and sample comparison. Complete run with downstream steps that compute linear regressions to compare different samples/groups based on clinical variables or sample metadata.
It will provide:
- Same outputs as the complete clonal structure definition run, plus:
- Computation of univariate and multivariate linear regressions between clonal structure metrics and clonal selection
params {
mutationdensity = true
profileall = true
omega = true
omega_multi = true
omega_globalloc = true
omega_mutabilities = true
site_comparison_grouping = 'all'
omega_plot = true
omega_withingene = true
omega_autodomains = true
omega_autoexons = true
regressions = true
// additional regression parameters, see nextflow_schema.json for more info
...
}
- Container pulling (either prior to running the pipeline or directly as the pipeline runs)
- Generation of Oncodrive3D datasets (see: Oncodrive3D repo datasets building process)
- Download of additional specific datasets
  - Ensembl VEP (see: Ensembl VEP docs). Modify the VEP parameters in your nextflow.config accordingly (vep_cache, vep_cache_version, etc.).
  - CADD scores (see: CADD downloads page, "All possible SNVs of GRCh38/hg38" file)
  - COSMIC signatures (see: COSMIC signatures downloads page; select context size = 96 and your species of interest)
- Provide custom domain definition file.
See File formatting docs for more details on the structure of files that can be provided to deepCSA.
params {
fasta = null
cosmic_ref_signatures = "COSMIC_v3.4_SBS_GRCh38.txt"
wgs_trinuc_counts = "assets/trinucleotide_counts/trinuc_counts.homo_sapiens.tsv"
// oncodrivefml (only for human; could be adapted to others)
cadd_scores = "CADD/v1.7/hg38/whole_genome_SNVs.tsv.gz"
cadd_scores_ind = "CADD/v1.7/hg38/whole_genome_SNVs.tsv.gz.tbi"
// dnds
dnds_ref_transcripts = "RefCDS_human_latest_intogen.rda"
dnds_covariates = "covariates_hg19_hg38_epigenome_pcawg.rda"
// oncodrive3d + fancy plots
datasets3d = "oncodrive3d/datasets"
annotations3d = "oncodrive3d/annotations"
domains_file = "pfam.tsv"
vep_cache = ".vep"
// Ensembl VEP for homo_sapiens, but should be adjusted accordingly to species and cache version
vep_genome = "GRCh38"
vep_species = "homo_sapiens"
vep_cache_version = 111
vep_out_format = "tab"
vep_params = "--no_stats --cache --offline --symbol --protein --canonical --af_gnomadg --af_gnomade"
vep_params_panel = "--no_stats --cache --offline --symbol --protein --canonical"
}
See File formatting docs for more details on the structure of files that can be provided to deepCSA.
params {
// definition of gene groups
// could be fixed or dynamic based on the study
custom_groups = false
custom_groups_file = null
custom_groups_separator = 'tab'
// customize the annotation of certain regions i.e. TERT promoter mutations, other non-coding drivers...
customize_annotation = false
custom_annotation_tsv = ''
// define a set of common known hotspots
hotspots_annotation = false
hotspots_definition_file = ''
// definition of specific regions within genes with specific interest on computing dN/dS
omega_subgenic_bedfile = null
// define a file of mutations that should not be trusted
// and you want to remove from all the analysis
blacklist_mutations = null
}
These files identify sites overlapping common SNPs and noisy or variable genomic regions, as described in Abascal et al., 2021 and used in the Nanoseq pipeline. Two BED files are available:
- Nanoseq SNP: Common SNP positions that should be excluded from analysis
- Nanoseq Noise: Regions with high noise or variability
Both files are available for GRCh37 and GRCh38 at the shared folder from Iñigo Martincorena's group, at the Wellcome Sanger Institute.
In addition to the files that can be provided as input through the optional file parameters, there are further parameters that allow specific tuning of the analysis.
Several depth thresholds can be defined in the pipeline. They are listed below from the strictest to the least strict.
- consensus_panel_min_depth = 500
For a given genomic position to be included in the so called "consensus panel" this position needs to have a depth of at least consensus_panel_min_depth in at least 80% of the samples. This should always be the highest value among all the depth thresholds and it should be big enough to classify a mutation as somatic vs germline. It should be at least 40.
- sample_panel_min_depth = 40
This value controls the creation of sample-specific panels, which capture the genomic positions that have been sequenced to at least this depth in each specific sample. It should be big enough to classify a mutation as somatic vs germline. It should be at least 40.
- mutation_depth_threshold = 40
This value is used for filtering the mutations by depth: a mutation that does not reach this minimum sequencing depth will not be kept for further analysis. This value should be big enough to classify a mutation as somatic vs germline and to compute a trustworthy mutation frequency. It should be at least 40.
- use_custom_minimum_depth = 0
This value is the least stringent depth threshold and is used in the first step of computing the positions that may be part of the so-called "panels". It indicates the minimum average depth that a given position must have to be kept for the subsequent depth analysis and definition of panels. Its main use is to reduce the size of the files that are processed afterwards. It can safely be set to 20 or more.
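
To make the consensus panel rule concrete, here is a minimal sketch of the criterion as described above: a position is kept when its depth reaches `consensus_panel_min_depth` in at least 80% of the samples. The functions `consensus_positions` and `threshold_warnings` are hypothetical illustrations, not the pipeline's actual implementation, and the depth values are made up.

```python
def consensus_positions(depths, min_depth=500, min_sample_fraction=0.8):
    """Select positions deep enough in enough samples.

    depths maps a position label to the list of per-sample depths.
    """
    kept = []
    for position, sample_depths in depths.items():
        passing = sum(1 for depth in sample_depths if depth >= min_depth)
        if passing >= min_sample_fraction * len(sample_depths):
            kept.append(position)
    return kept


def threshold_warnings(consensus_panel_min_depth, sample_panel_min_depth,
                       mutation_depth_threshold):
    """Check the recommendations given in the text for the three thresholds."""
    named = {
        "consensus_panel_min_depth": consensus_panel_min_depth,
        "sample_panel_min_depth": sample_panel_min_depth,
        "mutation_depth_threshold": mutation_depth_threshold,
    }
    warnings = [
        f"{name} = {value} is below the recommended minimum of 40"
        for name, value in named.items()
        if value < 40
    ]
    if consensus_panel_min_depth < max(named.values()):
        warnings.append("consensus_panel_min_depth should be the highest threshold")
    return warnings


# Made-up depths for two positions across five samples.
example_depths = {
    "chr1:100": [600, 550, 700, 510, 120],  # 4 of 5 samples reach 500
    "chr1:101": [600, 300, 700, 200, 120],  # only 2 of 5 samples reach 500
}
print(consensus_positions(example_depths))
print(threshold_warnings(500, 40, 40))
```

With five samples, a position needs at least four samples at or above the threshold, so only the first position in the example is retained.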
If you want to run deepCSA with your own mutation calls, this is also possible. Reasons behind this would be:
- the variant calling was not done using deepUMIcaller.
- you came up with a set of mutations that you trust and want to force them as the ones to be used for the analysis.
For this, you will need to generate one VCF file per sample in the format expected by deepCSA. A script for this purpose is available in the deepCSA repository at the following relative path:
assets/useful_scripts/deepcsa_maf2samplevcfs.py
The script itself contains this brief explanation on the usage and required parameters:
#######
# This script converts a mutations file (TSV format) to one or multiple VCF-formatted files.
#######
#######
# Usage:
#######
## If your sample names are NOT in a column called SAMPLE_ID,
## you can use the --sample-name-column option to specify it.
# if the maf is from deepCSA, use this one
# usage: python deepcsa_maf2samplevcfs.py --mutations-file all_samples.somatic.mutations.tsv --output-dir ./test/ --maf-from-deepcsa
# if the maf file is not from deepCSA, use this one
# usage: python deepcsa_maf2samplevcfs.py --mutations-file all_samples.somatic.mutations.tsv --output-dir ./test/
#######
# Mandatory columns in input mutations:
#######
# if the maf is from deepCSA, it must contain the following columns, as they were originally generated
# ['CHROM', 'POS', 'REF', 'ALT', 'FILTER', 'INFO', 'FORMAT', 'SAMPLE']
# if the maf file is not from deepCSA, then it MUST contain the following columns
# ['CHROM', 'POS', 'REF', 'ALT', 'DEPTH', 'ALT_DEPTH']
# where:
# DEPTH indicates the total number of duplex reads sequenced at the position where the mutation occurs
# ALT_DEPTH indicates the total number of duplex reads supporting the variant at the same position
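
Before running the conversion script on a mutations file that does not come from deepCSA, it can help to confirm that the mandatory columns are present. The sketch below is an illustration only: `check_mutations_table` is a hypothetical helper, the example rows are made up, and the check that `ALT_DEPTH` does not exceed `DEPTH` is an extra sanity check that follows from the column definitions rather than something the script is known to enforce.

```python
import csv
import io

REQUIRED_COLUMNS = ["CHROM", "POS", "REF", "ALT", "DEPTH", "ALT_DEPTH"]


def check_mutations_table(handle, sample_column="SAMPLE_ID"):
    """Return a list of problems found in a tab-separated mutations file."""
    reader = csv.DictReader(handle, delimiter="\t")
    present = reader.fieldnames or []
    problems = [
        f"missing column: {column}"
        for column in REQUIRED_COLUMNS + [sample_column]
        if column not in present
    ]
    if problems:
        return problems
    for line_number, row in enumerate(reader, start=2):
        # Supporting reads cannot outnumber the total reads at the position.
        if int(row["ALT_DEPTH"]) > int(row["DEPTH"]):
            problems.append(f"line {line_number}: ALT_DEPTH exceeds DEPTH")
    return problems


# Made-up example with two mutations from one sample.
example = io.StringIO(
    "SAMPLE_ID\tCHROM\tPOS\tREF\tALT\tDEPTH\tALT_DEPTH\n"
    "sample1\tchr1\t12345\tC\tT\t850\t3\n"
    "sample1\tchr2\t67890\tG\tA\t40\t52\n"
)
for problem in check_mutations_table(example):
    print(problem)
```

If your sample names live in a different column, pass its name through `sample_column`, mirroring the script's `--sample-name-column` option.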
Make sure to prepare the input.csv file so that each sample is matched with its correct VCF and BAM files.
If you want to run deepCSA as a basic user and ensure that mutations are properly filtered, stop here.
If you are following these steps to run deepCSA with a set of mutations that you have already filtered and trust, there is one last thing to do.
When running the pipeline you should set the following parameters:
params {
no_filter = true
filter_criteria = []
filter_criteria_somatic = []
}