cfCOP

Note: cfCOP is currently in beta. We are actively improving and testing the tool, but it may still contain bugs or incomplete features. You are encouraged to use it and share feedback, but please use it with caution in critical environments.

cfCOP (cell-free Community observational Pipeline) is a computational workflow for pathogen identification from high-throughput cell-free DNA samples. It was developed specifically to analyze cell-free DNA samples sequenced on Illumina and Oxford Nanopore platforms.

Installation

Hardware prerequisites:

250 GB RAM
multiple CPUs are advantageous, but not obligatory
310 GB free hard drive storage (for the database)
temporary hard drive storage of approx. 4x the fastq.gz size (for analysis)
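
The storage figures above can be checked before starting a run. The following is a minimal sketch and not part of cfCOP: the function names are hypothetical, and applying the 4x factor to the summed size of all input files is an assumption based on the list above.

```python
import os
import shutil

TEMP_FACTOR = 4  # "ca. 4x of the fastq.gz size", taken from the list above


def required_temp_bytes(fastq_paths, factor=TEMP_FACTOR):
    """Estimate temporary storage as factor x the summed input file sizes."""
    return factor * sum(os.path.getsize(p) for p in fastq_paths)


def has_enough_space(fastq_paths, work_dir=".", factor=TEMP_FACTOR):
    """Return True if work_dir has at least the estimated free space."""
    free = shutil.disk_usage(work_dir).free
    return free >= required_temp_bytes(fastq_paths, factor)
```

Calling has_enough_space(["/path/to/file.fastq.gz"], work_dir="/your/work/dir") before a run gives an early warning; the estimate is only as good as the rule of thumb it is based on.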

Software prerequisites:

Linux OS (cfCOP was tested on Ubuntu distributions)
miniconda (see miniconda quick-installation manual here)
git (installation instructions here)

cfCOP is a Nextflow-based pipeline. All dependencies, including Nextflow, are provided in the conda environment file cfCOPenv.yml:

cd /your/path/to/cfCOP
conda env create -f cfCOPenv.yml

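Before using cfCOP, activate its environment. conda activate takes the environment name (the name field in cfCOPenv.yml, assumed here to be cfCOPenv), not the YAML file name:

conda activate cfCOPenv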

Connecting a database

To retrieve the cfCOP metagenomic database, use the command:

wget "https://www.dropbox.com/scl/fi/asu9s784zb2iy7cam7w4k/2024_12_07_cfCOPdb_export.tar.gz?rlkey=edlenq7i34mylmws7lfc5grcq&e=1&st=sn0nnbyf&dl=1" -O /your/path/to/2024_12_07_cfCOPdb_export.tar.gz

Unpack the database with:

tar -zxvf /your/path/to/2024_12_07_cfCOPdb_export.tar.gz

If no cfCOP database is connected, the tool will tell you so. To connect a database for the first time:

/your/path/to/run_cfCOP.py --connect_db /your/path/to/2024_12_07_cfCOPdb_export/

Once connected, all paths to the database files are stored in the nextflow.config file.

Please note that only the minimap2 index is provided. If you wish to use the BWA mapper for your analysis (preferably for Illumina rather than Nanopore data), you can create a BWA index on your machine from the provided fasta file with the command:

bwa index /your/path/to/combined_db.fasta

Afterwards, manually set the bwa_index parameter in the nextflow.config file to the path of the new index.
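
Before editing nextflow.config, it can help to confirm that indexing finished. The sketch below is not part of cfCOP and the function name is hypothetical; it relies only on the fact that bwa index writes files with the suffixes .amb, .ann, .bwt, .pac and .sa next to the input fasta.

```python
from pathlib import Path

# File suffixes written by `bwa index` next to the input fasta.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(fasta_path):
    """Return the expected BWA index files that do not exist yet."""
    fasta = Path(fasta_path)
    expected = [Path(str(fasta) + suffix) for suffix in BWA_INDEX_SUFFIXES]
    return [p for p in expected if not p.exists()]
```

An empty list from missing_bwa_index_files("/your/path/to/combined_db.fasta") suggests the index is complete and the fasta path can be used as the index prefix.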

Basic usage

cfCOP is run by invoking the run_cfCOP.py script.

The run_cfCOP.py script requires either:

one or more input fastq files of sequenced samples
a samplesheet (for the format, see the section "Running cfCOP with a samplesheet")

Do not supply both a samplesheet and additional samples; instead, add the additional samples to the samplesheet and supply only the samplesheet. Unless specified otherwise, cfCOP runs with the default parameters from nextflow.config and/or the samplesheet.

Running cfCOP with input fastq file(s)

Both fastq and fastq.gz files are accepted.

With a single fastq file:

/your/path/to/run_cfCOP.py --inputs /path/to/file.fastq

With multiple fastq files:

/your/path/to/run_cfCOP.py --inputs /path/to/files/file1.fastq /path/to/files/file2.fastq.gz

or with bash wildcards:

/your/path/to/run_cfCOP.py --inputs /path/to/files/file*.fastq.gz

If no mapper is given explicitly, cfCOP uses the default mapper from the nextflow.config file supplied with the pipeline. Use the --mapper option to choose a mapper for the analysis. Available choices: minimap2, bwa.

/your/path/to/run_cfCOP.py --inputs /path/to/file.fastq --mapper minimap2

If no sequencing platform is given explicitly, cfCOP assumes the data was produced on the default platform from the nextflow.config file supplied with the pipeline. Use the --platform option to set the sequencing platform for the analysis. Available choices: illumina, nanopore. The choice of platform influences the mapping mode.

/your/path/to/run_cfCOP.py --inputs /path/to/file.fastq --platform illumina
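
When cfCOP is launched from another script, the options described above can be assembled programmatically. This is a minimal sketch under assumptions: build_command is a hypothetical helper, not something shipped with cfCOP, and it uses only the --inputs, --mapper and --platform options documented in this section.

```python
import glob

MAPPERS = ("minimap2", "bwa")
PLATFORMS = ("illumina", "nanopore")


def build_command(script, patterns, mapper=None, platform=None):
    """Assemble an argument list for run_cfCOP.py from documented options."""
    if mapper is not None and mapper not in MAPPERS:
        raise ValueError(f"unknown mapper: {mapper}")
    if platform is not None and platform not in PLATFORMS:
        raise ValueError(f"unknown platform: {platform}")

    # Expand wildcards here, because no shell does it for us.
    inputs = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        # Like bash, keep the literal text when nothing matches.
        inputs.extend(matches if matches else [pattern])

    cmd = [script, "--inputs", *inputs]
    if mapper:
        cmd += ["--mapper", mapper]
    if platform:
        cmd += ["--platform", platform]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True). Omitting mapper or platform leaves the corresponding default from nextflow.config in effect.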

Running cfCOP with a samplesheet

/your/path/to/run_cfCOP.py --samplesheet /your/path/to/samplesheet.txt

The samplesheet overrides the default mapper and sequencing platform for each sample. It enables analysis of Illumina and Nanopore samples with both the bwa and minimap2 mappers in a single run.

The samplesheet is a tab-delimited file with Linux (LF) line endings. The column order is defined by the header.

Columns of the samplesheet are:

id (mandatory) - sample id. It should be unique to each sample

platform (mandatory) - sequencing platform. Choices: illumina, nanopore

mapper (mandatory) - mapper that should be used for the mapping step. Choices: minimap2, bwa

fastq1 (mandatory) - absolute path to the fastq or fastq.gz file

fastq2 (optional) - reverse reads for Illumina paired-end samples

sorted_sam (optional) - SAM file sorted by read name. If it is supplied, the mapping step is skipped.

For an example, see the provided template_samplesheet.txt.
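
A samplesheet with the columns listed above can also be generated from a script. The sketch below is illustrative only: write_samplesheet is a hypothetical helper, the column order is assumed to follow the list above, and leaving optional columns empty is an assumption that should be checked against template_samplesheet.txt.

```python
import csv

COLUMNS = ["id", "platform", "mapper", "fastq1", "fastq2", "sorted_sam"]
MANDATORY = COLUMNS[:4]


def write_samplesheet(path, samples):
    """Write samples (a list of dicts) as a tab-delimited file with LF endings."""
    seen_ids = set()
    for sample in samples:
        missing = [c for c in MANDATORY if not sample.get(c)]
        if missing:
            raise ValueError(f"missing mandatory columns: {missing}")
        if sample["id"] in seen_ids:
            raise ValueError(f"duplicate sample id: {sample['id']}")
        seen_ids.add(sample["id"])

    # newline="" stops Python from translating the "\n" terminator.
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=COLUMNS, delimiter="\t",
            lineterminator="\n", restval="",
        )
        writer.writeheader()
        writer.writerows(samples)
```

Each sample is a dict keyed by column name; optional keys such as fastq2 or sorted_sam may simply be left out.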

Command-line arguments

Argument	Description
`-h, --help`	Show this help message and exit.
`--samplesheet SAMPLESHEET`	Path to the sample sheet.
`--inputs INPUTS [INPUTS ...]`	List of input files (for multiple files).
`--mapper {minimap2,bwa}`	Specify the aligner to use. Choices are: `minimap2` or `bwa`.
`--platform {nanopore,illumina}`	Specify the sequencing platform of sample(s). Choices are: `nanopore` or `illumina`.
`--kmer_cutoff KMER_CUTOFF`	Report generation: K-mer index cutoff. Should be a float between 0 and 1.
`--count_cutoff COUNT_CUTOFF`	Report generation: Minimum number of reads assigned to a taxon for it to be reported. Should be an integer.
`--fraction_cutoff FRACTION_CUTOFF`	Report generation: Minimum fraction of reads (of total reads) assigned to a taxon for it to be reported. Should be a float between 0 and 1.
`--connect_db CONNECT_DB`	Connect to a database by filling the paths of the database files into the `nextflow.config`.
`--resume`	Resume the pipeline with changed parameters or for debugging purposes. Cached unchanged processes will not be re-run.
`--ro_report`	By default, cfCOP creates an html report of the current run. To suppress reporting, set this flag.
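
The value ranges in the table can be enforced at parse time. The following argparse sketch shows one way to do that for the three cutoff options; it is an illustration based on the table above and not the parser used in run_cfCOP.py. The names unit_float and make_parser are hypothetical.

```python
import argparse


def unit_float(text):
    """argparse type: a float restricted to the range 0 to 1."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError(f"{text} is not between 0 and 1")
    return value


def make_parser():
    """Build a parser that covers only the three cutoff options."""
    parser = argparse.ArgumentParser(description="sketch of the cutoff options")
    parser.add_argument("--kmer_cutoff", type=unit_float)
    parser.add_argument("--count_cutoff", type=int)
    parser.add_argument("--fraction_cutoff", type=unit_float)
    return parser
```

With this approach an out-of-range value such as --kmer_cutoff 1.5 is rejected with a usage error before any pipeline step starts.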

Advanced configuration

cfCOP can be further configured via the nextflow.config file. Changing this file changes the cfCOP default parameters. Below is a list of parameters that can be configured for the pipeline.

General Parameters

bwa_index
Path to the BWA index for alignment.
minimap2_index
Path to the Minimap2 index for alignment.
database_fasta
Path to the FASTA file for the database.
taxonomy
Path to the taxonomy database folder.

Filters

complexity_filter
Path to the complexity filter file.
homology_filter
Path to the homology filter file.
plasmid_filter
Path to the plasmid filter file.

Resource Parameters

threads
Number of threads for parallel processing.
threads_samtools
Number of threads for Samtools processing.
kmer_size
Size of the k-mer used in analysis.

Filtering and Reporting

taxrank
Taxonomic ranks to consider in reporting. Options: ["species", "genus", "family"].
platform
Sequencing platform used. Options: "illumina", "nanopore".
mapper
Alignment tool to be used. Options: "minimap2", "bwa".
kmer_cutoff
K-mer index cutoff for reporting. Must be a float between 0 and 1.
count_cutoff
Minimum number of reads assigned to a taxon for reporting.
fraction_cutoff
Minimum fraction of reads (relative to total reads) assigned to a taxon for reporting. Must be a float between 0 and 1.
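
To make the three reporting cutoffs concrete, the sketch below applies them to a list of per-taxon summaries. It rests on assumptions that the source does not confirm: all three cutoffs are treated as lower bounds, they are combined with a logical AND, and TaxonSummary with its kmer_index field is a hypothetical data structure. cfCOP's actual report logic may differ.

```python
from dataclasses import dataclass


@dataclass
class TaxonSummary:
    name: str
    read_count: int    # reads assigned to the taxon
    kmer_index: float  # hypothetical per-taxon k-mer index in [0, 1]


def reportable(taxa, total_reads, kmer_cutoff, count_cutoff, fraction_cutoff):
    """Keep taxa that meet all three cutoffs (assumed to be lower bounds)."""
    kept = []
    for taxon in taxa:
        fraction = taxon.read_count / total_reads if total_reads else 0.0
        if (taxon.read_count >= count_cutoff
                and fraction >= fraction_cutoff
                and taxon.kmer_index >= kmer_cutoff):
            kept.append(taxon)
    return kept
```

Raising any one cutoff can only shrink the reported list under these assumptions, which makes the parameters easy to tune one at a time.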

Input Parameters

input_files
Path to input files. Can be null or a list of paths.
samplesheet
Path to the sample sheet. Can be null or a valid file path.

Executor Settings

$sge.queueSize
Maximum number of queued jobs for SGE executor.
$local.cpus
Number of CPUs for local execution.
$local.memory
Memory allocation for local execution.

Profiles

`standard`

Uses the local executor.
Configuration:

process.executor = 'local'

`cluster`

Execution on a cluster (using PBSPro). To enable the cluster profile, use the -profile cluster option when running the pipeline. Adjust the cluster options (params.project_cluster_opts) as needed for your environment.

Executor settings:

process.executor
Specifies the executor to use. For the cluster profile, this is set to 'pbspro'.
params.project_cluster_opts
Cluster-specific options passed to the job scheduler.

For cluster execution, jobs are submitted according to their memory and time consumption.

Big Memory Jobs

cpus: params.threads
Number of CPUs allocated for big memory jobs.
memory: 300 GB
Memory allocated for big memory jobs.
time: 1h
Maximum runtime for big memory jobs.

Low Memory Jobs

cpus: 1
Number of CPUs allocated for low memory jobs.
memory: 10 GB
Memory allocated for low memory jobs.
time: 5h
Maximum runtime for low memory jobs.

Outputs

Read statistics (.readstat)

Read statistics include mapping and filtering statistics for each read. A readstat file consists of one row per read. Each row has 8 mandatory fields:

#flag - Status flag indicating the processing or filtering status of the read.
name - Unique identifier or name of the read.
taxid - Taxonomy ID assigned to the read based on mapping results.
length - Length of the read in base pairs (bp).
stats - General mapping statistics for the read, such as alignment quality or coverage.
complexity - Indicates whether the read was filtered due to low sequence complexity (e.g., 0 for pass, 1 for fail).
homology - Indicates whether the read was filtered due to homology to undesired sequences (e.g., host DNA).
plasmid - Indicates whether the read was identified as originating from a plasmid.

Example:

#flag	name	taxid	length	stats	complexity	homology	plasmid
0	read1	562	150	95.2	0	0	0
1	read2	9606	200	89.4	1	0	0
0	read3	1234	120	92.8	0	1	1

read1: Passed all filters and is assigned to taxid 562.
read2: Failed the complexity filter.
read3: Passed the complexity filter but failed the homology and plasmid filters.
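
The format above is straightforward to read from a script. This is a minimal sketch, assuming tab-delimited rows with exactly the 8 fields listed and a header line starting with #; parse_readstat and failed_filters are hypothetical helper names, and treating a value of 1 as "failed or flagged" follows the example rather than a formal specification.

```python
import csv
from collections import namedtuple

ReadStat = namedtuple(
    "ReadStat", "flag name taxid length stats complexity homology plasmid"
)


def parse_readstat(lines):
    """Yield one ReadStat per data row of a tab-delimited readstat file."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the "#flag ..." header
        yield ReadStat(*row[:8])  # all fields are kept as strings


def failed_filters(read):
    """Name the filter columns set to "1" (assumed to mean fail or hit)."""
    columns = ("complexity", "homology", "plasmid")
    return [c for c in columns if getattr(read, c) == "1"]
```

With an open file handle, list(parse_readstat(handle)) returns all reads. For the example above, failed_filters gives an empty list for read1 and the homology and plasmid columns for read3.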

Taxonomy statistics (.taxstat)

Taxonomy statistics summarize the read statistics at the taxon level.
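
The .taxstat layout is not described here, so the following is only an illustration of what summarizing at the taxon level can mean: counting, per taxid, the readstat rows whose three filter columns are all 0. The function name and the choice to skip filtered reads are assumptions, not cfCOP's documented behavior.

```python
from collections import Counter


def reads_per_taxid(rows):
    """Count reads per taxid, skipping rows where any filter column is "1"."""
    counts = Counter()
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if not fields[0] or fields[0].startswith("#"):
            continue  # blank line or "#flag ..." header
        taxid, filters = fields[2], fields[5:8]
        if "1" not in filters:
            counts[taxid] += 1
    return counts
```

The rows argument can be any iterable of readstat lines, for example an open file handle.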

Report

A compact overview of the detected taxa at the family, genus and species levels.

DiltheyLab/cfCOP
