Motivation

Adapter trimming is the first step for analyzing small RNA sequencing data where reads are longer than target RNAs with lengths ranging from 18 to 30 bp. There is a lack of tools for accurately identifying adapters from raw reads. Moreover, the use of randomized adapters to reduce ligation biases in small RNA-seq library preparation makes adapter detection even more challenging.

About FindAdapt

FindAdapt is a Python package for identifying adapters for small RNA sequencing data without relying on prior information.

Installation

FindAdapt is a stand-alone Python package (python >=3.6).

Download and uncompress

wget https://github.com/chc-code/findadapt/archive/refs/heads/master.zip
unzip master.zip  # the output folder will be findadapt-master

# use FindAdapt
cd findadapt-master
./findadapt  -h

The installation of pyahocorasick is optional, but recommended.

# install pyahocorasick
pip install pyahocorasick

Docker / Singularity image

A docker image is also available at https://hub.docker.com/r/chccode/findadapt (pyahocorasick is contained, and the findadapt script is set as the entrypoint.)

docker pull chccode/findadapt

# get the help information if no arguments are specified
docker run chccode/findadapt

# suppose your fastq file is under /data/folder1/folder2/reads.fastq.gz
docker run -v /data:/data chccode/findadapt /data/folder1/folder2/reads.fastq.gz

You can also use Singularity if docker is not available

singularity build findadapt.sif docker://chccode/findadapt

# get the help information if no arguments are specified
singularity run findadapt.sif

# suppose your fastq file is under /data/folder1/folder2/reads.fastq.gz
docker run -B /data findadapt.sif /data/folder1/folder2/reads.fastq.gz

SYNOPSIS

  #identify adapters for the fastq file from human
  findadapt reads.fastq.gz
 
  #identify adapters for the fastq file from mice
  findadapt reads.fastq.gz -organism mouse
  
  #list all the organisms that FindAdapt supports
  findadapt -list_org
  
  # identify adapters for a list of fastq files  from mice 
  findadapt -fn_fq_list fq_list.txt -organism mouse

  # identify and trim adapters using cutadapt package 
  findadapt reads.fastq.gz -pw_cutadapt path/to/cutadapt -cut

COMMANDS AND OPTIONS

File input

Users: can only select one option (either -fq or -prj) as the input

fn_fq_file Optional positional argument, the path for single fastq file
-fn_fq_list / -list / -l file_list a tab-delimited file, containing the list of fastq files. column1 = study ID, column2 = path of the fastq file.

Reference sequences

either a list of sequences (fasta format or one sequence per line) by '-fn_refseq' or organism name by -organism

-fn_refseq filename a list of sequences in fasta format or one sequence per line.
-organism / -org str organism name (such as human, mouse, fruitfly, worm, arabidopsis, rice or the miRBase prefix, such as hsa, mmu, dme, cel, ath, osa); default: human.
-list_org list the supported organisms

Output options

-o prefix,str, optional, the prefix for the output results, if not specified, will infer from the input file
-quiet / -q , toggle, suppress the warning message if pyahocorasick not installed
-cut / -cutadapt/ -trim flag, run the cutadapt process; require the cutadapt already installed and available in PATH
-pw_cutadapt str the path of cutadapt, the default is from PATH
-v / -verbose flag, display the log information in the terminal

Other Options

-expected_adapter_len int the length of adapter sequence, default = 12 bp
-max_random_linker int the maximum length of random-mer, default = 8 bp
-nreads int the maximum number of reads used to find adapter, default: 1 million, if use all reads, set as -1
-nsam int the number of samples foradapter identification in a file list, default is all samples. Only valid when -fn_fq_list is specified
-thres_multiplier float the threshold of the ratio between the count of the child and the count of the parent, default=1.2; if >1.2, save the child record; otherwise, save the parent record
-min_reads int the minimum number of matched reads for adapter identification, default=30. if lower than this value, the adapter identification will fail and users may need to check the reference settings.
-threads / -cpu int the number of threads, default = 5.
-enough_reads int the number of matched reads for adapter identification, default=1000
-f -force flag, force rerun the analysis, ignoring the exisiting parsed reads, can be useful when use a new reference.

Examples

We provided several fastq files from three studies

GSE106303, the adapter sequence is not specified in the GEO database or the literature
GSE122068, generated by NextFLEX library preparation kit where reads have 4N random sequence at both the 5' and 3' ends
GSE137617, generated by SMARTer library preparation kit where multiple (usually 3 nt) random bps at the 5' end and polyA as the 3' adapter sequence

To identify adapter sequences
./findadapt <fn_fq>

for example, GSE122068.nextflex.SRR8144939.truncated.fastq.gz
./findadapt ./demo/GSE122068.nextflex.SRR8144939.truncated.fastq.gz

Output Format

log information

2023-09-08 08:13:02  INFO   <module>              line: 1683   1/1: single - using 1/ 1 fq files
2023-09-08 08:13:02  INFO   get_adapter_per_prj   line: 1076   	processing GSE122068.nextflex.SRR8144939.truncated.fastq.gz
2023-09-08 08:13:02  INFO   get_parsed_reads      line: 834    matched reads found: 1177
2023-09-08 08:13:02  INFO   export_data           line: 1229   	most possible kit = NEXTflex
2023-09-08 08:13:02  INFO   export_data           line: 1289   result per-prj = GSE122068.nextflex.SRR8144939.truncated.adapter.txt
2023-09-08 08:13:02  INFO   export_data           line: 1290   result per-fq = GSE122068.nextflex.SRR8144939.truncated.per_fq.adapter.txt

.adapter.txt

The output contains the following columns: Prj: The output prefix, if the input is a single fastq file rather than a fastq file list (-fn_fq_list), it will be "single" total_reads: Total matched reads used for adapter identification 3p_seq: The sequence of 3' adapter 3p_phase: the random sequence length before 3' adapter 3p_count / 3p_ratio: The number and ratio of reads supporting this 3' adapter sequence and random sequence length 5p_phase: the random sequence length before the insert 5p_count / 5p_ratio: The number and ratio of reads supporting this 5' random sequence length err: the error information if fail to get the adapter sequence,

prj	total_reads	3p_seq	3p_phase	3p_count	3p_ratio	5p_phase	5p_count	5p_ratio	err
single	1177	TGGAATTCTCGG	4	1021	0.8667	4	1143	0.9711

.per_fq.adapter.txt

The detail adapter information of each input fastq file

prj	fastq	total_reads	side	sn	seq	phase	count	ratio
single	GSE122068.nextflex.SRR8144939.truncated	1177	3p	1	TGGAATTCTCGG	4	1021	0.8675
single	GSE122068.nextflex.SRR8144939.truncated	1177	3p	2	CTGGAATTCTCG	3	633	0.5378
single	GSE122068.nextflex.SRR8144939.truncated	1177	5p	1		4	1143	0.9711

Trim the adapter using cutadapt

Users can remove the adapter using the identified pattern by specifying -cut Or use the output to build their own cutadapt command.

# if 3p_seq is empty and 5p_phase > 0:
cutadapt -u {5p_phase} -m 15 -j 8  --trim-n {fn_fq} -o {fn_out}

# elif 3p seq is not empty and 5p_phase = 3p_phase = 0
cutadapt -a {seq_3p} -m 15 -j 8  --trim-n  {fn_fq} -o {fn_out}

# if 3p_phase > 0 and 5p_phase == 0
cutadapt -a {seq_3p} -j 8 --trim-n  {fn_fq} |cutadapt -u -{3p_phase} -m 15 -o {fn_out}

# if 3p_phase = 0 and 5p_phase > 0
cutadapt -a {seq_3p} -j 8 --trim-n  {fn_fq} |cutadapt -u {5p_phase} -m 15 -o {fn_out}

# if 3p_phase > 0 and 5p_phase > 0
cutadapt -a {seq_3p} -j 8 --trim-n  {fn_fq} |cutadapt -u -{3p_phase} -u {5p_phase} -m 15 -o {fn_out}

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
data		data
demo		demo
utils		utils
.gitignore		.gitignore
.nogitbig		.nogitbig
LICENSE		LICENSE
README.md		README.md
findadapt		findadapt
simulation_detail.md		simulation_detail.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Motivation

About FindAdapt

Installation

Download and uncompress

Docker / Singularity image

SYNOPSIS

COMMANDS AND OPTIONS

File input

Reference sequences

Output options

Other Options

Examples

Output Format

log information

.adapter.txt

.per_fq.adapter.txt

Trim the adapter using cutadapt

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

chc-code/findadapt

Folders and files

Latest commit

History

Repository files navigation

Motivation

About FindAdapt

Installation

Download and uncompress

Docker / Singularity image

SYNOPSIS

COMMANDS AND OPTIONS

File input

Reference sequences

Output options

Other Options

Examples

Output Format

log information

.adapter.txt

.per_fq.adapter.txt

Trim the adapter using cutadapt

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages