Adapter trimming is the first step for analyzing small RNA sequencing data where reads are longer than target RNAs with lengths ranging from 18 to 30 bp. There is a lack of tools for accurately identifying adapters from raw reads. Moreover, the use of randomized adapters to reduce ligation biases in small RNA-seq library preparation makes adapter detection even more challenging.
FindAdapt is a Python package for identifying adapters for small RNA sequencing data without relying on prior information.
FindAdapt is a stand-alone Python package (python >=3.6).
wget https://github.com/chc-code/findadapt/archive/refs/heads/master.zip
unzip master.zip # the output folder will be findadapt-master
# use FindAdapt
cd findadapt-master
./findadapt -h
The installation of pyahocorasick is optional, but recommended.
# install pyahocorasick
pip install pyahocorasick
A docker image is also available at https://hub.docker.com/r/chccode/findadapt
(pyahocorasick is contained, and the findadapt script is set as the entrypoint.)
docker pull chccode/findadapt
# get the help information if no arguments are specified
docker run chccode/findadapt
# suppose your fastq file is under /data/folder1/folder2/reads.fastq.gz
docker run -v /data:/data chccode/findadapt /data/folder1/folder2/reads.fastq.gz
You can also use Singularity if docker is not available
singularity build findadapt.sif docker://chccode/findadapt
# get the help information if no arguments are specified
singularity run findadapt.sif
# suppose your fastq file is under /data/folder1/folder2/reads.fastq.gz
singularity run -B /data findadapt.sif /data/folder1/folder2/reads.fastq.gz
# identify adapters for a fastq file from human (the default organism)
findadapt reads.fastq.gz
# identify adapters for a fastq file from mouse
findadapt reads.fastq.gz -organism mouse
# list all the organisms that FindAdapt supports
findadapt -list_org
# identify adapters for a list of fastq files from mice
findadapt -fn_fq_list fq_list.txt -organism mouse
# identify and trim adapters using cutadapt package
findadapt reads.fastq.gz -pw_cutadapt path/to/cutadapt -cut
Users can select only one option (either -fq or -prj) as the input.
- `fn_fq_file`: optional positional argument, the path of a single fastq file.
- `-fn_fq_list / -list / -l file_list`: a tab-delimited file containing the list of fastq files. column1 = study ID, column2 = path of the fastq file.
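
The two-column file list described above can be generated with a few lines of Python. This is a minimal sketch: the helper name `write_fq_list`, the study ID, and the fastq paths are hypothetical, and it assumes the list file has no header row, which the description does not state either way.

```python
import csv


def write_fq_list(entries, fn_out):
    """Write (study_id, fastq_path) pairs as a two-column tab-delimited file.

    Assumption: no header row is written; column1 = study ID,
    column2 = path of the fastq file.
    """
    with open(fn_out, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for study_id, fastq_path in entries:
            writer.writerow([study_id, str(fastq_path)])


# Hypothetical study ID and paths, for illustration only.
entries = [
    ("study1", "/data/folder1/folder2/sample_a.fastq.gz"),
    ("study1", "/data/folder1/folder2/sample_b.fastq.gz"),
]
write_fq_list(entries, "fq_list.txt")
```

The resulting file could then be passed as `findadapt -fn_fq_list fq_list.txt`.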
Reference: specify either a list of sequences (fasta format or one sequence per line) with -fn_refseq, or an organism name with -organism.
- `-fn_refseq filename`: a list of sequences in fasta format or one sequence per line.
- `-organism / -org str`: organism name (such as human, mouse, fruitfly, worm, arabidopsis, rice) or the miRBase prefix (such as hsa, mmu, dme, cel, ath, osa); default: human.
- `-list_org`: list the supported organisms.
- `-o prefix`: str, optional; the prefix for the output results. If not specified, it is inferred from the input file.
- `-quiet / -q`: toggle; suppress the warning message shown when pyahocorasick is not installed.
- `-cut / -cutadapt / -trim`: flag; run the cutadapt process. Requires cutadapt to be installed and available in PATH.
- `-pw_cutadapt str`: the path of cutadapt; by default it is taken from PATH.
- `-v / -verbose`: flag; display the log information in the terminal.
- `-expected_adapter_len int`: the length of the adapter sequence, default = 12 bp.
- `-max_random_linker int`: the maximum length of the random-mer, default = 8 bp.
- `-nreads int`: the maximum number of reads used to find the adapter, default: 1 million. To use all reads, set it to -1.
- `-nsam int`: the number of samples used for adapter identification in a file list; the default is all samples. Only valid when -fn_fq_list is specified.
- `-thres_multiplier float`: the threshold for the ratio between the count of the child and the count of the parent, default = 1.2. If the ratio is > 1.2, the child record is saved; otherwise, the parent record is saved.
- `-min_reads int`: the minimum number of matched reads for adapter identification, default = 30. If the number is lower than this value, adapter identification fails and users may need to check the reference settings.
- `-threads / -cpu int`: the number of threads, default = 5.
- `-enough_reads int`: the number of matched reads for adapter identification, default = 1000.
- `-f / -force`: flag; force a rerun of the analysis, ignoring the existing parsed reads. This can be useful when a new reference is used.
We provided several fastq files from three studies
- GSE106303, the adapter sequence is not specified in the GEO database or the literature
- GSE122068, generated by NextFLEX library preparation kit where reads have 4N random sequence at both the 5' and 3' ends
- GSE137617, generated by SMARTer library preparation kit where reads have multiple (usually 3 nt) random bases at the 5' end and polyA as the 3' adapter sequence
To identify adapter sequences
./findadapt <fn_fq>
for example, GSE122068.nextflex.SRR8144939.truncated.fastq.gz
./findadapt ./demo/GSE122068.nextflex.SRR8144939.truncated.fastq.gz
2023-09-08 08:13:02 INFO <module> line: 1683 1/1: single - using 1/ 1 fq files
2023-09-08 08:13:02 INFO get_adapter_per_prj line: 1076 processing GSE122068.nextflex.SRR8144939.truncated.fastq.gz
2023-09-08 08:13:02 INFO get_parsed_reads line: 834 matched reads found: 1177
2023-09-08 08:13:02 INFO export_data line: 1229 most possible kit = NEXTflex
2023-09-08 08:13:02 INFO export_data line: 1289 result per-prj = GSE122068.nextflex.SRR8144939.truncated.adapter.txt
2023-09-08 08:13:02 INFO export_data line: 1290 result per-fq = GSE122068.nextflex.SRR8144939.truncated.per_fq.adapter.txt
The output contains the following columns:
- prj: the output prefix. If the input is a single fastq file rather than a fastq file list (-fn_fq_list), it will be "single".
- total_reads: total matched reads used for adapter identification.
- 3p_seq: the sequence of the 3' adapter.
- 3p_phase: the random sequence length before the 3' adapter.
- 3p_count / 3p_ratio: the number and ratio of reads supporting this 3' adapter sequence and random sequence length.
- 5p_phase: the random sequence length before the insert.
- 5p_count / 5p_ratio: the number and ratio of reads supporting this 5' random sequence length.
- err: the error information if FindAdapt fails to get the adapter sequence.

| prj | total_reads | 3p_seq | 3p_phase | 3p_count | 3p_ratio | 5p_phase | 5p_count | 5p_ratio | err |
|---|---|---|---|---|---|---|---|---|---|
| single | 1177 | TGGAATTCTCGG | 4 | 1021 | 0.8667 | 4 | 1143 | 0.9711 | |

The detailed adapter information for each input fastq file:

| prj | fastq | total_reads | side | sn | seq | phase | count | ratio |
|---|---|---|---|---|---|---|---|---|
| single | GSE122068.nextflex.SRR8144939.truncated | 1177 | 3p | 1 | TGGAATTCTCGG | 4 | 1021 | 0.8675 |
| single | GSE122068.nextflex.SRR8144939.truncated | 1177 | 3p | 2 | CTGGAATTCTCG | 3 | 633 | 0.5378 |
| single | GSE122068.nextflex.SRR8144939.truncated | 1177 | 5p | 1 | | 4 | 1143 | 0.9711 |
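
To make the meaning of 3p_seq, 3p_phase, and 5p_phase concrete, the sketch below applies the values reported above (TGGAATTCTCGG, 4, 4) to a single made-up read in plain Python. It is only an illustration of how the three values relate to the read layout. The function name `trim_read` and the example read are hypothetical, and the exact string match used here is a simplification, not how FindAdapt or cutadapt locate adapters.

```python
def trim_read(read, seq_3p, phase_3p, phase_5p):
    """Illustrative trimming of one read.

    Layout assumed: [5' random bases][insert][3' random bases][3' adapter]...
    Assumption: the adapter is found by exact string match only.
    """
    # Cut the read at the first occurrence of the 3' adapter, if present.
    if seq_3p:
        idx = read.find(seq_3p)
        if idx >= 0:
            read = read[:idx]
    # Drop the random bases that sat directly before the 3' adapter.
    read = read[:max(len(read) - phase_3p, 0)]
    # Drop the random bases before the insert.
    return read[phase_5p:]


# Made-up read: 4 random bases + 22 nt insert + 4 random bases + adapter + extra bases.
insert = "TGAGGTAGTAGGTTGTATAGTT"
read = "ACGT" + insert + "GTCA" + "TGGAATTCTCGG" + "GTGCCAAGG"

print(trim_read(read, "TGGAATTCTCGG", 4, 4))
```

With these inputs the function prints the 22 nt insert sequence.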
Users can remove the adapter using the identified pattern by specifying -cut, or use the output to build their own cutadapt command:
# if 3p_seq is empty and 5p_phase > 0
cutadapt -u {5p_phase} -m 15 -j 8 --trim-n {fn_fq} -o {fn_out}
# if 3p_seq is not empty and 3p_phase == 0 and 5p_phase == 0
cutadapt -a {seq_3p} -m 15 -j 8 --trim-n {fn_fq} -o {fn_out}
# if 3p_phase > 0 and 5p_phase == 0 (the trailing "-" makes the second cutadapt read from stdin)
cutadapt -a {seq_3p} -j 8 --trim-n {fn_fq} | cutadapt -u -{3p_phase} -m 15 -o {fn_out} -
# if 3p_phase == 0 and 5p_phase > 0
cutadapt -a {seq_3p} -j 8 --trim-n {fn_fq} | cutadapt -u {5p_phase} -m 15 -o {fn_out} -
# if 3p_phase > 0 and 5p_phase > 0
cutadapt -a {seq_3p} -j 8 --trim-n {fn_fq} | cutadapt -u -{3p_phase} -u {5p_phase} -m 15 -o {fn_out} -
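
The case distinctions above can be collected into one small helper that returns the command as a string. This is a sketch under stated assumptions, not part of FindAdapt: the function name `build_cutadapt_command` and its parameters are hypothetical, it only uses the cutadapt options already shown (-a, -u, -m, -j, --trim-n, -o), and it passes `-` as the input of the second cutadapt so that it reads the piped reads from stdin.

```python
import shlex


def _join(args):
    """Quote each argument and join into one shell command string."""
    return " ".join(shlex.quote(str(a)) for a in args)


def build_cutadapt_command(fn_fq, fn_out, seq_3p="", phase_3p=0, phase_5p=0,
                           min_len=15, threads=8):
    """Build a cutadapt command line from the FindAdapt output columns."""
    if not seq_3p:
        # No 3' adapter: only the 5' random bases can be removed.
        if phase_5p <= 0:
            raise ValueError("nothing to trim: no 3' adapter and no 5' random bases")
        return _join(["cutadapt", "-u", phase_5p, "-m", min_len, "-j", threads,
                      "--trim-n", fn_fq, "-o", fn_out])
    if phase_3p == 0 and phase_5p == 0:
        # Plain 3' adapter without random bases: a single cutadapt call.
        return _join(["cutadapt", "-a", seq_3p, "-m", min_len, "-j", threads,
                      "--trim-n", fn_fq, "-o", fn_out])
    # Random bases present: trim the adapter first, then cut fixed lengths.
    first = ["cutadapt", "-a", seq_3p, "-j", threads, "--trim-n", fn_fq]
    second = ["cutadapt"]
    if phase_3p > 0:
        second += ["-u", f"-{phase_3p}"]  # negative value cuts from the 3' end
    if phase_5p > 0:
        second += ["-u", phase_5p]        # positive value cuts from the 5' end
    second += ["-m", min_len, "-o", fn_out, "-"]  # "-" reads from stdin
    return _join(first) + " | " + _join(second)


print(build_cutadapt_command("reads.fastq.gz", "trimmed.fastq.gz",
                             seq_3p="TGGAATTCTCGG", phase_3p=4, phase_5p=4))
```

The minimum length filter is applied only in the last step, so reads are not discarded before the random bases have been removed.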