rpetit3/assembly-scan

assembly-scan reads an assembly in FASTA format and outputs summary statistics in TSV or JSON format

assembly-scan

I wanted a quick method to output simple summary statistics of an input assembly in TSV or JSON format. There are alternatives including assemblathon-stats.pl and assembly-stats, but they didn't output what I wanted. Thus assembly-scan was born.

Installation

Bioconda

assembly-scan is available on Bioconda.

conda create -n assembly-scan -c conda-forge -c bioconda assembly-scan

From Source

While I will always recommend using the Bioconda installation, the only dependency assembly-scan has is Python >=3.7. So, if you have that already you can use the script directly.

git clone git@github.com:rpetit3/assembly-scan.git
cd assembly-scan
python3 bin/assembly-scan YOUR_ASSEMBLY.fasta

From there you can decide to add it to your PATH or not. But, again, I recommend just going the Bioconda route.

Usage

assembly-scan requires an assembly, gzip compressed or uncompressed, as input.

usage: assembly-scan [-h] [--json] [--transpose] [--prefix PREFIX] [--version] ASSEMBLY

Generate statistics for a given assembly.

positional arguments:
  ASSEMBLY         FASTA file to read (gzip or uncompressed)

options:
  -h, --help       show this help message and exit
  --json           Print output in a JSON format
  --transpose      Print output in a transposed tab-delimited format
  --prefix PREFIX  ID to use for output (Default: basename of assembly)
  --version        show program's version number and exit

Example Usage

Many FASTA files are available in the test directory. These include an uncompressed complete phiX174 genome and a compressed Staphylococcus aureus assembly. This script reads the input and outputs summary statistics in tab-delimited format to STDOUT.

Uncompressed

By default, assembly-scan prints its results in tab-delimited format. This example uses the --transpose option because the transposed output is easier to read in the README.

assembly-scan test/phiX174.fna --transpose
test/phiX174.fna        sample  phiX174.fna
test/phiX174.fna        total_contig    1
test/phiX174.fna        total_contig_length     5386
test/phiX174.fna        max_contig_length       5386
test/phiX174.fna        mean_contig_length      5386
test/phiX174.fna        median_contig_length    5386
test/phiX174.fna        min_contig_length       5386
test/phiX174.fna        n50_contig_length       5386
test/phiX174.fna        l50_contig_count        1
test/phiX174.fna        num_contig_non_acgtn    0
test/phiX174.fna        contig_percent_a        23.97
test/phiX174.fna        contig_percent_c        21.48
test/phiX174.fna        contig_percent_g        23.28
test/phiX174.fna        contig_percent_t        31.27
test/phiX174.fna        contig_percent_n        0.00
test/phiX174.fna        contig_non_acgtn        0.00
test/phiX174.fna        contigs_greater_1m      0
test/phiX174.fna        contigs_greater_100k    0
test/phiX174.fna        contigs_greater_10k     0
test/phiX174.fna        contigs_greater_1k      1
test/phiX174.fna        percent_contigs_greater_1m      0.00
test/phiX174.fna        percent_contigs_greater_100k    0.00
test/phiX174.fna        percent_contigs_greater_10k     0.00
test/phiX174.fna        percent_contigs_greater_1k      100.00
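
If you want to use the transposed output in a script, something like the following rough sketch may help. It is not part of assembly-scan; it assumes each row has exactly three tab-separated fields (input path, column name, value), as the example above suggests, and the function name parse_transposed is made up for illustration.

```python
def parse_transposed(text):
    """Collect transposed rows into {input_path: {column: value}}.

    Assumes three tab-separated fields per row, as in the README example.
    Values are kept as strings; convert them as needed.
    """
    results = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        path, column, value = line.split("\t")
        results.setdefault(path, {})[column] = value
    return results


# A few rows copied from the phiX174 example, with tabs as separators.
sample_output = (
    "test/phiX174.fna\tsample\tphiX174.fna\n"
    "test/phiX174.fna\ttotal_contig\t1\n"
    "test/phiX174.fna\ttotal_contig_length\t5386\n"
    "test/phiX174.fna\tcontig_percent_a\t23.97\n"
)

stats = parse_transposed(sample_output)["test/phiX174.fna"]
print(stats["total_contig_length"])
```

Keeping the values as strings avoids guessing which columns are integers and which are percentages.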

gzip Compressed

assembly-scan detects gzip compressed assemblies with a simple check for the .gz extension. This example also demonstrates the --json option output.

assembly-scan test/saureus.fasta.gz --json
{
    "sample": "saureus.fasta.gz",
    "total_contig": 139,
    "total_contig_length": 2761520,
    "max_contig_length": 269921,
    "mean_contig_length": 19867,
    "median_contig_length": 163,
    "min_contig_length": 56,
    "n50_contig_length": 86756,
    "l50_contig_count": 9,
    "num_contig_non_acgtn": 0,
    "contig_percent_a": "33.74",
    "contig_percent_c": "16.50",
    "contig_percent_g": "16.21",
    "contig_percent_t": "33.54",
    "contig_percent_n": "0.00",
    "contig_non_acgtn": "0.00",
    "contigs_greater_1m": 0,
    "contigs_greater_100k": 7,
    "contigs_greater_10k": 37,
    "contigs_greater_1k": 49,
    "percent_contigs_greater_1m": "0.00",
    "percent_contigs_greater_100k": "5.04",
    "percent_contigs_greater_10k": "26.62",
    "percent_contigs_greater_1k": "35.25"
}
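
Note that in the example above the percent values are quoted strings, while the counts and lengths are numbers. A minimal sketch of reading this output from Python might look like the following. The helper names load_stats and run_assembly_scan are hypothetical, and run_assembly_scan assumes assembly-scan is installed and on your PATH.

```python
import json
import subprocess


def load_stats(json_text):
    """Parse assembly-scan JSON, converting quoted percent values to float."""
    stats = json.loads(json_text)
    for key, value in stats.items():
        # Everything except the sample name is numeric; percents arrive as strings.
        if key != "sample" and isinstance(value, str):
            stats[key] = float(value)
    return stats


def run_assembly_scan(fasta_path):
    """Hypothetical wrapper: run assembly-scan with --json and parse the result."""
    result = subprocess.run(
        ["assembly-scan", fasta_path, "--json"],
        capture_output=True, text=True, check=True,
    )
    return load_stats(result.stdout)


# A subset of the S. aureus example output shown above.
example = """{
    "sample": "saureus.fasta.gz",
    "total_contig": 139,
    "contig_percent_c": "16.50",
    "contig_percent_g": "16.21"
}"""

stats = load_stats(example)
gc_percent = round(stats["contig_percent_c"] + stats["contig_percent_g"], 2)
print(gc_percent)
```

The GC percent here is simply the sum of the C and G columns; assembly-scan itself does not report it as a separate column.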

Output Columns

Column Description
sample Either assembly file basename, or value of --prefix
total_contig Total number of contigs in the assembly
total_contig_length Sum of all contig lengths
max_contig_length Length of the longest contig
mean_contig_length Average length of all contigs
median_contig_length Median length of all contigs
min_contig_length Length of the smallest contig
n50_contig_length N50 length of the contigs
l50_contig_count L50: smallest number of contigs whose combined length is at least half the total assembly length
num_contig_non_acgtn Number of contigs containing characters other than A, C, G, T, or N
contig_percent_a Percent of A nucleotides in contigs
contig_percent_c Percent of C nucleotides in contigs
contig_percent_g Percent of G nucleotides in contigs
contig_percent_t Percent of T nucleotides in contigs
contig_percent_n Percent of N nucleotides in contigs
contig_non_acgtn Percent of characters in contigs that are not A, C, G, T, or N
contigs_greater_1m Number of contigs greater than 1,000,000 bp
contigs_greater_100k Number of contigs greater than 100,000 bp
contigs_greater_10k Number of contigs greater than 10,000 bp
contigs_greater_1k Number of contigs greater than 1,000 bp
percent_contigs_greater_1m Percent of contigs greater than 1,000,000 bp
percent_contigs_greater_100k Percent of contigs greater than 100,000 bp
percent_contigs_greater_10k Percent of contigs greater than 10,000 bp
percent_contigs_greater_1k Percent of contigs greater than 1,000 bp
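
For readers unfamiliar with N50 and L50, here is a small sketch of the usual definitions. It is only an illustration and not necessarily how assembly-scan computes these columns internally; the function name n50_l50 is made up for this example.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    Contigs are sorted longest first and summed until the running total
    reaches at least half of the assembly length. N50 is the length of the
    contig that crosses that point; L50 is how many contigs it took.
    """
    total = sum(contig_lengths)
    running = 0
    for count, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        running += length
        if running * 2 >= total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# A single 5,386 bp contig, like the phiX174 example.
print(n50_l50([5386]))

# Toy assembly: total is 32, so the two 8 bp contigs already cover half.
print(n50_l50([2, 2, 2, 3, 3, 4, 8, 8]))
```

For the single-contig phiX174 example this matches the n50_contig_length of 5386 and l50_contig_count of 1 shown earlier.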

Naming

Originally this was named assembly-stats, but after a quick Google search (which, again, I didn't do first; I really should do better!) I found another assembly-stats from Sanger Pathogens. So I decided to rename it to assembly-scan, similar to my fastq-scan tool, since the process is similar to the Scan ability found in some video games, movies, TV shows, and so on. In other words, it 'scans' an assembly and provides the user with otherwise hidden information about the assembly.
