AlanRockefeller/fixfasta.py

fixfasta.py

Version 1.2 - January 10, 2026

By Alan Rockefeller

A tool for automatically fixing the orientation of fungal ITS sequences in FASTA files. Perfect for cleaning up sequences downloaded from databases like MycoMap or GenBank where some sequences might be reverse-complemented.

What it does

Ever downloaded a bunch of ITS sequences only to find that some are in the wrong orientation, making your phylogenetic tree completely useless until you manually reverse-complement them? This tool automatically detects and fixes that problem by looking for conserved motifs in the ITS region. It flips sequences that are backwards, giving you a clean FASTA file ready for phylogenetic analysis.

Features

  • Smart orientation detection using three conserved ITS motifs (ITS1-F, 5.8S core, ITS4)
  • Conservative Reversal Gate: Sequences are only reversed if there is strong evidence (core + flank presence and high-quality core hit)
  • IUPAC-aware fuzzy matching - handles ambiguous nucleotides
  • Robustness: 'N' and '-' are treated as mismatches to prevent spurious hits in low-quality regions
  • Automatic repair of malformed FASTA files with stray '>' symbols
  • Detailed reporting of which sequences were reversed (or silent operation with -q)
  • Fast - scans each sequence with a sliding-window mismatch count, so large files process quickly
  • Cross-platform consistent: Ensures UTF-8 encoding and Unix line endings (\n) for all output

Installation

Just download the script and make it executable:

wget https://raw.githubusercontent.com/AlanRockefeller/fixfasta.py/main/fixfasta.py
chmod +x fixfasta.py

Requirements:

  • Python 3.7+
  • No external dependencies - uses only Python standard library

Quick Start

Basic usage - fix orientation and see what was changed:

./fixfasta.py sequences.fasta > fixed_sequences.fasta

The script will report which sequences it reversed:

=== Reversed sequences (2) ===
  DQ422012_Russula_ochrospora
  iNat180216325_Russula_sp

Usage Examples

Process multiple files:

cat *.fasta | ./fixfasta.py > all_fixed.fasta

Silent mode (no reports):

./fixfasta.py input.fasta -q > output.fasta

See detailed statistics:

./fixfasta.py input.fasta --stats > output.fasta

Verbose mode to understand decisions:

./fixfasta.py input.fasta -v > output.fasta

Dry run (analyze without modifying):

./fixfasta.py input.fasta --dry-run

Save to a specific output file:

./fixfasta.py input.fasta -o output.fasta

Command Line Options

-h, --help            Show help message
-o, --output          Output file (default: stdout)
-n, --dry-run         Don't write output, just analyze
-v, --verbose         Verbose output showing decision process
-s, --stats           Print orientation statistics
--stats-only          Only print statistics, no sequence output
-q, --quiet           Suppress all warnings and reports
--max-mismatches N    Maximum mismatches per motif (default: 4).
                      Note: Only substitutions are counted (no indels).
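
For readers who want to wrap or reimplement the interface, the options above map naturally onto Python's standard argparse module. The following is only a sketch under stated assumptions, not the script's actual parser: the function name build_parser and the positional argument name inputs are hypothetical, and reading stdin when no input file is given is inferred from the piping example in Usage Examples.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options (-h is added by argparse)."""
    parser = argparse.ArgumentParser(
        prog="fixfasta.py",
        description="Fix the orientation of fungal ITS sequences in FASTA files.",
    )
    # Assumption: zero or more input files; stdin would be read when none are given.
    parser.add_argument("inputs", nargs="*", metavar="FASTA",
                        help="Input FASTA file(s)")
    parser.add_argument("-o", "--output",
                        help="Output file (default: stdout)")
    parser.add_argument("-n", "--dry-run", action="store_true",
                        help="Don't write output, just analyze")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Verbose output showing decision process")
    parser.add_argument("-s", "--stats", action="store_true",
                        help="Print orientation statistics")
    parser.add_argument("--stats-only", action="store_true",
                        help="Only print statistics, no sequence output")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="Suppress all warnings and reports")
    parser.add_argument("--max-mismatches", type=int, default=4, metavar="N",
                        help="Maximum mismatches per motif (substitutions only)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Using store_true flags keeps every switch off by default, which matches the documented behavior of reporting reversals unless -q is given.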

How It Works

The tool uses three conserved motifs commonly found in fungal ITS sequences:

  1. ITS1-F (TCCGTAGGTGAACCTGCGG) - found at the 18S end
  2. 5.8S core (GCATCGATGAAGAACGCAGC) - middle region
  3. ITS4 (TCCTCCGCTTATTGATATGC) - found at the 28S start
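
Searching "both orientations" comes down to reverse-complementing either the sequence or the motifs. Here is a minimal sketch of that operation, assuming the motif strings listed above; the names MOTIFS and reverse_complement are illustrative and are not taken from the script.

```python
# Motif strings copied from the list above; the dict name is illustrative.
MOTIFS = {
    "ITS1-F": "TCCGTAGGTGAACCTGCGG",
    "5.8S core": "GCATCGATGAAGAACGCAGC",
    "ITS4": "TCCTCCGCTTATTGATATGC",
}

# Complement table that also covers IUPAC ambiguity codes
# (for example R = A/G pairs with Y = C/T). Gaps ('-') pass through unchanged.
_COMPLEMENT = str.maketrans(
    "ACGTURYSWKMBDHVNacgturyswkmbdhvn",
    "TGCAAYRSWMKVHDBNtgcaayrswmkvhdbn",
)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    for name, motif in MOTIFS.items():
        print(f"{name}: {motif} <-> {reverse_complement(motif)}")
```

Reverse-complementing the three short motifs once is cheaper than reverse-complementing every input sequence, so a scanner can precompute both versions of each motif and search the original sequence only.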

Orientation Logic

For each sequence, it:

  1. Searches for these motifs in both orientations (forward and reverse-complement).
  2. Uses a fast sliding window to count mismatches (substitutions only).
  3. Picks a Primary Winner:
    • The orientation with more distinct motif hits wins.
    • If tied, the orientation with fewer total mismatches wins.
    • If still tied, the orientation whose best hit occurs earliest wins.
  4. Conservative Reversal Gate: Even if "reverse" is the primary winner, the sequence is only reversed if:
    • It contains a hit for the 5.8S core AND at least one flanking motif (ITS1-F or ITS4).
    • The 5.8S core hit in the reverse orientation is high quality (≤ 2 mismatches).
    • The motifs appear in the correct relative order (ITS4-rev before core, ITS1F-rev after core).
  5. If the reversal gate fails, the sequence stays "forward" (or is marked "uncertain" if no hits at all were found).
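
The decision steps above can be sketched as a small pure function. This is an illustration under assumptions, not the script's code: hits are assumed to arrive as a dict mapping motif name to (start position, mismatches), positions are taken in the coordinates of the sequence as given, "earliest best hit" is read as the start of the lowest-mismatch hit, and all function names are hypothetical.

```python
from typing import Dict, Tuple

# motif name -> (start position in the original sequence, mismatch count)
Hits = Dict[str, Tuple[int, int]]


def _rank(hits: Hits) -> Tuple[int, int, float]:
    """Sort key: more hits, then fewer mismatches, then earlier best hit."""
    if not hits:
        return (0, 0, float("inf"))
    best_pos, _ = min(hits.values(), key=lambda h: (h[1], h[0]))
    return (-len(hits), sum(mm for _, mm in hits.values()), best_pos)


def passes_reversal_gate(rev: Hits, core_max_mismatches: int = 2) -> bool:
    """Apply the three gate conditions to the reverse-orientation hits."""
    core = rev.get("5.8S core")
    its1 = rev.get("ITS1-F")
    its4 = rev.get("ITS4")
    # Core plus at least one flanking motif must be present.
    if core is None or (its1 is None and its4 is None):
        return False
    # The core hit must be high quality.
    if core[1] > core_max_mismatches:
        return False
    # In a backwards sequence, ITS4-rev comes before the core, ITS1F-rev after.
    if its4 is not None and its4[0] >= core[0]:
        return False
    if its1 is not None and its1[0] <= core[0]:
        return False
    return True


def decide(fwd: Hits, rev: Hits) -> str:
    """Return 'forward', 'reverse', or 'uncertain' for one sequence."""
    if not fwd and not rev:
        return "uncertain"
    if _rank(rev) < _rank(fwd) and passes_reversal_gate(rev):
        return "reverse"
    return "forward"
```

Because a full tie leaves the sequence "forward", and the gate can only veto a reversal, every ambiguous case resolves to leaving the input untouched.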

The fuzzy matching understands IUPAC ambiguity codes (R, Y, S, W, K, M, etc.). Note that 'N' and '-' are always treated as mismatches to avoid false positives in low-quality sequence data.
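
A substitution-only, IUPAC-aware scan of this kind might look like the sketch below. It assumes two symbols match when their sets of possible bases overlap, which is one common convention and may differ from what the script does. The names IUPAC, bases_match and best_hit are illustrative; the default of 4 follows the documented --max-mismatches default.

```python
from typing import Optional, Tuple

# Bases each IUPAC code can stand for. 'N' and '-' are deliberately absent,
# so they never match anything and always count as mismatches.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def bases_match(a: str, b: str) -> bool:
    """True when both symbols are known codes that share at least one base."""
    set_a = IUPAC.get(a.upper())
    set_b = IUPAC.get(b.upper())
    if set_a is None or set_b is None:
        return False
    return bool(set(set_a) & set(set_b))


def best_hit(seq: str, motif: str, max_mismatches: int = 4) -> Optional[Tuple[int, int]]:
    """Slide the motif along seq; return (start, mismatches) of the best window.

    Only substitutions are counted (no indels). Ties go to the earliest window.
    Returns None when no window has max_mismatches or fewer.
    """
    best = None
    width = len(motif)
    for start in range(len(seq) - width + 1):
        mismatches = 0
        for s, m in zip(seq[start:start + width], motif):
            if not bases_match(s, m):
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # this window cannot qualify; move on
        else:
            if best is None or mismatches < best[1]:
                best = (start, mismatches)
    return best
```

For example, a read containing GCATCGATGAAGAACGCRGC still scores zero mismatches against the 5.8S core because R covers A, while the same window with N in that position scores one mismatch.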

Real-World Use Case

This tool was originally created to process ITS sequences downloaded from MycoMap for phylogenetic tree construction. It's particularly useful when combining sequences from multiple sources where orientation consistency isn't guaranteed.

Example workflow:

# Download sequences from MycoMap
# ... download process ...

# Fix orientations
./fixfasta.py mycomap.fa ncbi.fa > sequences_oriented.fasta

# Now ready for MAFFT alignment, RAxML, etc.
mafft sequences_oriented.fasta > aligned.fasta

Tips

  • The default behavior shows you what was changed - use -q for silent operation in pipelines.
  • Use --dry-run first on new datasets to see what would be changed.
  • The tool preserves sequence names exactly.
  • Handles messy FASTA files gracefully (like those with stray '>' symbols).
  • All diagnostic output goes to stderr, so stdout piping remains clean.
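
The stdout/stderr split in the last tip is what keeps redirected output clean. Here is a minimal sketch of the pattern, assuming the report format shown in Quick Start; the function write_results and its parameters are hypothetical, and whether the real script prints a report when nothing was reversed is not specified here.

```python
import sys
from typing import Iterable, List, TextIO, Tuple


def write_results(
    records: Iterable[Tuple[str, str]],
    reversed_names: List[str],
    out: TextIO = sys.stdout,
    err: TextIO = sys.stderr,
    quiet: bool = False,
) -> None:
    """Write FASTA records to `out` and the human-readable report to `err`."""
    for name, seq in records:
        # Explicit '\n' keeps Unix line endings in the sequence output.
        out.write(f">{name}\n{seq}\n")
    # Assumption: the report is skipped when nothing was reversed or -q is set.
    if reversed_names and not quiet:
        err.write(f"=== Reversed sequences ({len(reversed_names)}) ===\n")
        for name in reversed_names:
            err.write(f"  {name}\n")


if __name__ == "__main__":
    write_results(
        [("DQ422012_Russula_ochrospora", "ACGT")],
        ["DQ422012_Russula_ochrospora"],
    )
```

With this layout, a redirect such as `> fixed_sequences.fasta` captures only the sequences, while the report still appears in the terminal.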

getfasta.py

Version 1.0 – June 30, 2025

A tiny helper that grabs the FASTA files behind a MycoMap BLAST result page – no clicking, no copy‑and‑paste, just the files on disk ready to go. As of v1.1 it also tells you how many sequences were retrieved.

What it does

  • Pulls the NCBI‑side FASTA (ncbi_<ID>.fasta) and the local MycoMap FASTA (myco_<ID>.fasta) for a given BLAST job.
  • Prints the download time, file size and sequence count for each file.
  • Names everything with the BLAST numeric ID.

Example

python getfasta.py https://mycomap.com/genetics/blast-search/a04-inat237420128-1-ric77-332392-r265167/

Output looks like this:

Downloading FASTA files for MycoBLAST ID: 265167
NCBI downloaded in 2.85s (74953 bytes, 97 sequences)
MycoBLAST downloaded in 1.02s (36161 bytes, 50 sequences)
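
The bookkeeping behind that output is simple to sketch. In the example below, the rule for extracting the numeric ID (the digits after the final "-r" in the URL) is inferred from the single example above and may not hold for every MycoMap URL. The function names are hypothetical, and no network access is shown.

```python
import re
from typing import Optional, Tuple


def blast_id_from_url(url: str) -> Optional[str]:
    """Guess the BLAST numeric ID from the trailing '-r<digits>' of the URL."""
    match = re.search(r"-r(\d+)/?$", url)
    return match.group(1) if match else None


def output_names(blast_id: str) -> Tuple[str, str]:
    """File names following the ncbi_<ID>.fasta / myco_<ID>.fasta pattern."""
    return f"ncbi_{blast_id}.fasta", f"myco_{blast_id}.fasta"


def count_fasta_records(text: str) -> int:
    """Count sequences by counting header lines that start with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


if __name__ == "__main__":
    url = ("https://mycomap.com/genetics/blast-search/"
           "a04-inat237420128-1-ric77-332392-r265167/")
    blast_id = blast_id_from_url(url)
    print(blast_id, output_names(blast_id))
```

Counting header lines is enough for a sequence count because each FASTA record begins with exactly one ">" line.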

License

This project is licensed under the MIT License:

MIT License

Copyright (c) 2026 Alan Rockefeller

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Contributing

Found a bug? Have a suggestion? Feel free to open an issue or submit a pull request!

https://github.com/AlanRockefeller/fixfasta.py

Acknowledgments

Thanks to the mycological community for providing the data that made this tool necessary, and to everyone who's contributed sequences to public databases.

