StructPhylogeny is a Snakemake pipeline for building parallel structure-based and sequence-based phylogenies from a folder of protein structures.
It follows the general workflow of Lakshmi et al. (2015), but is implemented with modern tooling:
- GTalign for all-vs-all 3D structural alignment and comparison
- an SDM-like structural dissimilarity matrix computed from pairwise alignments
- a structure tree inferred from the SDM matrix
- sequence extraction directly from the input structures
- MAFFT for multiple sequence alignment
- IQ-TREE 3 for maximum-likelihood sequence phylogeny
- summary QC, distance-matrix comparison, and heatmap/report outputs
- This package requires CUDA. Running it without CUDA is possible, but that needs a different GTalign installation: modify the mamba environment to use the CPU version, following the GTalign repository docs.
- This package was developed and used as part of my diploma thesis; there is no official publication for it yet. If needed, cite the underlying tools, such as GTalign, MAFFT, and IQ-TREE 3.
The repository includes a mamba environment specification in environment.yml.
Create and activate it with:
mamba env create -f environment.yml
mamba activate structphylogeny
Place protein structures in a directory such as data/my_structures/.
Supported structure formats:
- .cif
- .mmcif
- .pdb
- .ent
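To check which files in a folder would count as supported input, the suffix list above can be applied with a few lines of Python. This is only a sketch: the function names `is_supported_structure` and `find_structures` are hypothetical and not part of the pipeline, and the case-insensitive suffix match and non-recursive search are assumptions about how input discovery might behave.

```python
from pathlib import Path

# Suffixes listed above as supported structure formats.
SUPPORTED_SUFFIXES = {".cif", ".mmcif", ".pdb", ".ent"}


def is_supported_structure(path):
    """True if the file name ends in one of the supported suffixes.

    Matching is case-insensitive here; that is an assumption of this sketch.
    """
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES


def find_structures(folder):
    """Return supported structure files directly inside `folder`, sorted by path.

    Subdirectories are not searched (also an assumption of this sketch).
    """
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and is_supported_structure(path)
    )
```

For example, `find_structures("data/my_structures")` returns a sorted list of `Path` objects that can be compared against the manifest the pipeline writes.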
The current default dataset is:
data/mouse_lipocalin_structures
The default configuration is in config/config.yaml.
Run the full pipeline with:
snakemake --cores 8 # For the example data
snakemake --cores 8 --configfile path/to/your/config.yaml # For your own folder of proteins and your MAFFT or IQ-TREE flags
Or dry-run it first:
snakemake --cores 8 --dry-run
Main outputs land under results/:
- results/manifest/structures.tsv: input manifest and basic QC
- results/sequences/proteins.fasta: extracted sequences
- results/sequences/proteins.aligned.fasta: MAFFT alignment
- results/trees/sequence.treefile: IQ-TREE sequence phylogeny
- results/structure/gtalign_pairwise.tsv: parsed GTalign pairwise metrics
- results/structure/sdm.tsv: structural dissimilarity matrix
- results/trees/structure_sdm.nwk: structure tree from the SDM matrix
- results/reports/sdm_heatmap.png: SDM heatmap
- results/reports/sequence_distance_heatmap.png: sequence p-distance heatmap
- results/reports/matrix_correlation.tsv: agreement between structure and sequence distances
- results/reports/report.md: concise summary report
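For readers unfamiliar with the p-distance behind the sequence heatmap: it is the proportion of aligned sites at which two sequences differ. The sketch below shows one way to compute it from aligned sequences. The function names are hypothetical, and skipping columns where either sequence has a gap (pairwise deletion) is an assumption; the pipeline may treat gaps differently.

```python
def p_distance(seq_a, seq_b, gap_chars="-."):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap are skipped (pairwise deletion).
    This gap policy is an assumption of the sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in gap_chars or b in gap_chars:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        return float("nan")  # no comparable columns
    return mismatches / compared


def p_distance_matrix(aligned):
    """Symmetric p-distance matrix for a dict of {name: aligned_sequence}."""
    names = list(aligned)
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = p_distance(aligned[a], aligned[b])
            matrix[a][b] = matrix[b][a] = d
    return names, matrix
```

The nested-dict layout is chosen only to keep the example dependency-free; a matrix like this can be compared entry by entry against the values in results/structure/sdm.tsv.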
- The paper used DALI and a PHYLIP distance-tree method. This repository uses GTalign for structural comparison, as it is a more modern and faster tool.
- The SDM implementation follows the Lakshmi et al. formula and computes PFTE from topologically equivalent residues as GTalign reports them.
- IQ-TREE 3 is used for the sequence phylogeny. The structure phylogeny is inferred from the SDM distance matrix with a neighbor-joining distance-tree method.
- By default, if a structure has multiple chains, the pipeline uses the longest protein chain.
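As a rough illustration of the SDM note above, the sketch below computes a pairwise distance in the commonly cited form SDM = -100 * log(w1 * PFTE + w2 * SRMS), with PFTE the number of topologically equivalent residues divided by the length of the shorter protein, SRMS = 1 - RMSD / 3.0, w1 = ((1 - PFTE) + (1 - SRMS)) / 2 and w2 = (PFTE + SRMS) / 2. Several details are assumptions and may differ from the code in this repository: the function name `sdm_distance`, the natural logarithm, clamping SRMS at 0 for RMSD above 3.0 Å, and the small floor that avoids log(0).

```python
import math


def sdm_distance(n_equivalent, len_a, len_b, rmsd, rmsd_cap=3.0, floor=1e-6):
    """SDM-style dissimilarity for one structure pair (illustrative sketch).

    n_equivalent: number of topologically equivalent (aligned) residues
    len_a, len_b: chain lengths of the two proteins
    rmsd:         RMSD over the equivalent residues, in angstroms
    """
    shorter = min(len_a, len_b)
    if shorter <= 0 or n_equivalent < 0 or rmsd < 0:
        raise ValueError("lengths must be positive; counts and RMSD non-negative")

    # Fraction of the shorter protein that is topologically equivalent.
    pfte = min(1.0, n_equivalent / shorter)
    # Scaled RMSD similarity; clamped at 0 (assumption) so it stays in [0, 1].
    srms = max(0.0, 1.0 - rmsd / rmsd_cap)

    # Weights sum to 1, so the similarity stays in [0, 1].
    w1 = ((1.0 - pfte) + (1.0 - srms)) / 2.0
    w2 = (pfte + srms) / 2.0
    similarity = w1 * pfte + w2 * srms

    # Floor (assumption) keeps the logarithm finite when similarity is 0.
    return -100.0 * math.log(max(similarity, floor))
```

With full coverage and zero RMSD the similarity is 1 and the distance is zero; lower coverage or higher RMSD increases it. Applying such a function to every row of results/structure/gtalign_pairwise.tsv would give a matrix of the same shape as results/structure/sdm.tsv, though the values only match if the assumptions above hold.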
Run the lightweight unit tests with:
PYTHONPATH=src pytest