A Snakemake pipeline that provides a Foldtree replacement for phylogenetic tree construction when protein structures are not available. This pipeline leverages ProstT5 embeddings through Foldseek to generate statistically corrected and rooted sequence identity trees.
Foldtree_ProstT5 is designed for scenarios where:
- Protein structures are unavailable for your sequences of interest
- You need phylogenetic trees based on structural similarity estimates
- Traditional Foldtree cannot be used due to lack of structural data
The pipeline operates in "fident mode" (sequence identity mode) and provides:
- ✅ Statistically corrected sequence identity trees
- ✅ Rooted phylogenetic trees
- ❌ Does NOT output LDDT distance matrices
- ❌ Does NOT output TM-score distance matrices
Prerequisites:
- Snakemake (≥7.0)
- Foldseek
- Conda/Mamba for environment management
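
Before installing anything else, it can help to confirm that the tools listed above are reachable from your shell. The following is a minimal sketch, assuming the executables are named `snakemake`, `foldseek`, `conda` and `mamba`; the function names are hypothetical, and the script only checks that the tools are on `PATH`, not their versions.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["snakemake", "foldseek"]
ENV_MANAGERS = ["conda", "mamba"]  # either one is enough


def find_tools(names):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_prerequisites():
    """Return a list of human-readable problems; empty if everything was found."""
    problems = [
        f"{name} not found on PATH"
        for name, path in find_tools(REQUIRED_TOOLS).items()
        if path is None
    ]
    if not any(find_tools(ENV_MANAGERS).values()):
        problems.append("neither conda nor mamba found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("MISSING:", problem)
```

An empty result only means the executables were found; you still need to check that Snakemake meets the minimum version yourself.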
You must download the ProstT5 weights before running this pipeline:
# Download ProstT5 weights using Foldseek
foldseek databases ProstT5 weights tmp
Important: Ensure you have sufficient disk space (~XX GB) for the ProstT5 database.
# Clone the repository
git clone https://github.com/DessimozLab/Foldtree_ProstT5.git
cd Foldtree_ProstT5
# Install Snakemake (if not already installed)
conda install -c bioconda snakemake
# Set up the project structure
make setup
# This may take several hours and requires significant disk space
foldseek databases ProstT5 data/prostT5_weights tmp
Make sure to set the correct path in the config file at workflow/config/config_vars.yaml:
prostt5_weights: /path/to/your/Foldtree_ProstT5/data/prostT5_weights
Place your protein sequences in FASTA format in data/sequences/:
cp your_sequences.fasta data/sequences/
# Run with Snakemake
snakemake -s workflow/rules/fold_tree_prostT5 --use-conda --cores 4 --config folder=./data/sequences
The pipeline runs the following steps:
- Sequence Preprocessing: Validates and filters input sequences
- ProstT5 Embedding: Generates structural embeddings using Foldseek + ProstT5
- Distance Calculation: Computes sequence identity from embeddings
- Statistical Correction: Applies evolutionary distance corrections
- Tree Construction: Builds initial phylogenetic tree from distance matrix
- Tree Rooting: Roots the tree using specified method
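
To make the statistical correction step more concrete, here is a small sketch of how a fractional identity could be turned into an evolutionary distance. It assumes a Jukes-Cantor-style correction for a 20-state alphabet, d = -b ln(1 - p/b) with p = 1 - fident and b = 19/20. The pipeline's own correction may differ in form or parameters, and the function names and the `max_distance` cap are hypothetical.

```python
import math

ALPHABET_SIZE = 20                        # assumption: 20-state alphabet
B = (ALPHABET_SIZE - 1) / ALPHABET_SIZE   # saturation level, 19/20


def corrected_distance(fident, max_distance=10.0):
    """Convert a fractional identity in [0, 1] into a corrected distance.

    Uses d = -B * ln(1 - p / B) with p = 1 - fident. Pairs at or beyond
    saturation (p >= B) are clamped to max_distance, a hypothetical cap.
    """
    if not 0.0 <= fident <= 1.0:
        raise ValueError("fident must be between 0 and 1")
    p = 1.0 - fident
    if p >= B:
        return max_distance
    # log1p(x) computes ln(1 + x) and stays accurate for small p.
    return min(-B * math.log1p(-p / B), max_distance)


def correct_matrix(fident_matrix):
    """Apply the correction to a square matrix of pairwise identities."""
    return [[corrected_distance(value) for value in row] for row in fident_matrix]


if __name__ == "__main__":
    identities = [[1.0, 0.8], [0.8, 1.0]]
    print(correct_matrix(identities))
```

The correction stretches large raw distances more than small ones, which is why a tree built from corrected values can differ from one built directly on 1 - fident.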
Limitations:
- No LDDT matrices: This pipeline cannot output LDDT-based distance matrices
- No TM-score matrices: TM-score calculations are not supported
- Fident mode only: Only operates in sequence identity mode, not structural similarity mode
- Requires ProstT5 weights: Must download large database files (~XX GB)
- Not benchmarked: Results should be validated against known phylogenies when possible
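
One way to act on the validation advice above is to count how many clades differ between a tree from this pipeline and a trusted reference. The sketch below computes a clade-based (rooted) Robinson-Foulds distance. It assumes trees are written as nested Python tuples of leaf names rather than Newick, and the function names are hypothetical; for real output you would first parse the Newick files with a phylogenetics library.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree given as nested tuples.

    Each clade is a frozenset of leaf names. Single leaves and the clade
    containing every leaf are excluded because all trees share them.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)
    return found


def rooted_rf(tree_a, tree_b):
    """Count clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))


if __name__ == "__main__":
    reference = ((("A", "B"), "C"), "D")
    candidate = ((("A", "C"), "B"), "D")
    print(rooted_rf(reference, candidate))
```

A distance of zero means the two rooted topologies agree; the measure ignores branch lengths, so it says nothing about how well distances are calibrated.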
# For SLURM clusters
snakemake --cluster-config config/cluster_config.yaml \
--cluster "sbatch --partition=normal --time=4:00:00" \
--jobs 20 --use-conda --directory /path/to/your/dataset
Common issues:
- ProstT5 database missing: Ensure you've downloaded the weights using foldseek databases
- Memory errors: Increase memory allocation in cluster config for large datasets
- Slow performance: Use more cores or consider splitting large sequence sets
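
For the missing-database case, a quick check of the configured weights path can save a failed run. The sketch below assumes the config file holds a flat `prostt5_weights: <path>` entry, as shown earlier, and reads it with a simple line scan instead of a YAML parser. The function names are hypothetical, and the check does not know which files Foldseek writes, so treat a reported problem as a hint rather than proof.

```python
from pathlib import Path


def read_config_value(config_path, key):
    """Return the value of a top-level `key: value` line, or None if absent.

    This is a deliberately simple line scan, not a YAML parser; it only
    handles flat scalar entries and treats `#` as the start of a comment.
    """
    path = Path(config_path)
    if not path.is_file():
        return None
    for line in path.read_text().splitlines():
        stripped = line.split("#", 1)[0].strip()
        if stripped.startswith(f"{key}:"):
            value = stripped[len(key) + 1:].strip().strip("'\"")
            return value or None
    return None


def weights_problem(config_path="workflow/config/config_vars.yaml"):
    """Return a description of the problem, or None if the path looks usable."""
    value = read_config_value(config_path, "prostt5_weights")
    if value is None:
        return f"prostt5_weights is not set in {config_path}"
    weights = Path(value)
    if not weights.exists():
        return f"{weights} does not exist"
    if weights.is_dir() and not any(weights.iterdir()):
        return f"{weights} is an empty directory"
    return None


if __name__ == "__main__":
    print(weights_problem() or "weights path looks usable")
```

Run it from the repository root so that the default relative config path resolves.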
If you use this pipeline, please cite:
- Foldseek: [Steinegger & Söding, 2022]
- ProstT5: [Heinzinger et al., 2023]
- Foldtree: [Moi et al., 2025]
This pipeline is under active development. Please report issues or contribute improvements via GitHub.
This project is licensed under the MIT License. See the LICENSE file for details.