Foldtree_ProstT5

⚠️ CAUTION: This pipeline has not been benchmarked yet. Use at your own risk and validate results carefully.

A Snakemake pipeline that provides a Foldtree replacement for phylogenetic tree construction when protein structures are not available. This pipeline leverages ProstT5 embeddings through Foldseek to generate statistically corrected and rooted sequence identity trees.

Overview

Foldtree_ProstT5 is designed for scenarios where:

Protein structures are unavailable for your sequences of interest
You need phylogenetic trees based on structural similarity estimates
Traditional Foldtree cannot be used due to lack of structural data

The pipeline operates in "fident mode" (sequence identity mode) and provides:

✅ Statistically corrected sequence identity trees
✅ Rooted phylogenetic trees
❌ Does NOT output LDDT distance matrices
❌ Does NOT output TM-score distance matrices

Prerequisites

Required Software

Snakemake (≥7.0)
Foldseek
Conda/Mamba for environment management

Required Data

You must download the ProstT5 weights before running this pipeline:

# Download ProstT5 weights using Foldseek
foldseek databases ProstT5 weights tmp

Important: Ensure you have sufficient disk space (~XX GB) for the ProstT5 database.

Quick Start

1. Installation

# Clone the repository
git clone https://github.com/DessimozLab/Foldtree_ProstT5.git
cd Foldtree_ProstT5

# Install Snakemake (if not already installed)
conda install -c bioconda snakemake

# Set up the project structure
make setup

2. Download ProstT5 Weights

# This may take several hours and requires significant disk space
foldseek databases ProstT5 data/prostT5_weights tmp

Make sure to set the correct path in the config file at workflow/config/config_vars.yaml:

prostt5_weights: /path/to/your/Foldtree_ProstT5/data/prostT5_weights

3. Prepare Input Data

Place your protein sequences in FASTA format in data/sequences/:

cp your_sequences.fasta data/sequences/
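Before launching a long run, it can help to sanity-check the FASTA input yourself. The following is a minimal sketch and is independent of the pipeline's own preprocessing step: the function names `parse_fasta` and `check_records`, and the choice to flag anything outside the 20 standard amino acids, are assumptions made for illustration.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Return a list of (identifier, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            # Use the first whitespace-delimited token as the identifier.
            tokens = line[1:].split()
            header, chunks = (tokens[0] if tokens else ""), []
        elif header is not None:
            # Lines that appear before the first header are ignored.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_records(records):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    seen = set()
    for identifier, seq in records:
        if not identifier:
            problems.append("record with empty identifier")
        if identifier in seen:
            problems.append(f"duplicate identifier: {identifier}")
        seen.add(identifier)
        if not seq:
            problems.append(f"empty sequence: {identifier}")
        unexpected = sorted(set(seq) - STANDARD_AA)
        if unexpected:
            problems.append(
                f"non-standard residues in {identifier}: {''.join(unexpected)}"
            )
    return problems
```

To use it, read a file with `pathlib.Path("data/sequences/your_sequences.fasta").read_text()`, pass the text to `parse_fasta`, and print whatever `check_records` returns.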

4. Run the Pipeline

# Run with Snakemake
snakemake -s workflow/rules/fold_tree_prostT5 --use-conda --cores 4 --config folder=./data/sequences

Pipeline Workflow

Sequence Preprocessing: Validates and filters input sequences
ProstT5 Embedding: Generates structural embeddings using Foldseek + ProstT5
Distance Calculation: Computes sequence identity from embeddings
Statistical Correction: Applies evolutionary distance corrections
Tree Construction: Builds initial phylogenetic tree from distance matrix
Tree Rooting: Roots the tree using the specified method
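The statistical correction step can be illustrated with a Jukes-Cantor-style correction over a 20-letter alphabet, which maps an observed fraction of identical positions to an estimated number of substitutions per site. This is a minimal sketch: whether the pipeline applies exactly this formula is an assumption, and `corrected_distance` and `n_states` are names chosen for illustration, not part of the workflow.

```python
import math


def corrected_distance(fident, n_states=20):
    """Jukes-Cantor-style distance from a fractional identity in [0, 1].

    Assumption: equal substitution rates among n_states letters. This is
    an illustrative formula, not necessarily the one used by the pipeline.
    """
    if not 0.0 <= fident <= 1.0:
        raise ValueError("fident must lie in [0, 1]")
    b = (n_states - 1) / n_states   # expected mismatch fraction at saturation
    p = 1.0 - fident                # observed mismatch fraction
    if p >= b:
        return math.inf             # saturated: the distance is not estimable
    return -b * math.log(1.0 - p / b)


def correct_matrix(fident_matrix):
    """Apply the correction to every entry of a square identity matrix."""
    return [[corrected_distance(value) for value in row] for row in fident_matrix]
```

The correction leaves identical sequences at distance zero and stretches larger raw distances, so a pair with 50% identity ends up further apart than the uncorrected value of 0.5 would suggest.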

Limitations & Important Notes

⚠️ Critical Limitations:

No LDDT matrices: This pipeline cannot output LDDT-based distance matrices
No TM-score matrices: TM-score calculations are not supported
Fident mode only: Only operates in sequence identity mode, not structural similarity mode
Requires ProstT5 weights: Must download large database files (~XX GB)
Not benchmarked: Results should be validated against known phylogenies when possible

Cluster Execution

# For SLURM clusters
snakemake --cluster-config config/cluster_config.yaml \
          --cluster "sbatch --partition=normal --time=4:00:00" \
          --jobs 20 --use-conda --directory /path/to/your/dataset

Troubleshooting

Common Issues

ProstT5 database missing: Ensure you've downloaded the weights using foldseek databases
Memory errors: Increase memory allocation in cluster config for large datasets
Slow performance: Use more cores or consider splitting large sequence sets
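For the suggestion to split large sequence sets, a minimal sketch of a FASTA splitter is given below. The function name `split_fasta` and the `max_records` parameter are hypothetical. Each resulting file would presumably be run separately and yield its own tree, so splitting makes most sense along meaningful boundaries such as protein families instead of arbitrary batches.

```python
def split_fasta(text, max_records):
    """Split FASTA text into chunks holding at most max_records records each."""
    if max_records < 1:
        raise ValueError("max_records must be at least 1")
    chunks, current, count = [], [], 0
    for line in text.splitlines():
        if line.startswith(">"):
            if count == max_records:
                # The current chunk is full: close it and start a new one.
                chunks.append("\n".join(current) + "\n")
                current, count = [], 0
            count += 1
        if count:
            # Stray lines before the first header are skipped.
            current.append(line)
    if current:
        chunks.append("\n".join(current) + "\n")
    return chunks
```

Each returned string can be written to its own file, for example into separate input folders passed through `--config folder=...`.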

Citation

If you use this pipeline, please cite:

Foldseek: [Steinegger & Söding, 2022]
ProstT5: [Heinzinger et al., 2023]
Foldtree: [Moi et al., 2025]

Contributing

This pipeline is under active development. Please report issues or contribute improvements via GitHub.

License

This project is licensed under the MIT License. See the LICENSE file for details.
