Foldtree_ProstT5

⚠️ CAUTION: This pipeline has not been benchmarked yet. Use at your own risk and validate results carefully.

A Snakemake pipeline that provides a Foldtree replacement for phylogenetic tree construction when protein structures are not available. This pipeline leverages ProstT5 embeddings through Foldseek to generate statistically corrected and rooted sequence identity trees.

Overview

Foldtree_ProstT5 is designed for scenarios where:

  • Protein structures are unavailable for your sequences of interest
  • You need phylogenetic trees based on structural similarity estimates
  • Traditional Foldtree cannot be used due to lack of structural data

The pipeline operates in "fident mode" (sequence identity mode) and provides:

  • ✅ Statistically corrected sequence identity trees
  • ✅ Rooted phylogenetic trees
  • Does NOT output LDDT distance matrices
  • Does NOT output TM-score distance matrices

Prerequisites

Required Software

Required Data

You must download the ProstT5 weights before running this pipeline:

# Download ProstT5 weights using Foldseek
foldseek databases ProstT5 weights tmp

Important: Ensure you have sufficient disk space (~XX GB) for the ProstT5 database.

Quick Start

1. Installation

# Clone the repository
git clone https://github.com/DessimozLab/Foldtree_ProstT5.git
cd Foldtree_ProstT5

# Install Snakemake (if not already installed)
conda install -c bioconda snakemake

# Set up the project structure
make setup

2. Download ProstT5 Weights

# This may take several hours and requires significant disk space
foldseek databases ProstT5 data/prostT5_weights tmp

Make sure to set the correct path to the weights in the config file at workflow/config/config_vars.yaml:

prostt5_weights: /path/to/your/Foldtree_ProstT5/data/prostT5_weights
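
A wrong weights path is an easy mistake to make at this step. The following is a minimal sketch, not part of the pipeline, that reads the prostt5_weights entry from a config file and reports whether the path exists. The function name check_weights_path is hypothetical, and the sketch assumes the key sits on a single line in the form shown above.

```shell
# Hypothetical helper: check that the prostt5_weights path in the config exists.
# Assumes a single-line "prostt5_weights: /some/path" entry.
check_weights_path() {
  config_file="$1"
  # Extract the value after the key and strip trailing whitespace.
  weights_path=$(sed -n 's/^prostt5_weights:[[:space:]]*//p' "$config_file" \
    | sed 's/[[:space:]]*$//' | head -n 1)
  if [ -z "$weights_path" ]; then
    echo "prostt5_weights is not set in $config_file" >&2
    return 1
  fi
  if [ ! -e "$weights_path" ]; then
    echo "Weights path not found: $weights_path" >&2
    return 1
  fi
  echo "Found ProstT5 weights at: $weights_path"
}
```

After defining the function, it could be called as check_weights_path workflow/config/config_vars.yaml before starting a run.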

3. Prepare Input Data

Place your protein sequences in FASTA format in data/sequences/:

cp your_sequences.fasta data/sequences/
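
Before copying files in, it can help to sanity-check them. The sketch below is an optional pre-check, not the pipeline's own validation step: it counts the records in a FASTA file and flags duplicated sequence IDs. The function name fasta_summary is hypothetical, and treating the header text up to the first whitespace as the ID is an assumption.

```shell
# Hypothetical helper: count sequences and report duplicated IDs in a FASTA file.
fasta_summary() {
  fasta_file="$1"
  # Each record starts with a ">" header line.
  n_seqs=$(grep -c '^>' "$fasta_file")
  # Assumption: the ID is the header text up to the first whitespace.
  dup_ids=$(grep '^>' "$fasta_file" | sed 's/^>//' | awk '{print $1}' | sort | uniq -d)
  echo "sequences: $n_seqs"
  if [ -n "$dup_ids" ]; then
    echo "duplicate IDs: $(echo "$dup_ids" | tr '\n' ' ')" >&2
    return 1
  fi
}
```

For example, fasta_summary your_sequences.fasta prints the record count and returns a non-zero status if any ID appears more than once.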

4. Run the Pipeline

# Run with Snakemake
snakemake -s workflow/rules/fold_tree_prostT5 --use-conda --cores 4 --config folder=./data/sequences

Pipeline Workflow

  1. Sequence Preprocessing: Validates and filters input sequences
  2. ProstT5 Embedding: Generates structural embeddings using Foldseek + ProstT5
  3. Distance Calculation: Computes sequence identity from embeddings
  4. Statistical Correction: Applies evolutionary distance corrections
  5. Tree Construction: Builds initial phylogenetic tree from distance matrix
  6. Tree Rooting: Roots the tree using specified method
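
To illustrate the kind of transformation that step 4 refers to, here is a sketch of a Jukes-Cantor-style correction that turns a fractional identity into an evolutionary distance. This is an assumption for illustration only: the README does not state which correction the pipeline applies, and the function name corrected_distance is hypothetical.

```shell
# Illustrative only: Jukes-Cantor-style distance from a fractional identity (0..1).
# d = -b * ln(1 - p / b), with p = 1 - fident and b = 19/20 for a 20-letter alphabet.
# The choice of correction is an assumption, not the pipeline's actual code.
corrected_distance() {
  awk -v fident="$1" 'BEGIN {
    b = 19 / 20
    p = 1 - fident
    if (p >= b) { print "inf"; exit }   # saturated: the distance is undefined
    d = (p == 0) ? 0 : -b * log(1 - p / b)
    printf "%.4f\n", d
  }'
}
```

With this form, identical sequences (fident of 1) map to a distance of 0.0000, and corrected_distance 0.5 prints 0.7099, which is larger than the uncorrected value of 0.5 because the correction accounts for repeated substitutions at the same site.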

Limitations & Important Notes

⚠️ Critical Limitations:

  • No LDDT matrices: This pipeline cannot output LDDT-based distance matrices
  • No TM-score matrices: TM-score calculations are not supported
  • Fident mode only: Only operates in sequence identity mode, not structural similarity mode
  • Requires ProstT5 weights: Must download large database files (~XX GB)
  • Not benchmarked: Results should be validated against known phylogenies when possible

Cluster Execution

# For SLURM clusters
snakemake --cluster-config config/cluster_config.yaml \
          --cluster "sbatch --partition=normal --time=4:00:00" \
          --jobs 20 --use-conda --directory /path/to/your/dataset

Troubleshooting

Common Issues

  1. ProstT5 database missing: Ensure you've downloaded the weights using foldseek databases
  2. Memory errors: Increase memory allocation in cluster config for large datasets
  3. Slow performance: Use more cores or consider splitting large sequence sets

Citation

If you use this pipeline, please cite:

  • Foldseek: [Steinegger & Söding, 2022]
  • ProstT5: [Heinzinger et al., 2023]
  • Foldtree: [Moi et al., 2025]

Contributing

This pipeline is under active development. Please report issues or contribute improvements via GitHub.

License

This project is licensed under the MIT License. See the LICENSE file for details.
