ProteomeLM: A proteome-scale language model allowing fast prediction of protein-protein interactions and gene essentiality across taxa

![ProteomeLM overview](img/main_fig.png)

Overview

ProteomeLM is a transformer-based language model that reasons over entire proteomes from species spanning the tree of life. Unlike existing protein language models, which operate on individual sequences, ProteomeLM learns contextualized protein representations by leveraging the functional constraints present at the proteome scale.

Key Contributions

  • Proteome-scale modeling: First language model to process entire proteomes across eukaryotes and prokaryotes, capturing inter-protein dependencies and functional constraints
  • Ultra-fast PPI screening: Screens whole interactomes orders of magnitude faster than classic coevolution-based methods, enabling proteome-wide interaction analysis
  • State-of-the-art performance: Achieves superior results on protein-protein interaction prediction across species and benchmarks through attention-based interaction detection
  • Gene essentiality prediction: Predicts essential genes, with performance that generalizes across diverse taxa
  • Attention-based insights: Spontaneously captures protein-protein interactions in attention coefficients without explicit training on interaction data
  • Hierarchical learning: Leverages OrthoDB taxonomic hierarchy for structured representation learning across the tree of life

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/Bitbol-Lab/ProteomeLM.git
cd ProteomeLM

# Create and activate environment
python3 -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

🤗 Pre-trained Models

All ProteomeLM models are available on Hugging Face Hub. Choose the appropriate model size for your use case:

| Model | Parameters | Size | Hugging Face | Description |
|---|---|---|---|---|
| ProteomeLM-XS | 5.66M | 11.3 MB | `Bitbol-Lab/ProteomeLM-XS` | Ultra-lightweight, for quick inference |
| ProteomeLM-S | 36.9M | 73.8 MB | `Bitbol-Lab/ProteomeLM-S` | Small model balancing speed and accuracy |
| ProteomeLM-M | 112M | 225 MB | `Bitbol-Lab/ProteomeLM-M` | Medium model for most applications (cannot fit the largest proteomes) |
| ProteomeLM-L | 328M | 656 MB | `Bitbol-Lab/ProteomeLM-L` | Large model for maximum performance (can fit the largest proteomes) |

Training Dataset

The training dataset is also available on Hugging Face Hub.

Repository Structure

ProteomeLM/
├── 📄 __init__.py                 # Package initialization
├── 📄 setup.py                    # Package setup script
├── 📋 requirements.txt            # Python dependencies
├── 📄 LICENSE                     # Apache 2.0 license
├── 📄 README.md                   # Project documentation
├── 📄 paper.pdf                   # Research paper
├── 🐳 Dockerfile                  # Container configuration
├── 📁 configs/                    # Training configuration files
│   └── proteomelm.yaml            # Base configuration
├── 📁 proteomelm/                 # Core model implementation
│   ├── __init__.py                # Package initialization
│   ├── cli.py                     # Command-line interface
│   ├── config_manager.py          # Configuration management
│   ├── modeling_proteomelm.py     # ProteomeLM model architecture
│   ├── trainer.py                 # Custom training logic
│   ├── train.py                   # Training functions
│   ├── dataloaders.py             # Data loading utilities
│   ├── encode_dataset.py          # Dataset encoding
│   ├── utils.py                   # Utility functions
│   └── ppi/                       # PPI-specific components
│       ├── __init__.py            # Package initialization
│       ├── config.py              # PPI configuration
│       ├── data_processing.py     # Data preprocessing
│       ├── evaluation.py          # Performance evaluation
│       ├── experiment_runner.py   # Experiment management
│       ├── feature_extraction.py  # Feature engineering
│       ├── main.py                # Main PPI runner
│       ├── model.py               # PPI models
│       └── utils.py               # PPI utilities
├── 📁 experiments/                # Research experiments
│   ├── __init__.py                # Package initialization
│   ├── fast_orthodb_matching.py   # Ortholog matching utilities
│   ├── nb_plots.ipynb             # Analysis notebook
│   └── interactomes/              # Interactome analysis
│       ├── human.ipynb            # Human interactome analysis
│       └── pathogens.ipynb        # Pathogen interactome analysis
├── 📁 notebooks/                  # Analysis notebooks
│   ├── ppi_prediction.ipynb       # PPI prediction notebook
│   └── notebooks_utils.py         # Notebook utilities
├── 📁 weights/                    # Pre-trained model weights
│   ├── ProteomeLM-XS/             # Extra-small model weights
│   ├── ProteomeLM-S/              # Small model weights
│   ├── ProteomeLM-M/              # Medium model weights
│   └── ProteomeLM-L/              # Large model weights
├── 📁 data/                       # Data storage
│   ├── interactomes/              # Interaction data
│   │   ├── logistic_regression_model_human.pkl
│   │   └── logistic_regression_model_pathogens.pkl
│   └── orthodb12_raw/             # OrthoDB raw data
│       ├── odb12v0_aa.fasta.gz    # Amino acid sequences
│       ├── odb12v0_OG2genes.tab   # Gene-ortholog mapping
│       └── odb12v0_OG_pairs.tab   # Ortholog pairs
└── 📁 img/                        # Documentation images
    └── main_fig.png               # Main figure

🔧 Usage

Quick Start: Fast PPI prediction

For interactive PPI prediction with multiple data sources, use our comprehensive Jupyter notebook:

# Launch the interactive PPI prediction notebook
jupyter notebook notebooks/ppi_prediction.ipynb


The notebook provides a flexible framework supporting:

Data Sources:

  • Local FASTA files: Upload your own protein sequences
  • STRING database: Download sequences by organism ID (e.g., "9606" for human)
  • UniProt database: Download sequences by taxon ID
  • UniProt IDs: Fetch specific protein sequences by accession
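As an illustration of the UniProt route, a minimal helper for downloading a proteome as FASTA by taxon ID could look like the following. The `rest.uniprot.org` streaming endpoint and its query syntax are standard UniProt REST; the helper functions themselves are a sketch and not part of this repository, which handles downloads inside the notebook.

```python
import urllib.parse
import urllib.request

UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"


def uniprot_fasta_url(taxon_id: str, reviewed_only: bool = True) -> str:
    """Build a UniProt REST URL returning all proteins of a taxon as FASTA."""
    query = f"organism_id:{taxon_id}"
    if reviewed_only:
        # Restrict to Swiss-Prot (reviewed) entries for a cleaner proteome.
        query += " AND reviewed:true"
    params = urllib.parse.urlencode({"query": query, "format": "fasta"})
    return f"{UNIPROT_STREAM}?{params}"


def download_proteome(taxon_id: str, path: str) -> None:
    """Download the FASTA file for a taxon (e.g. '9606' for human)."""
    urllib.request.urlretrieve(uniprot_fasta_url(taxon_id), path)
```

For example, `download_proteome("9606", "human.fasta")` fetches the reviewed human proteome.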

Key Features:

  • Automated ProteomeLM feature extraction using attention mechanisms
  • Pre-trained logistic regression models for PPI prediction
  • STRING annotation comparison and evaluation
  • Comprehensive visualization and analysis
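The features-plus-classifier pipeline above can be sketched with scikit-learn. The synthetic matrix below merely stands in for ProteomeLM attention features (one row per candidate protein pair, one column per attention-derived score); the repository ships pre-trained classifiers in `data/interactomes/`, so this is an illustration of the approach, not the shipped models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for ProteomeLM attention features: one row per candidate
# protein pair, one column per attention-derived score.
n_pairs, n_features = 1000, 16
X = rng.normal(size=(n_pairs, n_features))
# Synthetic labels: interaction driven by the first four features plus noise.
y = (X[:, :4].sum(axis=1) + rng.normal(scale=0.5, size=n_pairs) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]  # interaction probability per pair
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```

In practice you would replace `X` with the attention features extracted by the notebook and, for evaluation, compare `scores` against STRING annotations.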

Gene Essentiality Prediction

TODO

Training ProteomeLM

Train a new model from scratch or fine-tune existing weights:

# Using the CLI interface
python -m proteomelm.cli train --config configs/proteomelm.yaml

# Multi-GPU distributed training
torchrun --nproc_per_node=4 -m proteomelm.cli train \
    --config configs/proteomelm.yaml \
    --distributed

# Fine-tune from a Hugging Face model
python -m proteomelm.cli train --config configs/proteomelm.yaml --pretrained Bitbol-Lab/ProteomeLM-M

Docker Deployment

For containerized execution:

# Build container
docker build -t proteomelm:latest .

# Run training
docker run --gpus all -v $(pwd):/workspace proteomelm:latest \
    python train.py --config configs/proteomelm.yaml

Loading Models

# From Hugging Face Hub (recommended)
from proteomelm import ProteomeLMForMaskedLM

model_xs = ProteomeLMForMaskedLM.from_pretrained("Bitbol-Lab/ProteomeLM-XS")
model_s = ProteomeLMForMaskedLM.from_pretrained("Bitbol-Lab/ProteomeLM-S") 
model_m = ProteomeLMForMaskedLM.from_pretrained("Bitbol-Lab/ProteomeLM-M")
model_l = ProteomeLMForMaskedLM.from_pretrained("Bitbol-Lab/ProteomeLM-L")

# From local weights (after git clone)
model = ProteomeLMForMaskedLM.from_pretrained("weights/ProteomeLM-M")
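Since PPI signals live in the attention coefficients, a typical post-processing step is to reduce per-layer, per-head attention maps to one symmetric score per protein pair. The helper below is a sketch of that reduction, assuming an attention tensor of shape `(n_layers, n_heads, n_proteins, n_proteins)`; the repository's actual feature extraction lives in `proteomelm/ppi/feature_extraction.py` and may differ in detail.

```python
import numpy as np


def pair_scores_from_attention(attn: np.ndarray) -> np.ndarray:
    """Reduce attention maps to one interaction score per protein pair.

    attn: array of shape (n_layers, n_heads, n_proteins, n_proteins).
    Returns a symmetric (n_proteins, n_proteins) score matrix with zero diagonal.
    """
    mean_attn = attn.mean(axis=(0, 1))      # average over layers and heads
    sym = 0.5 * (mean_attn + mean_attn.T)   # symmetrize: score(i, j) == score(j, i)
    np.fill_diagonal(sym, 0.0)              # ignore self-pairs
    return sym
```

Ranking entries of the returned matrix then gives a proteome-wide candidate interaction list.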

Citation

If you use ProteomeLM in your research, please cite our paper:

@article{malbranke2025proteomelm,
  title={ProteomeLM: A proteome-scale language model allowing fast prediction of protein-protein interactions and gene essentiality across taxa},
  author={Malbranke, Cyril and Zalaffi, Gionata Paolo and Bitbol, Anne-Florence},
  journal={bioRxiv},
  pages={2025.08.01.668221},
  year={2025},
  publisher={Cold Spring Harbor Laboratory},
  doi={10.1101/2025.08.01.668221},
  url={https://www.biorxiv.org/content/10.1101/2025.08.01.668221v1}
}

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Acknowledgments

Contact

Cyril Malbranke
