ProteomeLM: A proteome-scale language model allowing fast prediction of protein-protein interactions and gene essentiality across taxa
ProteomeLM is a transformer-based language model that reasons on entire proteomes from species spanning the tree of life. Unlike existing protein language models that operate on individual sequences, ProteomeLM learns contextualized protein representations by leveraging the functional constraints present at the proteome scale.
- Proteome-scale modeling: First language model to process entire proteomes across eukaryotes and prokaryotes, capturing inter-protein dependencies and functional constraints
- Ultra-fast PPI screening: Screens whole interactomes orders of magnitude faster than classic coevolution-based methods, enabling proteome-wide interaction analysis
- State-of-the-art performance: Achieves superior results on protein-protein interaction prediction across species and benchmarks through attention-based interaction detection
- Gene essentiality prediction: Novel capability to predict essential genes, generalizing across diverse taxa
- Attention-based insights: Spontaneously captures protein-protein interactions in attention coefficients without explicit training on interaction data
- Hierarchical learning: Leverages OrthoDB taxonomic hierarchy for structured representation learning across the tree of life
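The attention-based interaction signal above can be illustrated with a toy sketch (hypothetical; ProteomeLM's actual extraction code lives in `proteomelm/ppi/feature_extraction.py`): given a protein-by-protein attention matrix, symmetrizing it and ranking the off-diagonal entries yields candidate interacting pairs.

```python
def rank_pairs_from_attention(attn, top_k=3):
    """Symmetrize a protein-by-protein attention matrix and rank
    off-diagonal pairs by their averaged attention weight.

    attn: square nested list, attn[i][j] = attention of protein i on protein j.
    Returns the top_k (i, j, score) tuples, highest score first.
    """
    n = len(attn)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # Attention is directional; average both directions to get
            # a symmetric pairwise interaction score.
            score = (attn[i][j] + attn[j][i]) / 2.0
            pairs.append((i, j, score))
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs[:top_k]

# Toy 4-protein attention matrix (rows attend to columns; values invented).
attn = [
    [0.70, 0.20, 0.05, 0.05],
    [0.60, 0.30, 0.05, 0.05],
    [0.10, 0.10, 0.40, 0.40],
    [0.05, 0.05, 0.50, 0.40],
]
print(rank_pairs_from_attention(attn, top_k=2))
```

Here proteins 2–3 and 0–1 surface as the strongest candidate pairs; the real pipeline derives richer features from many attention heads before classification.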
# Clone the repository
git clone https://github.com/Bitbol-Lab/ProteomeLM.git
cd ProteomeLM
# Create and activate environment
python3 -m venv venv
source venv/bin/activate # Linux/Mac
# venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt

All ProteomeLM models are available on the Hugging Face Hub. Choose the appropriate model size for your use case:
| Model | Parameters | Size | Hugging Face | Description |
|---|---|---|---|---|
| ProteomeLM-XS | 5.66M | 11.3MB | Bitbol-Lab/ProteomeLM-XS | Ultra-lightweight model for quick inference |
| ProteomeLM-S | 36.9M | 73.8MB | Bitbol-Lab/ProteomeLM-S | Small model balancing speed and accuracy |
| ProteomeLM-M | 112M | 225MB | Bitbol-Lab/ProteomeLM-M | Medium model for most applications (cannot fit the largest proteomes) |
| ProteomeLM-L | 328M | 656MB | Bitbol-Lab/ProteomeLM-L | Large model for maximum performance (can fit the largest proteomes) |
The training dataset is also available on Hugging Face:
- ProteomeLM-dataset: Preprocessed OrthoDB embeddings and hierarchical data
ProteomeLM/
├── __init__.py                   # Package initialization
├── setup.py                      # Package setup script
├── requirements.txt              # Python dependencies
├── LICENSE                       # Apache 2.0 license
├── README.md                     # Project documentation
├── paper.pdf                     # Research paper
├── Dockerfile                    # Container configuration
├── configs/                      # Training configuration files
│   └── proteomelm.yaml           # Base configuration
├── proteomelm/                   # Core model implementation
│   ├── __init__.py               # Package initialization
│   ├── cli.py                    # Command-line interface
│   ├── config_manager.py         # Configuration management
│   ├── modeling_proteomelm.py    # ProteomeLM model architecture
│   ├── trainer.py                # Custom training logic
│   ├── train.py                  # Training functions
│   ├── dataloaders.py            # Data loading utilities
│   ├── encode_dataset.py         # Dataset encoding
│   ├── utils.py                  # Utility functions
│   └── ppi/                      # PPI-specific components
│       ├── __init__.py           # Package initialization
│       ├── config.py             # PPI configuration
│       ├── data_processing.py    # Data preprocessing
│       ├── evaluation.py         # Performance evaluation
│       ├── experiment_runner.py  # Experiment management
│       ├── feature_extraction.py # Feature engineering
│       ├── main.py               # Main PPI runner
│       ├── model.py              # PPI models
│       └── utils.py              # PPI utilities
├── experiments/                  # Research experiments
│   ├── __init__.py               # Package initialization
│   ├── fast_orthodb_matching.py  # Ortholog matching utilities
│   ├── nb_plots.ipynb            # Analysis notebook
│   └── interactomes/             # Interactome analysis
│       ├── human.ipynb           # Human interactome analysis
│       └── pathogens.ipynb       # Pathogen interactome analysis
├── notebooks/                    # Analysis notebooks
│   ├── ppi_prediction.ipynb      # PPI prediction notebook
│   └── notebooks_utils.py        # Notebook utilities
├── weights/                      # Pre-trained model weights
│   ├── ProteomeLM-XS/            # Extra small model weights
│   ├── ProteomeLM-S/             # Small model weights
│   ├── ProteomeLM-M/             # Medium model weights
│   └── ProteomeLM-L/             # Large model weights
├── data/                         # Data storage
│   ├── interactomes/             # Interaction data
│   │   ├── logistic_regression_model_human.pkl
│   │   └── logistic_regression_model_pathogens.pkl
│   └── orthodb12_raw/            # OrthoDB raw data
│       ├── odb12v0_aa.fasta.gz   # Amino acid sequences
│       ├── odb12v0_OG2genes.tab  # Gene-ortholog mapping
│       └── odb12v0_OG_pairs.tab  # Ortholog pairs
└── img/                          # Documentation images
    └── main_fig.png              # Main figure
For interactive PPI prediction with multiple data sources, use our comprehensive Jupyter notebook:
# Launch the interactive PPI prediction notebook
jupyter notebook notebooks/ppi_prediction.ipynb

The notebook provides a flexible framework supporting:
Data Sources:
- Local FASTA files: Upload your own protein sequences
- STRING database: Download sequences by organism ID (e.g., "9606" for human)
- UniProt database: Download sequences by taxon ID
- UniProt IDs: Fetch specific protein sequences by accession
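As a minimal illustration of the UniProt data sources above (the notebook ships its own download helpers; these URL builders are a sketch based on the public UniProt REST endpoints, which are an assumption here, not ProteomeLM code):

```python
from urllib.parse import quote

def uniprot_fasta_url(accession):
    """URL for one protein's FASTA record by UniProt accession."""
    return f"https://rest.uniprot.org/uniprotkb/{quote(accession)}.fasta"

def uniprot_taxon_url(taxon_id):
    """URL streaming all FASTA records for a given NCBI taxon ID."""
    return (
        "https://rest.uniprot.org/uniprotkb/stream"
        f"?query=organism_id:{quote(str(taxon_id))}&format=fasta"
    )

print(uniprot_fasta_url("P04637"))  # single accession (human p53)
print(uniprot_taxon_url(9606))      # whole human proteome by taxon ID
```

The returned URLs can then be fetched with any HTTP client and fed to the notebook's FASTA-loading path.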
Key Features:
- Automated ProteomeLM feature extraction using attention mechanisms
- Pre-trained logistic regression models for PPI prediction
- STRING annotation comparison and evaluation
- Comprehensive visualization and analysis
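The pre-trained logistic regression step can be sketched as follows (a toy stand-in with invented coefficients; in the repository the fitted models are shipped as `data/interactomes/*.pkl` and applied to attention-derived features):

```python
import math

def ppi_probability(features, weights, bias):
    """Score one protein pair: sigmoid of a linear combination of
    attention-derived features -- what a fitted logistic regression
    computes at prediction time."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 3-dimensional feature vector and coefficients.
score = ppi_probability([0.8, 0.1, 0.4], weights=[2.0, -1.0, 0.5], bias=-0.5)
print(f"interaction probability: {score:.3f}")
```

In practice the shipped `.pkl` models would be loaded with `pickle` and applied via their `predict_proba` method rather than re-implemented.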
TODO
Train a new model from scratch or fine-tune existing weights:
# Using the CLI interface
python -m proteomelm.cli train --config configs/proteomelm.yaml
# Multi-GPU distributed training
torchrun --nproc_per_node=4 -m proteomelm.cli train \
--config configs/proteomelm.yaml \
--distributed
# Fine-tune from Hugging Face model
python -m proteomelm.cli train --config configs/proteomelm.yaml --pretrained Bitbol-Lab/ProteomeLM-M
# Advanced training with custom parameters
python -m proteomelm.cli train --config configs/proteomelm.yaml

For containerized execution:
# Build container
docker build -t proteomelm:latest .
# Run training
docker run --gpus all -v $(pwd):/workspace proteomelm:latest \
python -m proteomelm.cli train --config configs/proteomelm.yaml

# From Hugging Face Hub (recommended)
from proteomelm import ProteomeLMForMaskedLM
model_xs = ProteomeLMForMaskedLM.from_pretrained("Bitbol-Lab/ProteomeLM-XS")
model_s = ProteomeLMForMaskedLM.from_pretrained("Bitbol-Lab/ProteomeLM-S")
model_m = ProteomeLMForMaskedLM.from_pretrained("Bitbol-Lab/ProteomeLM-M")
model_l = ProteomeLMForMaskedLM.from_pretrained("Bitbol-Lab/ProteomeLM-L")
# From local weights (after git clone)
model = ProteomeLMForMaskedLM.from_pretrained("weights/ProteomeLM-M")

If you use ProteomeLM in your research, please cite our paper:
@article{malbranke2025proteomelm,
title={ProteomeLM: A proteome-scale language model allowing fast prediction of protein-protein interactions and gene essentiality across taxa},
author={Malbranke, Cyril and Zalaffi, Gionata Paolo and Bitbol, Anne-Florence},
journal={bioRxiv},
pages={2025.08.01.668221},
year={2025},
publisher={Cold Spring Harbor Laboratory},
doi={10.1101/2025.08.01.668221},
url={https://www.biorxiv.org/content/10.1101/2025.08.01.668221v1}
}

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
- EvolutionaryScale team for developing ESM-C
- Paper on bioRxiv
- Model Collection
- Training Dataset
- Source Code
- Report Issues
