ProteomeLM: A proteome-scale language model allowing fast prediction of protein-protein interactions and gene essentiality across taxa

![ProteomeLM overview](img/main_fig.png)

Overview

ProteomeLM is a transformer-based language model that reasons over entire proteomes from species spanning the tree of life. Unlike existing protein language models, which operate on individual sequences, ProteomeLM learns contextualized protein representations by leveraging the functional constraints present at the proteome scale.

Key Contributions

  • Proteome-scale modeling: First language model to process entire proteomes across eukaryotes and prokaryotes, capturing inter-protein dependencies and functional constraints
  • Ultra-fast PPI screening: Screens whole interactomes orders of magnitude faster than classic coevolution-based methods, enabling proteome-wide interaction analysis
  • State-of-the-art performance: Achieves superior results on protein-protein interaction prediction across species and benchmarks through attention-based interaction detection
  • Gene essentiality prediction: Predicts essential genes, with performance that generalizes across diverse taxa
  • Attention-based insights: Spontaneously captures protein-protein interactions in attention coefficients without explicit training on interaction data
  • Hierarchical learning: Leverages OrthoDB taxonomic hierarchy for structured representation learning across the tree of life

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/Bitbol-Lab/ProteomeLM.git
cd ProteomeLM

# Create and activate environment
python3 -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

🤗 Pre-trained Models

All ProteomeLM models are available on Hugging Face Hub. Choose the appropriate model size for your use case:

| Model | Parameters | Size | Hugging Face | Description |
|---|---|---|---|---|
| ProteomeLM-XS | 5.66M | 11.3 MB | `Bitbol-Lab/ProteomeLM-XS` | Ultra-lightweight, for quick inference |
| ProteomeLM-S | 36.9M | 73.8 MB | `Bitbol-Lab/ProteomeLM-S` | Small model balancing speed and accuracy |
| ProteomeLM-M | 112M | 225 MB | `Bitbol-Lab/ProteomeLM-M` | Medium model for most applications (cannot fit the largest proteomes) |
| ProteomeLM-L | 328M | 656 MB | `Bitbol-Lab/ProteomeLM-L` | Large model for maximum performance (can fit the largest proteomes) |

Training Dataset

The training dataset is also available on Hugging Face Hub.

Repository Structure

ProteomeLM/
├── 📄 __init__.py                 # Package initialization
├── 📄 setup.py                    # Package setup script
├── 📋 requirements.txt            # Python dependencies
├── 📄 LICENSE                     # Apache 2.0 license
├── 📄 README.md                   # Project documentation
├── 📄 paper.pdf                   # Research paper
├── 🐳 Dockerfile                  # Container configuration
├── 📁 configs/                    # Training configuration files
│   └── proteomelm.yaml            # Base configuration
├── 📁 proteomelm/                 # Core model implementation
│   ├── __init__.py                # Package initialization
│   ├── cli.py                     # Command-line interface
│   ├── config_manager.py          # Configuration management
│   ├── modeling_proteomelm.py     # ProteomeLM model architecture
│   ├── trainer.py                 # Custom training logic
│   ├── train.py                   # Training functions
│   ├── dataloaders.py             # Data loading utilities
│   ├── encode_dataset.py          # Dataset encoding
│   ├── utils.py                   # Utility functions
│   └── ppi/                       # PPI-specific components
│       ├── __init__.py            # Package initialization
│       ├── config.py              # PPI configuration
│       ├── data_processing.py     # Data preprocessing
│       ├── evaluation.py          # Performance evaluation
│       ├── experiment_runner.py   # Experiment management
│       ├── feature_extraction.py  # Feature engineering
│       ├── main.py                # Main PPI runner
│       ├── model.py               # PPI models
│       └── utils.py               # PPI utilities
├── 📁 experiments/                # Research experiments
│   ├── __init__.py                # Package initialization
│   ├── fast_orthodb_matching.py   # Ortholog matching utilities
│   ├── nb_plots.ipynb             # Analysis notebook
│   └── interactomes/              # Interactome analysis
│       ├── human.ipynb            # Human interactome analysis
│       └── pathogens.ipynb        # Pathogen interactome analysis
├── 📁 notebooks/                  # Analysis notebooks
│   ├── ppi_prediction.ipynb       # PPI prediction notebook
│   └── notebooks_utils.py         # Notebook utilities
├── 📁 weights/                    # Pre-trained model weights
│   ├── ProteomeLM-XS/             # Extra-small model weights
│   ├── ProteomeLM-S/              # Small model weights
│   ├── ProteomeLM-M/              # Medium model weights
│   └── ProteomeLM-L/              # Large model weights
├── 📁 data/                       # Data storage
│   ├── interactomes/              # Interaction data
│   │   ├── logistic_regression_model_human.pkl
│   │   └── logistic_regression_model_pathogens.pkl
│   └── orthodb12_raw/             # OrthoDB raw data
│       ├── odb12v0_aa.fasta.gz    # Amino acid sequences
│       ├── odb12v0_OG2genes.tab   # Gene-ortholog mapping
│       └── odb12v0_OG_pairs.tab   # Ortholog pairs
└── 📁 img/                        # Documentation images
    └── main_fig.png               # Main figure

🔧 Usage

Quick Start: Fast PPI prediction

For interactive PPI prediction with multiple data sources, use our comprehensive Jupyter notebook:

# Launch the interactive PPI prediction notebook
jupyter notebook notebooks/ppi_prediction.ipynb


The notebook provides a flexible framework supporting:

Data Sources:

  • Local FASTA files: Upload your own protein sequences
  • STRING database: Download sequences by organism ID (e.g., "9606" for human)
  • UniProt database: Download sequences by taxon ID
  • UniProt IDs: Fetch specific protein sequences by accession
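As an illustration of the UniProt route, a minimal helper for downloading a proteome as FASTA by taxon ID could look like the following. The `rest.uniprot.org` streaming endpoint and its query syntax are standard UniProt REST; the helper functions themselves are a sketch and not part of this repository, which handles downloads inside the notebook.

```python
import urllib.parse
import urllib.request

UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"


def uniprot_fasta_url(taxon_id: str, reviewed_only: bool = True) -> str:
    """Build a UniProt REST URL returning all proteins of a taxon as FASTA."""
    query = f"organism_id:{taxon_id}"
    if reviewed_only:
        # Restrict to Swiss-Prot (reviewed) entries for a cleaner proteome.
        query += " AND reviewed:true"
    params = urllib.parse.urlencode({"query": query, "format": "fasta"})
    return f"{UNIPROT_STREAM}?{params}"


def download_proteome(taxon_id: str, path: str) -> None:
    """Download the FASTA file for a taxon (e.g. '9606' for human)."""
    urllib.request.urlretrieve(uniprot_fasta_url(taxon_id), path)
```

For example, `download_proteome("9606", "human.fasta")` fetches the reviewed human proteome.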

Key Features:

  • Automated ProteomeLM feature extraction using attention mechanisms
  • Pre-trained logistic regression models for PPI prediction
  • STRING annotation comparison and evaluation
  • Comprehensive visualization and analysis
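The features-plus-classifier pipeline above can be sketched with scikit-learn. The synthetic matrix below merely stands in for ProteomeLM attention features (one row per candidate protein pair, one column per attention-derived score); the repository ships pre-trained classifiers in `data/interactomes/`, so this is an illustration of the approach, not the shipped models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for ProteomeLM attention features: one row per candidate
# protein pair, one column per attention-derived score.
n_pairs, n_features = 1000, 16
X = rng.normal(size=(n_pairs, n_features))
# Synthetic labels: interaction driven by the first four features plus noise.
y = (X[:, :4].sum(axis=1) + rng.normal(scale=0.5, size=n_pairs) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]  # interaction probability per pair
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```

In practice you would replace `X` with the attention features extracted by the notebook and, for evaluation, compare `scores` against STRING annotations.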

Gene Essentiality Prediction

TODO

Training ProteomeLM

Train a new model from scratch or fine-tune existing weights:

# Using the CLI interface
python -m proteomelm.cli train --config configs/proteomelm.yaml

# Multi-GPU distributed training
torchrun --nproc_per_node=4 -m proteomelm.cli train \
    --config configs/proteomelm.yaml \
    --distributed

# Fine-tune from a Hugging Face model
python -m proteomelm.cli train --config configs/proteomelm.yaml --pretrained Bitbol-Lab/ProteomeLM-M

Docker Deployment

For containerized execution:

# Build container
docker build -t proteomelm:latest .

# Run training
docker run --gpus all -v $(pwd):/workspace proteomelm:latest \
    python train.py --config configs/proteomelm.yaml

Loading Models

# From Hugging Face Hub (recommended)
from proteomelm import ProteomeLMForMaskedLM

model_xs = ProteomeLMForMaskedLM.from_pretrained("Bitbol-Lab/ProteomeLM-XS")
model_s = ProteomeLMForMaskedLM.from_pretrained("Bitbol-Lab/ProteomeLM-S") 
model_m = ProteomeLMForMaskedLM.from_pretrained("Bitbol-Lab/ProteomeLM-M")
model_l = ProteomeLMForMaskedLM.from_pretrained("Bitbol-Lab/ProteomeLM-L")

# From local weights (after git clone)
model = ProteomeLMForMaskedLM.from_pretrained("weights/ProteomeLM-M")
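Since PPI signals live in the attention coefficients, a typical post-processing step is to reduce per-layer, per-head attention maps to one symmetric score per protein pair. The helper below is a sketch of that reduction, assuming an attention tensor of shape `(n_layers, n_heads, n_proteins, n_proteins)`; the repository's actual feature extraction lives in `proteomelm/ppi/feature_extraction.py` and may differ in detail.

```python
import numpy as np


def pair_scores_from_attention(attn: np.ndarray) -> np.ndarray:
    """Reduce attention maps to one interaction score per protein pair.

    attn: array of shape (n_layers, n_heads, n_proteins, n_proteins).
    Returns a symmetric (n_proteins, n_proteins) score matrix with zero diagonal.
    """
    mean_attn = attn.mean(axis=(0, 1))      # average over layers and heads
    sym = 0.5 * (mean_attn + mean_attn.T)   # symmetrize: score(i, j) == score(j, i)
    np.fill_diagonal(sym, 0.0)              # ignore self-pairs
    return sym
```

Ranking entries of the returned matrix then gives a proteome-wide candidate interaction list.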

Citation

If you use ProteomeLM in your research, please cite our paper:

@article{malbranke2025proteomelm,
  title={ProteomeLM: A proteome-scale language model allowing fast prediction of protein-protein interactions and gene essentiality across taxa},
  author={Malbranke, Cyril and Zalaffi, Gionata Paolo and Bitbol, Anne-Florence},
  journal={bioRxiv},
  pages={2025.08.01.668221},
  year={2025},
  publisher={Cold Spring Harbor Laboratory},
  doi={10.1101/2025.08.01.668221},
  url={https://www.biorxiv.org/content/10.1101/2025.08.01.668221v1}
}

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Acknowledgments

Contact

Cyril Malbranke
