FastPLMs is an open-source initiative dedicated to accelerating pretrained protein language models (pLMs). By replacing native, often suboptimal attention implementations with Flash Attention or Flex Attention, we provide high-performance alternatives that are fully compatible with the HuggingFace transformers ecosystem.
- Introduction
- Supported Models
- Efficiency: Flex & Flash Attention
- Embedding & Pooling
- Concrete Examples
- Testing & Benchmarking
- Installation & Docker
Protein Language Models are transformer-based architectures trained on massive datasets of protein sequences (such as UniProt). These models learn the "grammar" of proteins, capturing evolutionary information, structural constraints, and functional motifs. They are used for:
- Representation Learning: Generating high-dimensional embeddings for downstream tasks (e.g., stability, function prediction).
- Protein Generation: Designing novel sequences with specific properties.
- Structure Prediction: Mapping sequences to their 3D folds (e.g., Boltz2).
FastPLMs provides optimized versions of these models. Our focus is on:
- Speed: Drastically faster inference through optimized attention kernels.
- Memory Efficiency: Lower VRAM usage, enabling larger batch sizes or longer sequences.
- Seamless Integration: Use `AutoModel.from_pretrained(..., trust_remote_code=True)` to load our optimized weights directly from HuggingFace (see the minimal example below).
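As a quick start, the sketch below loads one of the checkpoints used later in this README and runs a forward pass. It assumes the repository ships a compatible tokenizer and that outputs follow the standard `transformers` convention (`last_hidden_state`); adjust the repo name for other model families.

```python
# Minimal loading sketch. Assumes Synthyra/ESM2-150M ships a compatible tokenizer
# and returns outputs with the standard `last_hidden_state` attribute.
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("Synthyra/ESM2-150M", trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained("Synthyra/ESM2-150M", trust_remote_code=True)

inputs = tokenizer("MALWMRLLPLLALLALWGPDPAAA", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```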
We maintain a comprehensive HuggingFace Collection of optimized models. Below is a summary of the supported families and their origins.
| Model Family | Organization | Official Implementation | FastPLMs Optimization | Checkpoints |
|---|---|---|---|---|
| E1 | Profluent Bio | Profluent-Bio/E1 | Flex Attention, Block-Causal | 150M, 300M, 600M |
| ESM2 | Meta AI | facebookresearch/esm | Flash (SDPA) / Flex Attention | 8M, 35M, 150M, 650M, 3B |
| ESM++ | EvolutionaryScale | EvolutionaryScale/esm | Optimized SDPA / Flex | Small (300M), Large (600M) |
| DPLM | ByteDance | bytedance/dplm | Diffusion Optimized Attention | 150M, 650M, 3B |
| DPLM2 | ByteDance | bytedance/dplm | Multimodal Diffusion | 150M, 650M, 3B |
| Boltz2 | MIT / Various | jwohlwend/boltz | Optimized Structure Prediction | Standard |
We use PyTorch's Scaled Dot Product Attention (SDPA) as the default backend for most models. It provides significant speedups over native implementations while maintaining stability across different GPU architectures.
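For reference, SDPA is exposed in PyTorch as `torch.nn.functional.scaled_dot_product_attention`. The sketch below shows the raw call; shapes and dtypes are illustrative only, not FastPLMs internals.

```python
# Illustrative only: calling PyTorch's fused SDPA kernel directly.
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 8, 512, 64
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# PyTorch dispatches to Flash Attention / memory-efficient kernels when available.
out = F.scaled_dot_product_attention(q, k, v)  # (batch, heads, seq_len, head_dim)
```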
Flex Attention is a cutting-edge mechanism in PyTorch that allows for:
- Dynamic Masking: Efficiently ignoring padding without redundant compute.
- Custom Patterns: Supporting specialized masks (like E1's block-causal) with native performance (see the sketch after this list).
- Extreme Speed: When combined with `torch.compile`, Flex Attention often provides the best possible throughput.
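To illustrate how such masks are expressed, here is a minimal sketch using PyTorch's `flex_attention` and `create_block_mask` APIs (PyTorch 2.5+). The block size and block-causal rule are assumptions for illustration, not E1's exact definition or FastPLMs' internal code.

```python
# Illustrative block-causal mask with PyTorch Flex Attention (PyTorch >= 2.5).
# BLOCK and the mask rule are assumptions, not E1's exact definition.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

BLOCK = 128  # hypothetical block size

def block_causal(b, h, q_idx, kv_idx):
    # Each query attends to keys in its own block and all earlier blocks.
    return (q_idx // BLOCK) >= (kv_idx // BLOCK)

batch, heads, seq_len, head_dim = 1, 8, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

mask = create_block_mask(block_causal, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len)
out = flex_attention(q, k, v, block_mask=mask)  # wrap in torch.compile for best throughput
```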
To enable Flex Attention:
```python
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("Synthyra/ESMplusplus_small", trust_remote_code=True)
config.attn_backend = "flex"
model = AutoModel.from_pretrained("Synthyra/ESMplusplus_small", config=config, trust_remote_code=True)
```

The `EmbeddingMixin` (shared across all models) provides a standardized way to extract representations from proteins.
The `Pooler` class aggregates per-residue representations into a single fixed-size sequence-level vector. Supported strategies include:
- `mean`: Mask-aware average of all residues.
- `cls`: The first token's representation (standard for classification).
- `max`: Element-wise maximum across the sequence.
- `var` / `std`: Variance or standard deviation of representations.
- `norm`: L2 normalization.
- `median`: Element-wise median.
- `parti`: Experimental PageRank-based attention pooling.
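For intuition, mask-aware mean pooling can be expressed as follows (a standalone sketch of the idea, not the `Pooler` implementation itself):

```python
# Illustrative mask-aware mean pooling (not the actual Pooler class).
import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len), 1 = real residue.
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # sum over real residues only
    counts = mask.sum(dim=1).clamp(min=1)                        # avoid division by zero
    return summed / counts                                       # (batch, hidden)
```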
Ideal for embedding millions of sequences when you need to stream results to disk or avoid running out of RAM.
```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("Synthyra/ESM2-150M", trust_remote_code=True).cuda()
sequences = ["MALWMRLLPLLALLALWGPDPAAA", "MKTIIALSYIFCLVFA", ...]

# Embed and store in SQLite
model.embed_dataset(
    sequences=sequences,
    batch_size=64,
    pooling_types=['mean', 'cls'],  # Concatenates both
    sql=True,
    sql_db_path='large_protein_db.db',
    embed_dtype=torch.float32
)
```

Perfect for medium-sized datasets that fit in memory.
```python
# Embed and return as a dictionary
embeddings = model.embed_dataset(
    sequences=sequences,
    batch_size=128,
    pooling_types=['mean'],
    save=True,
    save_path='my_embeddings.pth'
)

# Access an embedding
seq_vector = embeddings["MALWMRLLPLLALLALWGPDPAAA"]  # torch.Tensor
```

Concatenate multiple pooled representations for richer downstream features.
```python
# Use a variety of pooling types
embeddings = model.embed_dataset(
    sequences=sequences,
    pooling_types=['mean', 'max', 'std', 'var'],  # All 4 concatenated
    batch_size=32,
    full_embeddings=False
)

# Resulting vector size: 4 * hidden_size
print(embeddings[sequences[0]].shape)
```

FastPLMs includes a robust CLI-based testing suite under `testing/`.
- Compliance Checks: Verify that optimized models match reference outputs.
  ```bash
  py -m testing.run_compliance --families esm2
  ```
- Throughput Benchmarks: Measure tokens/sec and peak memory.
  ```bash
  py -m testing.run_throughput --device cuda --lengths 512,1024
  ```
- Run Everything: Execute the full suite across all families.
  ```bash
  py -m testing.run_all --full-models
  ```
Results are saved to `testing/results/<timestamp>/` as `metrics.json`, `metrics.csv`, and high-resolution plots.
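For quick inspection of a run, the saved files can be read with standard tooling. The timestamp directory below is a hypothetical placeholder, and the exact metric columns depend on the run.

```python
# Inspect a benchmark run; the directory name is a hypothetical placeholder.
import json
import pandas as pd

run_dir = "testing/results/2024-01-01_00-00-00"
with open(f"{run_dir}/metrics.json") as f:
    metrics = json.load(f)
df = pd.read_csv(f"{run_dir}/metrics.csv")
print(df.head())
```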
```bash
git clone https://github.com/Synthyra/FastPLMs.git
cd FastPLMs
pip install -r requirements.txt
```

```bash
# Build the image
docker build -t fastplms-test -f Dockerfile .

# Run benchmarks inside the container
docker run --rm --gpus all -it -v ${PWD}:/workspace fastplms-test \
    python -m testing.run_throughput --device cuda
```

Found a bug or have a feature request? Please open a GitHub Issue. We are actively looking for contributions to optimize more pLM architectures!