Official repository for "ProteinConformers: Benchmark Dataset for Simulating Protein Conformational Landscape Diversity and Plausibility" and "ProteinConformers: large-scale and energetically profiled descriptions of protein conformational landscapes".
ProteinConformers provides data loaders, sampling pipelines, and evaluation utilities for benchmarking generative models of protein conformations. The repository currently includes reference implementations for BioEmu- and ESMdiff-based samplers and a suite of downstream metrics covering free-energy estimation, population coverage, and structural plausibility.
To generate decoy structures, users must install and correctly configure the environments for AlphaFlow, BioEmu, ESMdiff, AFsample2, and AlphaFold3. This repository includes turnkey sampling pipelines for BioEmu and ESMdiff only; decoys produced by AlphaFlow, AFsample2, and AlphaFold3 must be generated externally and then provided to this pipeline.
The project is managed with uv. Install uv if it is not already available:
curl -LsSf https://astral.sh/uv/install.sh | sh
Create the base environment (Python 3.10 or 3.11):
uv sync
The sync step resolves all core dependencies defined in pyproject.toml and produces a .venv directory in the project root. Use uv run to execute repository commands inside this environment, for example:
uv run python tools/tools_generate_conformations.py --help
BioEmu relies on a patched ColabFold installation for structure refinement. The following steps create the auxiliary environment and apply the required modifications:
* conda create -n colabfold_env python=3.10
* conda activate colabfold_env
* pip install uv
* export VENV_FOLDER=/mnt/rna01/chenw/anaconda3/envs/colabfold_env
* uv pip install --python ${VENV_FOLDER}/bin/python 'colabfold[alphafold-minus-jax]==1.5.4'
* uv pip install --python ${VENV_FOLDER}/bin/python --force-reinstall "jax[cuda12]"==0.4.35 "numpy==1.26.4"
* export SITE_PACKAGES_DIR=${VENV_FOLDER}/lib/python3.10/site-packages
* patch ${SITE_PACKAGES_DIR}/alphafold/model/modules.py ${SCRIPT_DIR}/modules.patch
* patch ${SITE_PACKAGES_DIR}/colabfold/batch.py ${SCRIPT_DIR}/batch.patch
* touch ${VENV_FOLDER}/.COLABFOLD_PATCHED
* The value to use for BIOEMU_COLABFOLD_DIR is the environment path, here `/mnt/rna01/chenw/anaconda3/envs/colabfold_env` (a site-specific path; adjust it to your own installation)
* Edit `/mnt/rna01/chenw/WorkSpace_Bio/bioemu/src/bioemu/get_embeds.py` (again a site-specific path) and replace the line `return subprocess.run(cmd, env=colabfold_env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)` with `return subprocess.run(["conda", "run", "-n", "colabfold_env", *cmd], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)`
* pip install esm==3.0.4
* pip install -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple tokenizers
* pip install -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple transformers
Set the following environment variables before invoking the sampler:
export BIOEMU_COLABFOLD_DIR=/mnt/rna01/chenw/anaconda3/envs/colabfold_env
export CUDA_HOME=/mnt/apps/cuda_12.1.0
The ESMdiff baseline requires additional model checkpoints and configuration files. Please consult the documentation located in configs/esmdiff for detailed installation and usage instructions.
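Before launching the BioEmu sampler, it can help to confirm that the two environment variables above are set and that the ColabFold environment carries the `.COLABFOLD_PATCHED` marker created during setup. The following is a minimal sketch, not part of the repository; the function name `check_sampler_env` is hypothetical.

```python
import os
from pathlib import Path

# Variables named in the setup instructions above.
REQUIRED_VARS = ("BIOEMU_COLABFOLD_DIR", "CUDA_HOME")


def check_sampler_env(env=None):
    """Return a list of problems found; an empty list means the checks passed.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    This helper is illustrative and is not shipped with ProteinConformers.
    """
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} points to a missing directory: {value}")

    # The setup steps create this marker file after patching ColabFold.
    colabfold_dir = env.get("BIOEMU_COLABFOLD_DIR")
    if colabfold_dir and Path(colabfold_dir).is_dir():
        marker = Path(colabfold_dir) / ".COLABFOLD_PATCHED"
        if not marker.exists():
            problems.append(f"marker file .COLABFOLD_PATCHED not found in {colabfold_dir}")
    return problems
```

Calling `check_sampler_env()` with no arguments inspects the current shell environment; printing the returned list before running the sampler gives an early, readable failure instead of an error deep inside the pipeline.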
BioEmu sampler
uv run python tools/tools_generate_conformations.py \
--fasta_file_path benchmark_seqs.fasta \
--sampler_type bioemu \
--sample_size 3000 \
--save_path ./bioemu
ESMdiff sampler
uv run python tools/tools_generate_conformations.py \
--fasta_file_path benchmark_seqs.fasta \
--sampler_type esmdiff \
--sample_size 3000 \
--save_path ./esmdiff \
--ckpt_path /mnt/rna01/chenw/WorkSpace_Bio/esmdiff/data/ckpt/release_v0.pt \
--sample_mode ddpm \
--sample_steps 1000 \
--model_config_path ./configs/esmdiff/experiment/mdlm.yaml
The src/eval module provides a comprehensive pipeline for evaluating the quality and similarity of protein decoy ensembles against ground truth conformational ensembles (typically from MD simulations). The evaluation includes multiple metrics inspired by the ESMDiff paper and common structural bioinformatics practices:
Available Metrics:
- Jensen-Shannon Divergence (JS-Div):
  - JS-PwD: Based on C-alpha pairwise distance distributions
  - JS-Rg: Based on Radius of Gyration distributions
  - JS-TIC: Based on Time-lagged Independent Components (derived from pairwise distances)
- Ensemble Coverage:
  - RMSD-ens: Average minimum C-alpha RMSD of GT structures to the generated ensemble
  - TM-ens: Average maximum TM-score of GT structures to the generated ensemble
- Structural Validity:
  - Validity_Model: Fraction of clash-free structures in the generated ensemble
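The distribution and coverage metrics listed above can be illustrated with a few lines of NumPy. This is a minimal sketch under stated assumptions and is not the code in src/eval: the function names, the 50-bin histogram, and the base-2 logarithm are choices made here for illustration, and the repository may use different binning or report the Jensen-Shannon distance (the square root) instead of the divergence. The coverage functions assume a precomputed matrix whose rows are GT structures and whose columns are generated structures.

```python
import numpy as np


def radius_of_gyration(coords):
    """Rg per frame for C-alpha coordinates of shape (n_frames, n_atoms, 3)."""
    centered = coords - coords.mean(axis=1, keepdims=True)
    return np.sqrt((centered ** 2).sum(axis=2).mean(axis=1))


def js_divergence(x, y, bins=50):
    """Jensen-Shannon divergence (base 2, in [0, 1]) between two 1-D samples.

    Both samples are histogrammed on a shared grid spanning their joint range.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    lo = min(x.min(), y.min())
    hi = max(x.max(), y.max())
    if hi == lo:  # all values identical: the distributions coincide
        return 0.0
    edges = np.linspace(lo, hi, bins + 1)
    p = np.histogram(x, bins=edges)[0].astype(float)
    q = np.histogram(y, bins=edges)[0].astype(float)
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # bins with a == 0 contribute nothing to the sum
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))


def rmsd_ens(rmsd_matrix):
    """Mean over GT structures (rows) of the minimum RMSD to any generated structure."""
    return float(np.asarray(rmsd_matrix, dtype=float).min(axis=1).mean())


def tm_ens(tm_matrix):
    """Mean over GT structures (rows) of the maximum TM-score to any generated structure."""
    return float(np.asarray(tm_matrix, dtype=float).max(axis=1).mean())
```

A JS-Rg style value would then be `js_divergence(radius_of_gyration(gt_coords), radius_of_gyration(model_coords))`, where `gt_coords` and `model_coords` are hypothetical coordinate arrays loaded by the user. JS-PwD follows the same pattern with flattened pairwise distances in place of Rg.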
Additional Dependencies for Evaluation:
# Install evaluation-specific dependencies
pip install biopython deeptime
# Download and compile TMalign (required for TM-ens metric)
# Get TMalign from: https://zhanggroup.org/TM-align/
Running Evaluation:
# Basic evaluation example
uv run python src/eval/eval_decoy_metrics.py \
--protein_id T1033 \
--model_name esmdiff \
--native_filter all \
--model_decoys_root_dir /path/to/model/decoys \
--native_pdb_root_dir /path/to/native/pdbs \
--gt_decoys_root_dir /path/to/ground/truth \
--tmalign_path /path/to/TMalign_cpp \
--output_dir /path/to/output
# Options:
# --native_filter: Filter decoys by TM-score to native (all/near/non)
# --tmscore_threshold: TM-score threshold for filtering (default: 0.5)
# --skip_tica: Skip JS-TIC calculation (time-consuming)
# --skip_tmens: Skip TM-ens calculation (requires TMalign)
For detailed documentation and examples, see src/eval/readme_eval.md.
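The `--native_filter` and `--tmscore_threshold` options described above select decoys by their TM-score to the native structure. The sketch below shows one plausible reading of that selection. It is not the repository's implementation: the function name is hypothetical, and treating "near" as scores at or above the threshold (with "non" as strictly below) is an assumption about how the boundary is handled.

```python
import numpy as np


def filter_by_native_tmscore(tm_to_native, mode="all", threshold=0.5):
    """Return indices of decoys kept under the given filter mode.

    tm_to_native: TM-score of each decoy against the native structure.
    mode: "all" keeps every decoy, "near" keeps scores >= threshold,
          "non" keeps scores < threshold (boundary handling is assumed).
    """
    tm = np.asarray(tm_to_native, dtype=float)
    if mode == "all":
        mask = np.ones(tm.shape, dtype=bool)
    elif mode == "near":
        mask = tm >= threshold
    elif mode == "non":
        mask = tm < threshold
    else:
        raise ValueError(f"unknown mode: {mode!r} (expected all, near, or non)")
    return np.flatnonzero(mask)
```

Returning indices rather than filtered scores lets the same selection be applied to the decoy file list and to any per-decoy feature arrays.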
- Compute the free-energy landscape:
uv run python tools/tools_calculate_free_energy_landscape.py
- Compute the energy overlap:
uv run python tools/tools_calculate_fel_overlpas.py --generated-file-path /mnt/dna01/library2/caspdynamics/generated_data
- Compute the plausibility scores (PCPM, PCPS, CGM, CGMS):
# step 1, generate pcpm
uv run python tools/tools_pdb_list_to_pre_pcpm.py # you need to modify the script
uv run python tools/tools_pre_pcpm_to_pcpm.py # you need to modify the script
# Run step 1 twice: once for your own conformations and once for the
# ProteinConformers conformations. With both PCPMs available, you can calculate
# the divergence between your PCPM and the ground-truth (ProteinConformers) PCPM.
# step 2, generate pcps
uv run python tools/tools_pcpm_to_pcps.py # you need to modify the script
# step 3, generate CGM
uv run python tools/tools_pre_pcpm_to_CGM.py # you need to modify the script
# step 4, generate CGMS
uv run python tools/tools_CGM_to_CGMS.py # you need to modify the script
- Compute the Ramachandran outliers:
uv run python tools/tools_calculate_rama_outlier_rates.py # you need to modify the script
If you use ProteinConformers in your research, please cite:
@inproceedings{ProteinConformers,
author = {Yihang Zhou and Chen Wei and Matthew M. Sun and Jin Song and Yang Li and Lin Wang and Yang Zhang},
title = {ProteinConformers: Benchmark Dataset for Simulating Protein Conformational Landscape Diversity and Plausibility},
booktitle = {Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)},
year = {2025},
note = {Poster},
url = {https://neurips.cc/virtual/2025/poster/121755},
doi = {},
publisher = {}
}