ProteinConformers

Official repository for "ProteinConformers: Benchmark Dataset for Simulating Protein Conformational Landscape Diversity and Plausibility" and "ProteinConformers: large-scale and energetically profiled descriptions of protein conformational landscapes".

Overview

ProteinConformers provides data loaders, sampling pipelines, and evaluation utilities for benchmarking generative models of protein conformations. The repository currently includes reference implementations for BioEmu- and ESMdiff-based samplers and a suite of downstream metrics covering free-energy estimation, population coverage, and structural plausibility.

Environment Setup

To generate decoy structures, users must install and correctly configure the environments for AlphaFlow, BioEmu, ESMdiff, AFsample2, and AlphaFold3. This repository includes turnkey sampling pipelines for BioEmu and ESMdiff only; decoys produced by AlphaFlow, AFsample2, and AlphaFold3 must be generated externally and then provided to this pipeline.

Python environment (uv)

The project is managed with uv. Install uv if it is not already available:

curl -LsSf https://astral.sh/uv/install.sh | sh

Create the base environment (Python 3.10 or 3.11):

uv sync

The sync step resolves all core dependencies defined in pyproject.toml and produces a .venv directory in the project root. Use uv run to execute repository commands inside this environment, for example:

uv run python tools/tools_generate_conformations.py --help

Optional: BioEmu ColabFold backend

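BioEmu relies on a patched ColabFold installation for structure refinement. The following steps create the auxiliary environment and apply the required modifications. The absolute paths shown (under /mnt/rna01/chenw) are site-specific examples, so substitute your own, and set SCRIPT_DIR to the directory that contains modules.patch and batch.patch before running the patch commands: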

* conda create -n colabfold_env python=3.10
* conda activate colabfold_env
* pip install uv
* export VENV_FOLDER=/mnt/rna01/chenw/anaconda3/envs/colabfold_env
* uv pip install --python ${VENV_FOLDER}/bin/python 'colabfold[alphafold-minus-jax]==1.5.4' 
* uv pip install --python ${VENV_FOLDER}/bin/python --force-reinstall "jax[cuda12]"==0.4.35 "numpy==1.26.4"
* export SITE_PACKAGES_DIR=${VENV_FOLDER}/lib/python3.10/site-packages
* patch ${SITE_PACKAGES_DIR}/alphafold/model/modules.py ${SCRIPT_DIR}/modules.patch 
* patch ${SITE_PACKAGES_DIR}/colabfold/batch.py ${SCRIPT_DIR}/batch.patch
* touch ${VENV_FOLDER}/.COLABFOLD_PATCHED
* Use the environment path as BIOEMU_COLABFOLD_DIR (in this example, `/mnt/rna01/chenw/anaconda3/envs/colabfold_env`).
* In your BioEmu checkout, edit `src/bioemu/get_embeds.py` (in this example, `/mnt/rna01/chenw/WorkSpace_Bio/bioemu/src/bioemu/get_embeds.py`) and replace the line `return subprocess.run(cmd, env=colabfold_env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)` with `return subprocess.run(['conda', "run", "-n", "colabfold_env", *cmd], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)`, so that the ColabFold command runs inside the conda environment created above.
* pip install esm==3.0.4
* pip install -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple tokenizers
* pip install -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple transformers
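
The `get_embeds.py` edit in the list above amounts to prefixing the ColabFold command with `conda run -n colabfold_env`. The following is a minimal sketch of that idea, not the BioEmu source: the helper names `wrap_with_conda_run` and `run_in_conda_env` are hypothetical, and the example command in the usage note is only an illustration.

```python
import subprocess
from typing import List, Sequence


def wrap_with_conda_run(cmd: Sequence[str], env_name: str = "colabfold_env") -> List[str]:
    """Prefix a command so that it executes inside the named conda environment."""
    return ["conda", "run", "-n", env_name, *cmd]


def run_in_conda_env(cmd: Sequence[str], env_name: str = "colabfold_env") -> subprocess.CompletedProcess:
    """Run the wrapped command, merging stderr into stdout as in the edited line."""
    return subprocess.run(
        wrap_with_conda_run(cmd, env_name),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
```

For example, `wrap_with_conda_run(["colabfold_batch", "in.fasta", "out_dir"])` yields a list starting with `conda run -n colabfold_env`, which is what the edited `subprocess.run` call receives.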

Set the following environment variables before invoking the sampler:

export BIOEMU_COLABFOLD_DIR=/mnt/rna01/chenw/anaconda3/envs/colabfold_env
export CUDA_HOME=/mnt/apps/cuda_12.1.0
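
Because both variables must point at real directories on your machine, a quick pre-flight check can save a failed sampling run. The sketch below is an optional helper that is not part of the repository; the function name `check_sampler_env` is hypothetical.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional

# Variables named in the setup instructions above.
REQUIRED_VARS = ("BIOEMU_COLABFOLD_DIR", "CUDA_HOME")


def check_sampler_env(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return a list of human-readable problems; an empty list means all is well."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name}={value} is not an existing directory")
    return problems


if __name__ == "__main__":
    for problem in check_sampler_env():
        print(problem)
```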

Optional: ESMdiff environment

The ESMdiff baseline requires additional model checkpoints and configuration files. Please consult the documentation located in configs/esmdiff for detailed installation and usage instructions.

Usage

Sampling protein conformations

BioEmu sampler

uv run python tools/tools_generate_conformations.py \
    --fasta_file_path benchmark_seqs.fasta \
    --sampler_type bioemu \
    --sample_size 3000 \
    --save_path ./bioemu

ESMdiff sampler

uv run python tools/tools_generate_conformations.py \
    --fasta_file_path benchmark_seqs.fasta \
    --sampler_type esmdiff \
    --sample_size 3000 \
    --save_path ./esmdiff \
    --ckpt_path /mnt/rna01/chenw/WorkSpace_Bio/esmdiff/data/ckpt/release_v0.pt \
    --sample_mode ddpm \
    --sample_steps 1000 \
    --model_config_path ./configs/esmdiff/experiment/mdlm.yaml

Evaluation utilities

Comprehensive Decoy Ensemble Evaluation

The src/eval module provides a comprehensive pipeline for evaluating the quality and similarity of protein decoy ensembles against ground truth conformational ensembles (typically from MD simulations). The evaluation includes multiple metrics inspired by the ESMDiff paper and common structural bioinformatics practices:

Available Metrics:

Jensen-Shannon Divergence (JS-Div):
- JS-PwD: Based on C-alpha pairwise distance distributions
- JS-Rg: Based on Radius of Gyration distributions
- JS-TIC: Based on Time-lagged Independent Components (derived from pairwise distances)
Ensemble Coverage:
- RMSD-ens: Average minimum C-alpha RMSD of GT structures to the generated ensemble
- TM-ens: Average maximum TM-score of GT structures to the generated ensemble
Structural Validity:
- Validity_Model: Fraction of clash-free structures in the generated ensemble
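
To make two of the metrics above concrete, here is a minimal NumPy sketch of JS-Rg and RMSD-ens. It is an illustration under stated assumptions, not the code in src/eval: inputs are assumed to be C-alpha coordinate arrays of shape (n_frames, n_atoms, 3) with matching atom order, the histogram bin count is an arbitrary choice, and the function names are hypothetical.

```python
import numpy as np


def radius_of_gyration(coords: np.ndarray) -> np.ndarray:
    """Per-frame radius of gyration for coords of shape (n_frames, n_atoms, 3)."""
    centered = coords - coords.mean(axis=1, keepdims=True)
    return np.sqrt((centered ** 2).sum(axis=2).mean(axis=1))


def js_divergence(p_samples: np.ndarray, q_samples: np.ndarray, n_bins: int = 50) -> float:
    """Jensen-Shannon divergence (log base 2, so in [0, 1]) between two 1-D
    samples, estimated from histograms over a shared range."""
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p, _ = np.histogram(p_samples, bins=n_bins, range=(lo, hi))
    q, _ = np.histogram(q_samples, bins=n_bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a: np.ndarray, b: np.ndarray) -> float:
        mask = a > 0  # b = m is positive wherever a is positive
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def kabsch_rmsd(a: np.ndarray, b: np.ndarray) -> float:
    """RMSD between two (n_atoms, 3) arrays after optimal rigid superposition."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    u, s, vt = np.linalg.svd(a.T @ b)
    # Flip the smallest singular value if the optimal transform is a reflection.
    s[-1] *= np.sign(np.linalg.det(u @ vt))
    msd = ((a ** 2).sum() + (b ** 2).sum() - 2.0 * s.sum()) / len(a)
    return float(np.sqrt(max(msd, 0.0)))


def rmsd_ens(gt: np.ndarray, generated: np.ndarray) -> float:
    """Average, over ground-truth frames, of the minimum RMSD to any generated frame."""
    return float(np.mean([min(kabsch_rmsd(g, x) for x in generated) for g in gt]))


def js_rg(gt: np.ndarray, generated: np.ndarray, n_bins: int = 50) -> float:
    """JS divergence between the Rg distributions of the two ensembles."""
    return js_divergence(radius_of_gyration(gt), radius_of_gyration(generated), n_bins)
```

Lower values are better for both quantities. JS-PwD follows the same pattern with pairwise distances in place of Rg.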

Additional Dependencies for Evaluation:

# Install evaluation-specific dependencies
pip install biopython deeptime

# Download and compile TMalign (required for TM-ens metric)
# Get TMalign from: https://zhanggroup.org/TM-align/

Running Evaluation:

# Basic evaluation example
uv run python src/eval/eval_decoy_metrics.py \
    --protein_id T1033 \
    --model_name esmdiff \
    --native_filter all \
    --model_decoys_root_dir /path/to/model/decoys \
    --native_pdb_root_dir /path/to/native/pdbs \
    --gt_decoys_root_dir /path/to/ground/truth \
    --tmalign_path /path/to/TMalign_cpp \
    --output_dir /path/to/output

# Options:
# --native_filter: Filter decoys by TM-score to native (all/near/non)
# --tmscore_threshold: TM-score threshold for filtering (default: 0.5)
# --skip_tica: Skip JS-TIC calculation (time-consuming)
# --skip_tmens: Skip TM-ens calculation (requires TMalign)
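
The `--native_filter` option splits decoys by their TM-score to the native structure. A minimal sketch of that selection logic is shown below. It assumes TM-scores have already been computed, that "near" means a score at or above the threshold and "non" means below it, and that the comparison is inclusive; these details and the function name are assumptions, so treat eval_decoy_metrics.py as the reference.

```python
from typing import Dict, List


def filter_by_native_tm(
    tm_to_native: Dict[str, float],
    mode: str = "all",
    threshold: float = 0.5,
) -> List[str]:
    """Select decoy names by TM-score to native.

    mode="all"  keeps every decoy,
    mode="near" keeps decoys with TM-score >= threshold (assumed inclusive),
    mode="non"  keeps decoys with TM-score < threshold.
    """
    if mode == "all":
        return sorted(tm_to_native)
    if mode == "near":
        return sorted(name for name, tm in tm_to_native.items() if tm >= threshold)
    if mode == "non":
        return sorted(name for name, tm in tm_to_native.items() if tm < threshold)
    raise ValueError(f"unknown native_filter mode: {mode!r}")
```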

For detailed documentation and examples, see src/eval/readme_eval.md.

Other Evaluation Tools

Compute the free-energy landscape:

uv run python tools/tools_calculate_free_energy_landscape.py
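
For orientation, a free-energy landscape is commonly estimated by Boltzmann inversion of a histogram over two collective variables, F = -k_B T ln P. The sketch below shows that generic recipe only. The collective variables, bin count, temperature, and units used by the repository script are not documented here, so every parameter and the function name are assumptions.

```python
import numpy as np

KB_KCAL_PER_MOL_K = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def free_energy_surface(x, y, bins=50, temperature=300.0):
    """Estimate F(x, y) = -k_B T ln P(x, y) from samples of two collective variables.

    Returns (F, x_edges, y_edges). F is shifted so that its minimum is 0 kcal/mol;
    bins that were never visited are +inf.
    """
    hist, x_edges, y_edges = np.histogram2d(x, y, bins=bins)
    prob = hist / hist.sum()
    with np.errstate(divide="ignore"):  # log(0) -> -inf for empty bins
        free_energy = -KB_KCAL_PER_MOL_K * temperature * np.log(prob)
    free_energy -= free_energy[np.isfinite(free_energy)].min()
    return free_energy, x_edges, y_edges
```

Two landscapes computed on the same bin edges can then be compared bin by bin, which is the kind of input an overlap measure needs.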

Compute the free-energy landscape overlap:

uv run python tools/tools_calculate_fel_overlpas.py --generated-file-path /mnt/dna01/library2/caspdynamics/generated_data

Compute the plausibility scores (PCPM, PCPS, CGM, CGMS):

# step 1, generate pcpm
uv run python tools/tools_pdb_list_to_pre_pcpm.py # you need to modify the script
uv run python tools/tools_pre_pcpm_to_pcpm.py # you need to modify the script

# Run step 1 for both your own conformations and the ProteinConformers reference
# to obtain a PCPM for each. You can then calculate the divergence between your
# PCPM and the ground-truth (ProteinConformers) PCPM.

# step 2, generate pcps
uv run python tools/tools_pcpm_to_pcps.py # you need to modify the script

# step 3, generate CGM
uv run python tools/tools_pre_pcpm_to_CGM.py # you need to modify the script

# step 4, generate CGMS
uv run python tools/tools_CGM_to_CGMS.py # you need to modify the script

Compute the Ramachandran outliers:

uv run python tools/tools_calculate_rama_outlier_rates.py # you need to modify the script

Citation

If you use ProteinConformers in your research, please cite:

@inproceedings{ProteinConformers,
  author    = {Yihang Zhou and Chen Wei and Matthew M. Sun and Jin Song and Yang Li and Lin Wang and Yang Zhang},
  title     = {ProteinConformers: Benchmark Dataset for Simulating Protein Conformational Landscape Diversity and Plausibility},
  booktitle = {Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)},
  year      = {2025},
  note      = {Poster},
  url       = {https://neurips.cc/virtual/2025/poster/121755},
  doi       = {},
  publisher = {}
}
