This repository features an enhanced implementation of PiFold that achieves a sequence recovery rate of ~55%, a measurable improvement over the original baseline (51.66%). In addition to the optimized model, I have included several collaborative experiments in inverse folding and protein design, carried out in partnership with E. Alcaide, R. Klypa, and C. Liu. We hope our experience helps others.
This is a curated CATH 4.3 dataset for PiFold (an updated version of the CATH 4.2 set by Ingraham et al., NeurIPS 2019).
This new version uses better structures (PDB-REDO), more chains, and the latest CATH release. Gaps are included (marked "-"), missing regions are marked "X" with NaN coordinates, and tags and cases with large missing regions have been removed.
I think the CATH dataset should be used only for preliminary testing, not for comparison (many of the structures are just domains and are not designable at all!).
Preprocessed data and splits can be found in cathPi.tgz:
- chain_set.jsonl Max sequence length 500 aa
- chain_set_splits.json Test: 1422 Train: 18960 Validation: 1436
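To sanity-check the split sizes listed above after unpacking the archive, a minimal sketch along these lines may help. It assumes the files follow the layout of the original Ingraham et al. data: one JSON object per line with a `name` field in `chain_set.jsonl`, and a JSON object mapping split names to lists of chain names in `chain_set_splits.json`. Those field names are assumptions, not something this README states.

```python
import json

def load_chain_names(jsonl_path):
    """Return the 'name' field of every entry in a JSON-lines file.

    Assumption: each non-empty line is a JSON object with a 'name' key.
    """
    names = []
    with open(jsonl_path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                names.append(json.loads(line)["name"])
    return names

def split_sizes(splits_path, known_names):
    """Count, per split, how many listed names exist in the chain set.

    Assumption: the splits file maps split names to lists of chain names.
    Non-list values are ignored.
    """
    with open(splits_path) as handle:
        splits = json.load(handle)
    known = set(known_names)
    return {
        split: sum(1 for name in members if name in known)
        for split, members in splits.items()
        if isinstance(members, list)
    }

# Example usage (paths are whatever the archive unpacks to):
# names = load_chain_names("chain_set.jsonl")
# print(split_sizes("chain_set_splits.json", names))
```

If the counts differ from the numbers above, some names in the splits file are probably absent from the chain set.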
Original CATH 4.2:
wget -r -nd -np http://people.csail.mit.edu/ingraham/graph-protein-design/data/cath/ -P data/cath_4.2
sed -i s/VRSYDPULGCA/VRSYDPCLGCA/ chain_set.jsonl
sed -i s/IENVASLUGTT/IENVASLCGTT/ chain_set.jsonl
sed -i s/TQSULURVQ/TQSCLCRVQ/ chain_set.jsonl
sed -i s/RLDPSEYABVKAQFLVRAN/RLDPSEYARVKAQFLVRAN/ chain_set.jsonl
sed -i s/PGMGVOGPETSL/PGMGVKGPETSL/ chain_set.jsonl
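The `sed` commands above replace non-standard residue letters (U, B, O) in specific sequences. A short scan like the following can confirm that none remain. It is only a sketch: the `seq` and `name` field names and the choice to tolerate "-" and "X" as markers are assumptions.

```python
import json

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
# Assumption: "-" (gap) and "X" (missing residue) are accepted markers.
ALLOWED_MARKERS = set("-X")

def nonstandard_letters(seq, allowed=STANDARD_AA | ALLOWED_MARKERS):
    """Return (index, letter) pairs for letters outside the allowed set."""
    return [(i, aa) for i, aa in enumerate(seq) if aa not in allowed]

def scan_jsonl(path):
    """Map entry name -> non-standard letters found in its 'seq' field."""
    hits = {}
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue
            entry = json.loads(line)
            found = nonstandard_letters(entry["seq"])
            if found:
                hits[entry["name"]] = found
    return hits

# Example usage:
# print(scan_jsonl("chain_set.jsonl"))  # an empty dict means nothing left to fix
```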
From ProteinMPNN:
wget https://files.ipd.uw.edu/pub/training_sets/pdb_2021aug02.tar.gz
place in data/pdb_2021aug02
mkdir data/mpnn_data
A downloadable JSON version can be found here: mpnn.tgz
CATH:
python main.py --ex_name CATH --data_name CATH
python main.py --ex_name CATHPI --data_name CATHPI
PDB
python main.py --ex_name MPNN --data_name MPNN
Using JSON:
python main.py --ex_name MPNNj --data_name MPNNj --k_neighbors 30
python main.py --from_pretrained best48_CATHPI/checkpoint.pth --pdb testNR156/4HJO.pdb.gz --chain A
PDB: test/4HJO.pdb.gz A
Warning!! protein with gaps(-) or missing atoms (X):
True LLRILKETEFKKIKVLGSGAFGTVYKGLWIP----VKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGICL
Pred SLALLKEEEYTKEEVLGTYDFGTVYYGYWTP----PSFPVAIKELYANVSPLDKEAILELAEVMASVDHENVARLLGIHL
* .*** *. * .***. ***** * * *XXXX ******* ** . **. * ******. .* ***** *
True TSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAARNVLVKTPQHVKITDFGLAKLL
Pred SDTIKLVTELYPLGCLLDYVREHKETIGAKTLLNWCVQVAAGADYLAKHNLIHGDLAAANILVETPEHVKIADFGLAEIL
. *..*.*.* * ***********. **.. *******.* * .** *.* **** *.**.**.**** *****..*
True GAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTID
Pred GGFEKAYHGVGQSRSTRWMALETIKHRKFTHKSDVWSFGVTVWELLTFGEEPFAGIPDEEIADILEAGERLPKPPISTDE
* ** ** * .*****.* ** .**.*****.*******.*** .*. *** **. *** *****.*** * .
True VYMIMRKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGD
Pred VYEIIDDCWQKDADKRPRFKELIDTFSKMAKDPEKYLHIEGH
** *. ** *** **.*.*** *****.**..** *.*
Recovery: 0.60 Nssr: 0.75
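For reference, sequence recovery is the fraction of positions where the predicted residue matches the native one. The sketch below assumes that positions marked "-" or "X" in the native sequence are excluded from the count; whether `main.py` does exactly this is an assumption. Nssr is not reproduced here because its definition is not given in this README.

```python
def sequence_recovery(true_seq, pred_seq, skip="-X"):
    """Fraction of scored positions where the prediction matches the native residue.

    Assumption: positions whose native letter is in `skip` are not scored.
    """
    if len(true_seq) != len(pred_seq):
        raise ValueError("sequences must have the same length")
    scored = [(t, p) for t, p in zip(true_seq, pred_seq) if t not in skip]
    if not scored:
        return 0.0
    matches = sum(1 for t, p in scored if t == p)
    return matches / len(scored)

# Toy example (not taken from the output above):
# sequence_recovery("ACD-", "ACE-") scores 3 positions and matches 2 of them.
```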
Test Non Redundant benchmark SPIN-CGNN
python main.py --from_pretrained best48_CATHPI/checkpoint.pth --data_name PDBDIR --pdbs_dir testNR156
CATH 4.2 test from pdb files
cp list_cath.txt list.txt
python main.py --from_pretrained best48_CATHPI/checkpoint.pth --data_name PDBDIR --pdbs_dir cath_test
PDB_struct
cd cath_test; cp list_PDBstruc_test.txt list.txt; cd ..
python main.py --from_pretrained best48_CATHPI/checkpoint.pth --data_name PDBDIR --pdbs_dir cath_test
LATEST PDBS
python main.py --from_pretrained best48_MPNN/checkpoint.pth --data_name PDBDIR --pdbs_dir latestPDBs
We explored an optimization of PiFold using the entropy-based refining strategy proposed in ProRefiner (Nature Comm. 14, 7434, 2023). However, in our specific implementation and testing environment, this integration did not yield a performance improvement over the baseline.
python main.py --gpu_id 2 --from_pretrained best48_CATHPI/checkpoint.pth --data_name CATHPI --jsonl 1
./results/pifold_results_all.jsonl --> new chain_set.jsonl including logits
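The general idea behind entropy-based selection can be illustrated as follows: per-residue logits are turned into probabilities, and only positions with low predictive entropy are kept as confident context for a refinement step. This is a minimal sketch of that idea, not the ProRefiner code or the implementation in this repository; the function names and the `max_entropy` threshold are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confident_positions(per_residue_logits, max_entropy):
    """Indices whose predictive entropy is at or below max_entropy.

    `max_entropy` is a hypothetical tuning parameter, not a repository flag.
    """
    return [
        i
        for i, logits in enumerate(per_residue_logits)
        if entropy(softmax(logits)) <= max_entropy
    ]
```

A peaked distribution gives entropy near zero, while a uniform distribution over 20 amino acids gives ln(20), about 3.0 nats, so a useful threshold lies somewhere in between.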
python main.py --ex_name CATHPI_lm03s --data_name CATHPI --num_encoder_layers 6 --epoch 50 --optim AdamW --batch_size 8 --checkpoint 1 --num_heads 8 --num_workers 8 --lm_model 1 --hidden_dim 256 --num_heads 16 --w_decay 0.01 --dropout 0.1 --node_adjdirect 1 --lm_mask 0.3 --lm_perm 0.025 --edge_angle 1 --edge_direct 1 --cayley 1 --edge_dist 1 --lr 1e-3 --rotary_posemb 1
This is a vector version, as in VFN; it is slightly better but comes at a huge memory cost.
python main.py --ex_name best30_CATHPI_cong --data_name CATHPI --epoch 50 --node_direct 1 --method PiFold2v --node_adjdirect 1
Here we explore an approach similar to the one described in the VFN paper (https://arxiv.org/pdf/2310.11802) to refine the PiFold predictions with an ESM model. This improves recovery up to 61-62%. To this end:
- Generate jsonl with log_probs
python main.py --from_pretrained best48_CATHPI/checkpoint.pth --data_name CATHPI --jsonl 1
python main.py --from_pretrained best30_MPNN/checkpoint.pth --data_name MPNNj --jsonl 1
- Replace the original ESM code:
cd refiner
pip install -e .
copy refiner/esm2_file/esm2.py to the esm2 directory
- Train
python experiments/train.py experiment.name="best48_CATHPI" experiment.num_epoch=3 experiment.weightLoss=False model.esm2_model=2 experiment.batch_size=1 data.jsonl_path="/home/pablo/PiFold2/results/pifold_results_all.jsonl"
- Inference
python experiments/inference.py experiment.name="best48_CATHPI" experiment.weightLoss=False model.esm2_model=2 data.jsonl_path="/home/pablo/PiFold2/results/pifold_results_all.jsonl" experiment.warm_start="./ckpt/best48_CATHPI/28D_03M_2025Y_11h_07m_43s/step_56880.pth"
Metrics
python ../metrics.py LORA ./inference_outputs/28D_03M_2025Y_14h_40m_46s/best48_CATHPI_results.json
python main.py --from_pretrained best48_CATHPI/checkpoint.pth --data_name CATH --json 1
python main.py --from_pretrained best48_CATHPI/checkpoint.pth --data_name CATHPI --json 1
python main.py --from_pretrained best48_MPNN/checkpoint.pth --data_name MPNN --json 1
python main.py --from_pretrained best48_CATHPI/checkpoint.pth --data_name PDBDIR --pdbs_dir cath_test --json 1
python metrics.py (plots from results/pifold_results.json)
conda create --name pifold python=3.12 gcc
conda activate pifold
conda install -c conda-forge biopython
conda update --all
conda install tqdm -c conda-forge
conda install 'pandas<3.0.0'
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu129
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.8.0+cu129.html
pip install joblib einops
conda install -c conda-forge scikit-learn
conda install -c conda-forge matplotlib
conda clean -a
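After installation, a quick check like the one below can confirm that the packages are importable in the active environment. It is a sketch: the module names are the usual import names for the packages installed above (for example `Bio` for biopython, `sklearn` for scikit-learn, `torch_scatter` for torch-scatter), and the helper name `missing_modules` is hypothetical.

```python
import importlib.util

# Import names corresponding to the packages installed above.
REQUIRED = [
    "Bio", "tqdm", "pandas", "torch", "torchvision", "torchaudio",
    "torch_scatter", "joblib", "einops", "sklearn", "matplotlib",
]

def missing_modules(modules=REQUIRED):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

# Example usage:
# print(missing_modules())  # an empty list means everything was found
```

This only checks that the modules can be located; it does not verify versions or CUDA support.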
If you are interested in our repository and our paper, please cite the following paper:
@article{gao2023pifold,
title={PiFold: Toward effective and efficient protein inverse folding},
author={Zhangyang Gao and Cheng Tan and Pablo Chacón and Stan Z. Li},
year={2023},
eprint={2209.12643},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
If you have any issues with this work, please feel free to contact us.