chaconlab/PiFold2

PiFold 2

This repository features an enhanced implementation of PiFold that achieves a sequence recovery rate of ~55%, a measurable improvement over the original baseline (51.66%). In addition to the optimized model, I have included several collaborative experiments in inverse folding and protein design, carried out in partnership with E. Alcaide, R. Klypa, and C. Liu. We hope our experience helps others.

Datasets

CATH 4.3 Pifold

This is a curated CATH 4.3 dataset for PiFold (an updated version of the CATH 4.2 set by Ingraham et al., NeurIPS 2019). The new version uses better structures (PDB-REDO), contains more chains, and is based on the latest CATH release. It marks gaps with "-" and missing regions with "X" (NaN coordinates), and it removes tags and cases with large missing regions.
I think the CATH dataset should be used only for preliminary testing, not for comparison (many of the structures are just domains, not designable at all!).

Preprocessed data and splits can be found here: cathPi.tgz:

- chain_set.jsonl   Max sequence length 500 aa
- chain_set_splits.json   Test: 1422 Train: 18960 Validation: 1436
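
A minimal sketch of how these two files could be read. It assumes they keep the layout of the original CATH 4.2 release by Ingraham et al.: one JSON object per line with `name` and `seq` fields, and a splits file with `train`, `validation` and `test` lists of chain names. The function name `load_split` is hypothetical and is not part of this repository.

```python
import json


def load_split(jsonl_path, splits_path, split="test", max_len=500):
    """Return the entries of one split, keeping sequences up to max_len residues."""
    with open(splits_path) as fh:
        # Assumed keys: "train", "validation", "test"
        wanted = set(json.load(fh)[split])

    entries = []
    with open(jsonl_path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            # Assumed fields: "name" (chain id) and "seq" (one-letter sequence)
            if entry["name"] in wanted and len(entry["seq"]) <= max_len:
                entries.append(entry)
    return entries
```

Reading the file line by line avoids holding the whole jsonl in memory before filtering.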

Original CATH 4.2:

wget -r -nd -np http://people.csail.mit.edu/ingraham/graph-protein-design/data/cath/ -P data/cath_4.2

sed -i s/VRSYDPULGCA/VRSYDPCLGCA/ chain_set.jsonl
sed -i s/IENVASLUGTT/IENVASLCGTT/ chain_set.jsonl
sed -i s/TQSULURVQ/TQSCLCRVQ/ chain_set.jsonl 
sed -i s/RLDPSEYABVKAQFLVRAN/RLDPSEYARVKAQFLVRAN/ chain_set.jsonl 
sed -i s/PGMGVOGPETSL/PGMGVKGPETSL/ chain_set.jsonl
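
The sed commands above replace non-standard residue letters (U, B, O) with standard ones in five sequence fragments. Where sed is not available, the same substitutions could be sketched in Python as follows. The fragment pairs are copied from the commands above; the helper names `fix_line` and `fix_file` are hypothetical.

```python
import os

# Fragment replacements copied from the sed commands above.
REPLACEMENTS = {
    "VRSYDPULGCA": "VRSYDPCLGCA",
    "IENVASLUGTT": "IENVASLCGTT",
    "TQSULURVQ": "TQSCLCRVQ",
    "RLDPSEYABVKAQFLVRAN": "RLDPSEYARVKAQFLVRAN",
    "PGMGVOGPETSL": "PGMGVKGPETSL",
}


def fix_line(line):
    """Replace the first occurrence of each fragment, as sed does without the g flag."""
    for old, new in REPLACEMENTS.items():
        line = line.replace(old, new, 1)
    return line


def fix_file(path):
    """Rewrite the file in place, streaming line by line through a temporary file."""
    tmp_path = path + ".tmp"
    with open(path) as src, open(tmp_path, "w") as dst:
        for line in src:
            dst.write(fix_line(line))
    os.replace(tmp_path, path)
```

For example, `fix_file("chain_set.jsonl")` would apply all five substitutions in one pass.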

PDB

From ProteinMPNN:

wget https://files.ipd.uw.edu/pub/training_sets/pdb_2021aug02.tar.gz
place in data/pdb_2021aug02
mkdir data/mpnn_data

A downloadable JSON version can be found here: mpnn.tgz

Training

CATH:

python main.py --ex_name CATH --data_name CATH 
python main.py --ex_name CATHPI --data_name CATHPI

PDB

python main.py --ex_name MPNN  --data_name MPNN

Using JSON:
python main.py --ex_name MPNNj  --data_name MPNNj --k_neighbors 30

Testing

Single PDB (must be gzipped)
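
If a structure file is not compressed yet, it could be gzipped from Python with the standard library, as in this small sketch. The helper name `gzip_file` is hypothetical, and the path in the usage note is only an example.

```python
import gzip
import shutil


def gzip_file(src_path):
    """Write a gzipped copy next to the input file and return the new path."""
    dst_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the bytes, no full read into memory
    return dst_path
```

Calling `gzip_file("my_structure.pdb")` would create `my_structure.pdb.gz`, which can then be passed to `--pdb`.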


python main.py  --from_pretrained best48_CATHPI/checkpoint.pth --pdb testNR156/4HJO.pdb.gz --chain A


PDB: test/4HJO.pdb.gz A

 Warning!! protein with gaps(-) or missing atoms (X):

 True   LLRILKETEFKKIKVLGSGAFGTVYKGLWIP----VKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGICL
 Pred   SLALLKEEEYTKEEVLGTYDFGTVYYGYWTP----PSFPVAIKELYANVSPLDKEAILELAEVMASVDHENVARLLGIHL
         * .*** *. * .***.  ***** * * *XXXX   *******    **   . **. * ******. .* ***** *
 True   TSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAARNVLVKTPQHVKITDFGLAKLL
 Pred   SDTIKLVTELYPLGCLLDYVREHKETIGAKTLLNWCVQVAAGADYLAKHNLIHGDLAAANILVETPEHVKIADFGLAEIL
        . *..*.*.* * ***********. **.. *******.* * .**    *.* **** *.**.**.**** *****..*
 True   GAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTID
 Pred   GGFEKAYHGVGQSRSTRWMALETIKHRKFTHKSDVWSFGVTVWELLTFGEEPFAGIPDEEIADILEAGERLPKPPISTDE
        *  ** **  *     .*****.* ** .**.*****.*******.*** .*. ***  **. *** *****.*** * .
 True   VYMIMRKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGD
 Pred   VYEIIDDCWQKDADKRPRFKELIDTFSKMAKDPEKYLHIEGH
        ** *.  **  *** **.*.***  *****.**..** *.* 

 Recovery: 0.60 Nssr: 0.75
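
For readers who want to check a recovery value by hand, a minimal sketch follows. It assumes recovery is the fraction of identical residues between the True and Pred sequences, and that gap ("-") and missing ("X") positions in the native sequence are excluded. Whether main.py treats those positions exactly this way is an assumption, and the function name `sequence_recovery` is hypothetical.

```python
def sequence_recovery(true_seq, pred_seq, skip="-X"):
    """Fraction of positions where pred matches true, ignoring skipped native symbols."""
    if len(true_seq) != len(pred_seq):
        raise ValueError("sequences must have the same length")
    # Keep only positions whose native residue is a real amino acid
    pairs = [(t, p) for t, p in zip(true_seq, pred_seq) if t not in skip]
    if not pairs:
        return 0.0
    return sum(t == p for t, p in pairs) / len(pairs)


print(sequence_recovery("ACD-EF", "ACE-EF"))  # 4 of 5 scored positions match: 0.8
```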

From PDB files

Test Non Redundant benchmark SPIN-CGNN

python main.py --from_pretrained best48_CATHPI/checkpoint.pth --data_name PDBDIR --pdbs_dir testNR156

CATH 4.2 test from pdb files

cp list_cath.txt list.txt
python main.py  --from_pretrained best48_CATHPI/checkpoint.pth --data_name PDBDIR --pdbs_dir cath_test

PDB_struct

cd cath_test; cp list_PDBstruc_test.txt list.txt; cd ..
python main.py  --from_pretrained best48_CATHPI/checkpoint.pth --data_name PDBDIR --pdbs_dir cath_test

Latest PDBs

python main.py  --from_pretrained best48_MPNN/checkpoint.pth --data_name PDBDIR --pdbs_dir latestPDBs

LM version

We explored an optimization of PiFold using the entropy-based refining strategy proposed in ProRefiner (Nature Comm. 14, 7434, 2023). However, in our specific implementation and testing environment, this integration did not yield a performance improvement over the baseline.
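
To illustrate the general idea behind entropy-based selection, the sketch below computes the predictive entropy of each residue from its logits and keeps the positions below a threshold, which could then serve as confident context when refining the rest. This is an illustrative sketch only: it is neither the ProRefiner code nor the implementation in this repository, and the function names and the threshold value are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one residue's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def confident_positions(logits_per_residue, max_entropy):
    """Indices of residues whose predictive entropy is at most max_entropy."""
    return [
        i
        for i, logits in enumerate(logits_per_residue)
        if entropy(softmax(logits)) <= max_entropy
    ]
```

With the 20 amino-acid classes, a uniform prediction has entropy ln(20), roughly 3.0 nats, so any useful threshold sits well below that.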

python main.py --gpu_id 2 --from_pretrained best48_CATHPI/checkpoint.pth  --data_name CATHPI --jsonl 1
./results/pifold_results_all.jsonl --> new chain_set.jsonl including logits 


python main.py --ex_name CATHPI_lm03s --data_name CATHPI --num_encoder_layers 6 --epoch 50 --optim AdamW --batch_size 8 --checkpoint 1 --num_heads 8 --num_workers 8  --lm_model 1 --hidden_dim 256 --num_heads 16 --w_decay 0.01 --dropout 0.1  --node_adjdirect 1 --lm_mask 0.3 --lm_perm 0.025 --edge_angle 1 --edge_direct 1 --cayley 1 --edge_dist 1 --lr 1e-3 --rotary_posemb 1

Vector version

This is a vector version in the style of VFN. It is slightly better but comes with a huge memory cost.

python main.py  --ex_name best30_CATHPI_cong   --data_name CATHPI  --epoch  50  --node_direct 1  --method PiFold2v --node_adjdirect 1 --node_direct 1

Adding ESM using LORA

Here we explore an approach similar to the one described in the VFN paper (https://arxiv.org/pdf/2310.11802) to refine the predictions made by PiFold with an ESM model. This improves recoveries up to 61-62%. To this end:

  1. Generate jsonl with log_probs
python main.py --from_pretrained best48_CATHPI/checkpoint.pth --data_name CATHPI --jsonl 1
python main.py --from_pretrained best30_MPNN/checkpoint.pth --data_name MPNNj --jsonl 1
  2. Replace original ESM code:
cd refiner
pip install -e .
cp refiner/esm2_file/esm2.py to the esm2 directory
  3. Train
python experiments/train.py experiment.name="best48_CATHPI"  experiment.num_epoch=3 experiment.weightLoss=False model.esm2_model=2 experiment.batch_size=1 data.jsonl_path="/home/pablo/PiFold2/results/pifold_results_all.jsonl"
  4. Inference
python experiments/inference.py experiment.name="best48_CATHPI"  experiment.weightLoss=False model.esm2_model=2 data.jsonl_path="/home/pablo/PiFold2/results/pifold_results_all.jsonl"   experiment.warm_start="./ckpt/best48_CATHPI/28D_03M_2025Y_11h_07m_43s/step_56880.pth"

Metrics

python ../metrics.py LORA ./inference_outputs/28D_03M_2025Y_14h_40m_46s/best48_CATHPI_results.json

Save results and plotting

python main.py --from_pretrained best48_CATHPI/checkpoint.pth --data_name CATH  --json 1
python main.py --from_pretrained best48_CATHPI/checkpoint.pth --data_name CATHPI  --json 1
python main.py --from_pretrained best48_MPNN/checkpoint.pth --data_name MPNN  --json 1
python main.py --from_pretrained best48_CATHPI/checkpoint.pth --data_name PDBDIR --pdbs_dir cath_test --json 1
python metrics.py   # plots from results/pifold_results.json

Instructions for our HPC

conda create --name pifold python=3.12 gcc
conda activate pifold
conda install -c conda-forge biopython
conda update --all
conda install tqdm  -c conda-forge
conda install 'pandas<3.0.0'
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu129
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.8.0+cu129.html
pip install joblib einops
conda install -c conda-forge scikit-learn
conda install -c conda-forge matplotlib
conda clean -a

Citation

If you are interested in our repository and our paper, please cite the following paper:

@article{gao2023pifold,
      title={PiFold: Toward effective and efficient protein inverse folding}, 
      author={Zhangyang Gao and Cheng Tan and Pablo Chacón and Stan Z. Li},
      year={2023},
      eprint={2209.12643},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}

Feedback

If you have any issues with this work, please feel free to contact us.
