AllerTrans

A Deep Learning Method for Predicting the Allergenicity of Protein Sequences

Overview

Allergens are a major concern in protein safety, especially with the growing use of recombinant proteins in medical products. Traditional allergenicity tests are costly and time-consuming, prompting the need for efficient bioinformatics solutions. In this study, we developed an enhanced deep learning model that classifies proteins as allergenic or non-allergenic based on their sequences. Our method extracts features using two protein language models and combines them in a deep neural network, followed by ensemble modeling to improve performance. The proposed model achieved strong results: 97.91% sensitivity, 97.69% specificity, 97.80% accuracy, and a 99% AUC using five-fold cross-validation.
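The sensitivity, specificity, and accuracy figures above follow the usual definitions over a binary confusion matrix. As a minimal sketch of those definitions (the function name `classification_metrics` and the counts are illustrative assumptions, not values or code from the paper):

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute sensitivity, specificity, and accuracy from confusion-matrix counts.

    tp / fn: allergens predicted correctly / incorrectly.
    tn / fp: non-allergens predicted correctly / incorrectly.
    """
    return {
        "sensitivity": tp / (tp + fn),            # true positive rate
        "specificity": tn / (tn + fp),            # true negative rate
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fn=10, tn=80, fp=20)
print(metrics)
```

Under five-fold cross-validation, one common convention is to compute these values per fold and average them; how the paper aggregates folds is not stated here.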

DOI: https://doi.org/10.1093/biomethods/bpaf040

Online Prediction Tool

You can try the AllerTrans model directly on Hugging Face Spaces: https://huggingface.co/spaces/sfaezella/AllerTrans

A comprehensive flowchart that includes all of our experiments

Repository Structure

For transparency, this repository includes all the experiments, feature extraction, modeling notebooks, and tools necessary to reproduce the AllerTrans workflow.

AllerTrans/
├── notebooks/                  # All Jupyter notebooks, organized by workflow
│   ├── feature-extraction/     # Notebooks for extracting protein feature vectors
│   │   ├── 1.ESM-v2-embeddings.ipynb
│   │   ├── 2.ProtT5-embeddings.ipynb
│   │   └── 3.AAC-feature-vectors.ipynb
│   ├── modeling/               # Notebooks for training and evaluating models
│   │   ├── 1D-CNN.ipynb
│   │   ├── classic-machine-learning.ipynb
│   │   ├── nonlinear-DNN.ipynb
│   │   └── single-layer-LSTM.ipynb
│   └── additional-experiments/ # Supplementary experiments
├── src/                        # Inference code: run predictions on your own FASTA sequences with a single command
│   ├── allertrans/             # CLI and scripts for end-to-end inference
│   │   ├── __main__.py         # Entry point for CLI execution
│   │   ├── cli.py              # Command-line interface
│   │   ├── model.py            # Loading models and running predictions
│   │   └── utils.py            # Helper functions and utilities
│   ├── checkpoints/            # Pretrained model weights
│   ├── examples/               # Example input and output files
│   ├── extract.py              # Script for ESM-2 embedding
│   ├── prott5_embedder.py      # Script for ProtT5 embedding
│   └── run_all.py              # Script to run full workflow
├── images/                     # Figures and diagrams used in notebooks or README
├── inference-app/              # Code for the web-based prediction tool hosted on Hugging Face Spaces
└── requirements.txt            # Python dependencies

Notebooks in detail:

feature-extraction
- 1.ESM-v2-embeddings.ipynb: Extracts embeddings with the ESM-v2 model. Input: protein sequences in FASTA format.
- 2.ProtT5-embeddings.ipynb: Extracts embeddings with the ProtT5 model. Input: protein sequences in FASTA format.
- 3.AAC-feature-vectors.ipynb: Generates amino acid composition (AAC) feature vectors. Input: protein sequences in FASTA format.

modeling
- classic-machine-learning.ipynb: Training and evaluation of classic machine learning models, including SVM, RF, XGBoost, and KNN. This notebook also tests the effect of hyperparameter tuning and of the autoencoder.
- nonlinear-DNN.ipynb: Training and evaluation of our top-performing deep neural network models, using ESM-v2 and ProtT5 embeddings and AAC feature vectors. Running the evaluations and reproducing the results requires the pretrained model weights in src/checkpoints/.
- single-layer-LSTM.ipynb: Training and evaluation of a single-layer LSTM (Long Short-Term Memory) model.
- 1D-CNN.ipynb: Training and evaluation of a 1-dimensional CNN (Convolutional Neural Network) model.
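Amino acid composition is the simplest of the three feature types listed above: the fraction of each of the 20 standard amino acids in a sequence. The sketch below illustrates that general definition. It is not the code from the AAC notebook, and both the function name `aac_vector` and the choice to ignore non-standard residues are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac_vector(sequence: str) -> list:
    """Return the fraction of each standard amino acid, in AMINO_ACIDS order."""
    counts = Counter(sequence.upper())
    # Assumption: non-standard symbols (X, B, Z, ...) are ignored.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# First ten residues of Sequence_1 from the example input.
vector = aac_vector("MQEAGAVKFD")
print(len(vector), vector[AMINO_ACIDS.index("A")])
```

The result is a fixed-length 20-dimensional vector regardless of sequence length, which is why it can be fed to a classifier next to fixed-size embeddings.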

General AllerTrans Model Architecture

Dataset

This study uses the training and validation sets of the public AlgPred 2.0 dataset, which are available here.

CLI Usage for Inference

1. Install Requirements

git clone https://github.com/faezesarlakifar/AllerTrans.git
cd AllerTrans
pip install -r requirements.txt

A CPU-only installation of torch is fine for running inference.

2. Run Predictions

cd src
python run_all.py --fasta examples/protein_sequences.fasta --output examples/predictions.csv

--fasta: Path to your input FASTA file (single or multi-sequence).
--output: CSV file to save predictions.

Example Input

File: examples/protein_sequences.fasta

>Sequence_1
MQEAGAVKFDIKNQCGYTVWAAGLPGGGKRLDQGQTWTVNLAAGTASARFWGRTGCTFDASGKGSCQTGDCGRQLSCTVSGAVPATLAEYTQSDQDY
>Sequence_2
MSIQQIIEQKIQKEFQPHFLAIENESHLHHSNRGSESHFKCVIVSADFKNIRKVQRHQRIYQLLNEEL...
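Before running predictions on your own file, it can help to confirm that it parses as FASTA. The following is a minimal standalone sketch, not part of the AllerTrans CLI. The helper `parse_fasta` is hypothetical, and treating the first token after `>` as the record ID is an assumption about how IDs are derived.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA record ID to its sequence, joining wrapped lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without an identifier")
            current_id = fields[0]  # assumption: ID is the first token
            if current_id in records:
                raise ValueError(f"duplicate record ID: {current_id}")
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[current_id] += line.upper()
    return records


# Hypothetical two-record input; the second sequence wraps over two lines.
example = [">Seq_A\n", "MQEAG\n", ">Seq_B description\n", "MSIQ\n", "QIIE\n"]
records = parse_fasta(example)
print(records)
```

To check a file on disk, pass an open handle instead of a list, for example `with open("examples/protein_sequences.fasta") as handle: records = parse_fasta(handle)`.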

Example Output

File: examples/predictions.csv

id	prediction
Sequence_1	Potential Allergen
Sequence_2	Non-Allergen

To classify your own sequences, pass the path to your FASTA file to --fasta in place of examples/protein_sequences.fasta.
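Once a predictions file exists, a quick tally of the labels is often the first thing to check. The sketch below assumes the two columns shown above (`id`, `prediction`) and a comma delimiter, inferred only from the `.csv` extension. If your file turns out to be tab-separated, pass `delimiter="\t"`. The function name `count_predictions` is hypothetical.

```python
import csv
import io
from collections import Counter


def count_predictions(text: str, delimiter: str = ",") -> Counter:
    """Count how many sequences received each prediction label."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return Counter(row["prediction"] for row in reader)


# In-memory sample mirroring the example output; the delimiter is an assumption.
sample = (
    "id,prediction\n"
    "Sequence_1,Potential Allergen\n"
    "Sequence_2,Non-Allergen\n"
)
summary = count_predictions(sample)
print(summary)
```

For a real file, read its contents first, for example `count_predictions(open("examples/predictions.csv").read())`.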

Citation

If our work contributes to your research, please cite:

@ARTICLE{AllerTrans2025,
  author  = {Sarlakifar, Faezeh and Malek, Hamed and Allahyari Fard, Najaf},
  title   = {AllerTrans: a deep learning method for predicting the allergenicity of protein sequences},
  journal = {Biology Methods and Protocols},
  year    = {2025},
  volume  = {10},
  number  = {1},
  doi     = {10.1093/biomethods/bpaf040}
}
