Skip to content

AbsarRaashid3/BayaanX

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 

Repository files navigation

🌍 BayaanX - Arabic to English Neural Machine Translation

BayaanX is a Neural Machine Translation (NMT) system that translates Arabic text into English using a Transformer-based Sequence-to-Sequence model. It is trained on the Helsinki-NLP/tatoeba_mt dataset and includes preprocessing, training, inference, and a web-based translation interface using Streamlit.


📌 Features

Preprocessing pipeline to generate training and evaluation data
Custom Vocabulary Builder with tokenization
Transformer-based Encoder-Decoder model
Training script with batch processing and multi-head attention
Inference script for sentence translation
Interactive Web App for real-time translation


🚀 Installation & Setup

1️⃣ Clone the Repository

git clone https://github.com/AbsarRaashid3/NMT-Arabic-English.git
cd NMT-Arabic-English

2️⃣ Install Dependencies

Make sure you have Python 3.8+ installed. Then, install the required packages:

pip install -r requirements.txt

3️⃣ Download & Preprocess the Dataset

Generate training and test pairs:

python src/preprocess.py --output_file data/train_pairs.txt --split validation
python src/preprocess.py --output_file data/test_pairs.txt --split test

4️⃣ Build Vocabulary

python src/vocab.py --pairs_file data/train_pairs.txt --src_vocab_file src/src_vocab.pkl --tgt_vocab_file src/tgt_vocab.pkl

5️⃣ Train the Model

python src/train.py --pairs_file data/train_pairs.txt --src_vocab_file src/src_vocab.pkl --tgt_vocab_file src/tgt_vocab.pkl --epochs 50 --batch_size 32

6️⃣ Run Inference

To translate an Arabic sentence:

python src/infer.py --model_checkpoint transformer_nmt.pt --src_vocab_file src/src_vocab.pkl --tgt_vocab_file src/tgt_vocab.pkl --input_sentence "يا له من مغامر !"

7️⃣ Run the Web App

Launch the Streamlit web interface for translation:

streamlit run src/app.py

🎯 Model Architecture

The translation model is based on the Transformer architecture using Multi-Head Attention and Positional Encoding.

It includes:

Encoder: Multi-layer Transformer Encoder Decoder: Transformer Decoder with attention to the encoder’s outputs Token Embeddings and Positional Encoding Beam Search / Greedy Search for inference

📊 Training Details

Dataset: Helsinki-NLP/tatoeba_mt (Arabic-English)
Model Size: 2-layer Transformer with 256 hidden units
Optimizer: Adam (lr=1e-4)
Loss Function: Cross-Entropy
Batch Size: 32
Epochs: 50

✨ Credits & Acknowledgments

Hugging Face Datasets – for providing the Tatoeba Arabic-English dataset PyTorch – for the deep learning framework Streamlit – for the interactive UI

📌 Developed by Absar Raashid

Some Results:

NMT11 NMT2 NMT3

About

BayaanX is a Transformer-based Neural Machine Translation (NMT) model for translating Arabic to English. Built with PyTorch, it features multi-head attention, positional encoding, and a Streamlit-based web interface. The project includes data preprocessing, vocabulary building, training, and inference scripts for seamless translation.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages