This project implements UniEmbed, a unified approach to detecting SQL injection attacks that fuses several Natural Language Processing embedding techniques with Machine Learning classifiers. It is based on the paper *UniEmbed: A Novel Approach to Detect XSS and SQL Injection Attacks Leveraging Multiple Feature Fusion with Machine Learning Techniques* (Bakır, 2025), and provides a reproducible analysis in the form of a Jupyter notebook.
- Multi-Feature Embedding: Simultaneously leverages Word2Vec, FastText, and the Universal Sentence Encoder (USE) to extract, enrich, and combine semantic representations of SQL queries.
- State-of-the-Art Modeling: Trains a suite of ML classifiers including MLP, Random Forest, SVM, Logistic Regression, KNN, Naive Bayes, Decision Tree, and voting ensembles.
- Rigorous Evaluation: Assesses detection performance using standard metrics (accuracy, F1, AUC, etc.) and visualization (ROC, confusion matrices).
- Extensible Framework: Clean, modular Python code ready for expansion to XSS or other text-based attacks.
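The classifier suite listed above could be assembled roughly as follows with scikit-learn. This is a minimal sketch, not the notebook's actual code: the helper names (`build_classifiers`, `build_voting_ensemble`), every hyperparameter, the choice of ensemble members, and the synthetic data standing in for fused embeddings are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_classifiers(seed=42):
    """Return one instance per classifier family (hyperparameters are assumptions)."""
    return {
        "mlp": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=seed),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "svm": SVC(probability=True, random_state=seed),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "knn": KNeighborsClassifier(n_neighbors=5),
        "naive_bayes": GaussianNB(),
        "decision_tree": DecisionTreeClassifier(random_state=seed),
    }


def build_voting_ensemble(classifiers, members=("mlp", "random_forest", "logistic_regression")):
    """Soft-voting ensemble over a subset of base classifiers (member choice is an assumption)."""
    return VotingClassifier(
        estimators=[(name, classifiers[name]) for name in members],
        voting="soft",  # average predicted probabilities across members
    )


# Synthetic features stand in for the fused embeddings (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = build_classifiers()
models["voting"] = build_voting_ensemble(build_classifiers())

test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
```

Soft voting needs every member to expose `predict_proba`, which is why the sketch picks members that do so by default; adding the SVM to the ensemble works here only because it is built with `probability=True`.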
Traditional web application security approaches often fail to detect sophisticated, obfuscated, or novel forms of SQL injection. UniEmbed fuses three complementary NLP embeddings: Word2Vec (word-level), FastText (subword-level, built from character n-grams), and USE (sentence-level). Together they expose both surface and semantic patterns of a query to the classifier, with the aim of improving detection over any single embedding.
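The fusion idea can be illustrated with plain NumPy. The sketch below assumes that fusion means concatenating the three per-query vectors, which this README does not state explicitly. The toy lookup tables and the random sentence vector are placeholders for trained Word2Vec and FastText models and for a USE embedding, and `mean_token_vector` and `fuse_features` are hypothetical helper names.

```python
import numpy as np


def mean_token_vector(tokens, lookup, dim):
    """Average the vectors of the tokens found in `lookup`; zeros if none are found."""
    vectors = [lookup[t] for t in tokens if t in lookup]
    if not vectors:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0).astype(np.float32)


def fuse_features(word_vec, subword_vec, sentence_vec):
    """Fuse three views of a query by concatenation (the fusion operator is an assumption)."""
    return np.concatenate([word_vec, subword_vec, sentence_vec])


# Toy lookup tables standing in for trained Word2Vec / FastText models (assumption).
rng = np.random.default_rng(0)
vocab = ["select", "*", "from", "users", "where", "id", "=", "1", "or"]
word_table = {t: rng.normal(size=4).astype(np.float32) for t in vocab}
subword_table = {t: rng.normal(size=6).astype(np.float32) for t in vocab}

query = "SELECT * FROM users WHERE id = 1 OR 1=1--"
tokens = query.lower().split()  # naive whitespace tokenizer, for illustration only

word_vec = mean_token_vector(tokens, word_table, dim=4)
subword_vec = mean_token_vector(tokens, subword_table, dim=6)
sentence_vec = rng.normal(size=8).astype(np.float32)  # placeholder for a USE embedding

fused = fuse_features(word_vec, subword_vec, sentence_vec)
print(fused.shape)  # (18,)
```

In a real run the vector sizes would be those of the trained models rather than 4, 6, and 8; the fused vector's length is simply their sum.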
You will need the SQL Injection dataset by SAJID576 from Kaggle. Download the CSV file and place it in the root directory of the project or specify the correct path in the notebook.
Sample Format:
| Sentence | Label |
|---|---|
| `SELECT * FROM users WHERE id = 1` | 0 |
| `SELECT * FROM users WHERE id = 1 OR 1=1--` | 1 |
| ... | ... |
- `0` = benign query
- `1` = malicious (SQL injection) query
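Loading a file in this format might look like the sketch below. It assumes the column names `Sentence` and `Label` from the sample above; `load_sqli_dataset` is a hypothetical helper, and the in-memory CSV stands in for the downloaded Kaggle file.

```python
import io

import pandas as pd


def load_sqli_dataset(source):
    """Read the CSV, drop incomplete or duplicate rows, and coerce labels to int."""
    df = pd.read_csv(source)
    missing = {"Sentence", "Label"} - set(df.columns)
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    df = df[["Sentence", "Label"]].dropna().drop_duplicates().copy()
    df["Sentence"] = df["Sentence"].astype(str).str.strip()
    df["Label"] = df["Label"].astype(int)
    return df.reset_index(drop=True)


# In-memory sample mirroring the table above; pass a file path for the real dataset.
sample_csv = io.StringIO(
    "Sentence,Label\n"
    "SELECT * FROM users WHERE id = 1,0\n"
    "SELECT * FROM users WHERE id = 1 OR 1=1--,1\n"
)
df = load_sqli_dataset(sample_csv)
print(df["Label"].value_counts().to_dict())
```

Checking the class balance right after loading is a cheap way to confirm that the labels were parsed as expected before any embedding is trained.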
.
├── UniEmbed_SQLi.ipynb # Main Jupyter notebook implementation
├── sqli_dataset.csv # Place your Kaggle dataset here
├── models/ # Trained embedding models (saved after run)
├── results/ # Evaluation results and artifacts
└── README.md # You're reading this!
- Python 3.8 or newer
- pandas, numpy, scikit-learn, gensim, matplotlib, seaborn
- tensorflow, tensorflow-hub (for Universal Sentence Encoder)
Install them using:
pip install pandas numpy scikit-learn gensim matplotlib seaborn tensorflow tensorflow-hub
- Download the dataset: Download the "SQL Injection" dataset from Kaggle and place it in your project directory.
- Open and run the notebook: Open `UniEmbed_SQLi_Detection.ipynb` in Jupyter Notebook or JupyterLab, and run each cell in order.
- Configure paths: If your dataset is not named `SQLi_Dataset.csv` or is located elsewhere, change the path at the data loading step.
- Explore the results: The notebook will output accuracy, F1, confusion matrices, and ROC curves, and compare all feature extraction methods and classifiers.
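The reported metrics can be reproduced for any classifier's scores with scikit-learn, as in this minimal sketch. The helper `evaluate_binary`, the 0.5 decision threshold, and the toy labels and scores are assumptions for illustration, not values from the notebook.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    roc_auc_score,
    roc_curve,
)


def evaluate_binary(y_true, y_score, threshold=0.5):
    """Compute accuracy, F1, AUC, confusion matrix, and ROC points from scores."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    y_pred = (y_score >= threshold).astype(int)  # threshold is an assumption
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]),
        "roc_curve": (fpr, tpr),
    }


# Toy labels (1 = malicious) and predicted probabilities of the malicious class.
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2]

report = evaluate_binary(y_true, y_score)
print(report["confusion_matrix"].tolist())  # [[2, 1], [1, 2]]
```

The `fpr` and `tpr` arrays in `report["roc_curve"]` can be passed directly to `matplotlib.pyplot.plot` to draw the ROC curve, and the confusion matrix to `seaborn.heatmap`.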
- UniEmbed Fusion: Outperforms individual embedding techniques (Word2Vec, FastText, USE) on SQL injection detection metrics.
- MLP & Voting Classifiers: Reach near-perfect accuracy and F1 scores, with very few false positives or false negatives, in experiments that follow the setup of the published paper.
- Visualization: ROC curves and confusion matrices make classifier performance easy to compare at a glance, as clear, publication-ready plots.
Bakır, R. (2025). UniEmbed: A Novel Approach to Detect XSS and SQL Injection Attacks Leveraging Multiple Feature Fusion with Machine Learning Techniques. Arabian Journal for Science and Engineering.
- SAJID576 for the open SQL injection dataset.
- The TensorFlow and Gensim teams for state-of-the-art NLP tools.
- The original author for describing the hybrid feature fusion approach in detail.
Enjoy exploring the frontier of secure web application research with UniEmbed!