This repository contains a from-scratch implementation of key components from:
Biao Xu, Yao Wang, Xiuwu Liao, Kaidong Wang
Efficient Fraud Detection Using Deep Boosting Decision Trees (2023)
The implementation focuses on fraud detection with highly imbalanced data using:
- Soft Decision Trees (SDT)
- Deep Boosting Decision Trees with SGD-style optimization (DBDT-SGD)
- Prototype DBDT-Com-style training scaffold for exploratory comparison
- Imbalance-aware preprocessing and evaluation strategy
The goal is to reproduce and study the DBDT method for credit card fraud detection while preserving:
- tree-like interpretability,
- deep representation capacity,
- robust evaluation on imbalanced data.
- Name: Credit Card Fraud Detection
- Source: Kaggle - mlg-ulb/creditcardfraud
- Access in code: loaded with kagglehub in src/preprocessing.py
- main.ipynb
  Main experimental notebook (data prep, training, threshold tuning, evaluation, plots).
- src/preprocessing.py
  Data loading, train/validation/test splitting, scaling, IQR filtering, SMOTE, plotting, and metrics helpers.
- src/sdt.py
  Soft Decision Tree implementation:
  - inner-node MLP routing,
  - soft/hard forward passes,
  - path probability computation,
  - Xavier initialization.
- src/dbdt.py
  DBDT-SGD implementation:
  - ensemble of SDTs,
  - exponential-loss residual fitting,
  - local + global objective accumulation,
  - regularization terms.
- src/pdsca.py
  Experimental DBDT-Com trainer scaffold:
  - ensemble score aggregation,
  - minibatch training wrapper,
  - parameter updates and test-time scoring.
  Note: this is a simplified prototype and not a full Algorithm-2 (PDSCA) implementation from the paper.
- src/baselines.py
  Baseline model definitions and unified score extraction:
  - logistic regression,
  - random forest,
  - MLP,
  - optional XGBoost / LightGBM.
- src/evaluation.py
  Metric and validation utilities:
  - stratified 10-fold CV for baseline comparisons,
  - AUC / H-measure / F1 / precision / recall summaries.
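The path probability computation listed under src/sdt.py can be illustrated with a small NumPy sketch. It is not the code in this repository: it assumes a complete binary tree stored in breadth-first order and, for brevity, uses one linear router per inner node where the repository uses an MLP. The function name `leaf_path_probabilities` and the parameters `W`, `b` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def leaf_path_probabilities(x, W, b, depth):
    """Return an (n_samples, 2**depth) array of leaf-reaching probabilities.

    x: (n_samples, n_features) inputs.
    W: (2**depth - 1, n_features) and b: (2**depth - 1,) hold one linear
    router per inner node (an assumption; a real SDT may use an MLP here).
    Inner nodes are in breadth-first order: children of node i are 2i+1, 2i+2.
    """
    p_right = sigmoid(x @ W.T + b)           # probability of branching right
    probs = np.ones((x.shape[0], 1))         # every sample reaches the root
    first = 0                                # first inner node of this level
    for level in range(depth):
        width = 2 ** level
        gate = p_right[:, first:first + width]
        # Split each node's mass between its left and right child.
        children = np.stack([probs * (1.0 - gate), probs * gate], axis=2)
        probs = children.reshape(x.shape[0], 2 * width)
        first += width
    return probs

# Toy usage with random routers (illustrative values only).
rng = np.random.default_rng(0)
depth, n_features = 3, 4
x = rng.normal(size=(5, n_features))
W = rng.normal(size=(2 ** depth - 1, n_features))
b = np.zeros(2 ** depth - 1)
leaf_probs = leaf_path_probabilities(x, W, b, depth)
```

Each row of `leaf_probs` is a distribution over the leaves, so it sums to 1. A soft forward pass would typically weight per-leaf outputs by these probabilities.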
Python 3.10+ is recommended.
Install required dependencies:
pip install torch numpy scikit-learn imbalanced-learn matplotlib tqdm kagglehub hmeasure
If running on GPU, install a CUDA-compatible PyTorch build matching your system.
- Open main.ipynb.
- Run cells top to bottom.
- Confirm dataset loads successfully from Kaggle.
- Train and evaluate DBDT_SGD (primary recreated pipeline).
- Run exploratory DBDT-Com prototype cells if needed.
- Run baseline CV summary for contextual comparison.
The full dataset is highly imbalanced, and applying SMOTE can greatly increase training size and compute time.
To keep experiments feasible and reproducible, this project supports using a stratified subset of the data before training:
- preserve class ratio with stratified sampling,
- apply SMOTE only on training data,
- keep validation/test distribution realistic,
- report sizes before and after SMOTE.
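As an illustration of the stratified subsetting described above, here is a minimal NumPy sketch. The helper name `stratified_subset` and the toy labels are hypothetical and not taken from src/preprocessing.py. In the real pipeline, oversampling would then be applied to the training portion only.

```python
import numpy as np

def stratified_subset(y, fraction, seed=0):
    """Return sorted indices of a subset that keeps each class's share of y."""
    rng = np.random.default_rng(seed)
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        # Keep at least one sample so rare classes never vanish.
        n_keep = max(1, int(round(fraction * idx.size)))
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))

# Toy labels in {-1, +1}: 1,000 samples, 2% positives.
y = np.where(np.arange(1000) < 20, 1, -1)
subset = stratified_subset(y, fraction=0.25, seed=42)

# Report sizes and class ratio before and after subsetting.
print("full:", y.size, "positive rate:", (y == 1).mean())
print("subset:", subset.size, "positive rate:", (y[subset] == 1).mean())
```

Sampling per class rather than globally is what keeps the fraud rate of the subset equal to that of the full data, up to rounding.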
This is an intentional experimental constraint and should be documented in the methodology section of any report or presentation.
The paper's reported setup uses larger dedicated hardware and full-data cross-validation. In this repository:
- DBDT experiments are typically run with a stratified holdout protocol due to compute limits.
- Baseline models are compared using stratified 10-fold CV.
- Load data and convert labels to {-1, +1}
- Train/test split (stratified)
- Train/validation split (stratified)
- Standardize selected numeric features
- Remove outliers with IQR (training set)
- Apply SMOTE (training set only)
- Train SDT/DBDT-SGD
- Tune decision threshold on validation set
- Evaluate on test set with metrics and plots
- (Optional) run DBDT-Com prototype and baseline CV for comparison
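The threshold-tuning step in the pipeline above could look roughly like the following sketch. It assumes the selection criterion is validation F1 for the positive class, which this README does not state explicitly. The function names and the toy scores are hypothetical.

```python
import numpy as np

def f1_at_threshold(y_true, scores, threshold):
    """F1 for the +1 class when predicting +1 iff score >= threshold."""
    pred_pos = scores >= threshold
    true_pos = y_true == 1
    tp = np.sum(pred_pos & true_pos)
    fp = np.sum(pred_pos & ~true_pos)
    fn = np.sum(~pred_pos & true_pos)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def tune_threshold(y_val, val_scores):
    """Return the candidate threshold with the highest validation F1."""
    candidates = np.unique(val_scores)  # every distinct score is a candidate
    f1s = np.array([f1_at_threshold(y_val, val_scores, t) for t in candidates])
    best = int(np.argmax(f1s))
    return float(candidates[best]), float(f1s[best])

# Toy validation labels in {-1, +1} and model scores (illustrative only).
y_val = np.array([-1, -1, -1, -1, 1, -1, 1, 1])
val_scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90])
threshold, best_f1 = tune_threshold(y_val, val_scores)
print(threshold, round(best_f1, 3))
```

The chosen threshold is then frozen and applied once to the test scores, so the test set plays no part in selecting it.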
The notebook reports multiple metrics suitable for imbalanced classification:
- AUC
- F1-score
- Precision
- Recall
- H-measure
- Confusion matrix
- ROC curve
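To make two of these metrics concrete, the sketch below computes AUC from its pairwise-ranking definition and builds a confusion matrix for labels in {-1, +1}. It is a small-data illustration with hypothetical helper names, not the code in src/evaluation.py. The pairwise comparison uses memory proportional to the number of positives times negatives, so a library routine such as scikit-learn's `roc_auc_score` is the practical choice on the full dataset. H-measure is not covered here.

```python
import numpy as np

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as one half.
    """
    pos = scores[y_true == 1]
    neg = scores[y_true == -1]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

def confusion_matrix(y_true, y_pred):
    """2x2 counts: rows are true class, columns are predicted class, order (-1, +1)."""
    labels = (-1, 1)
    return np.array(
        [[np.sum((y_true == t) & (y_pred == p)) for p in labels] for t in labels]
    )

# Toy test labels and scores (illustrative only).
y_test = np.array([-1, -1, -1, -1, 1, -1, 1, 1])
test_scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90])
y_pred = np.where(test_scores >= 0.5, 1, -1)

auc = roc_auc(y_test, test_scores)
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

AUC depends only on the ranking of scores, while precision and recall depend on the chosen threshold, which is why the notebook reports both kinds of metric.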
- Fix random seeds for:
- sampling,
- splits,
- SMOTE,
- PyTorch initialization.
- Log dataset size at each stage:
- full/subset,
- train/val/test,
- post-IQR,
- post-SMOTE.
- Record training hyperparameters (T, depth, hidden size, learning rate, epochs, batch size).
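One way to follow the seeding item in the checklist above is a single helper called at the top of the notebook. This is a sketch with a hypothetical function name, `seed_everything`. It seeds PyTorch only if it is installed. Splits and SMOTE would additionally need the same seed passed through their `random_state` arguments.

```python
import random

import numpy as np

def seed_everything(seed):
    """Seed Python, NumPy, and (if installed) PyTorch; return the seed for logging."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch  # optional here so the sketch runs without PyTorch
        torch.manual_seed(seed)
    except ImportError:
        pass
    return seed

# Re-seeding reproduces the same random draws.
seed_everything(42)
first = np.random.rand(3)
seed_everything(42)
second = np.random.rand(3)
print(np.array_equal(first, second))
```

Storing the returned seed next to the recorded hyperparameters makes a run easier to repeat later.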
This work is an educational implementation for a course project based on the referenced paper and public dataset.