GitHub - Kaioh17/DBDT: From-scratch PyTorch implementation of Deep Boosting Decision Trees (DBDT) for credit card fraud detection.

Efficient Fraud Detection Using Deep Boosting Decision Trees (DBDT)

This repository contains a from-scratch implementation of key components from:

Biao Xu, Yao Wang, Xiuwu Liao, Kaidong Wang
Efficient Fraud Detection Using Deep Boosting Decision Trees (2023)

The implementation focuses on fraud detection with highly imbalanced data using:

Soft Decision Trees (SDT)
Deep Boosting Decision Trees with SGD-style optimization (DBDT-SGD)
Prototype DBDT-Com-style training scaffold for exploratory comparison
Imbalance-aware preprocessing and evaluation strategy

Project Goal

Reproduce and study the DBDT method for credit card fraud detection while preserving:

tree-like interpretability,
deep representation capacity,
robust evaluation on imbalanced data.

Dataset

Name: Credit Card Fraud Detection
Source: Kaggle - mlg-ulb/creditcardfraud
Access in code: loaded with kagglehub in src/preprocessing.py
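
As a rough illustration of the loading step, the sketch below downloads the dataset with kagglehub and maps the labels to {-1, +1}. It is not the code in src/preprocessing.py. The function names, the file name creditcard.csv, the Class column, and the choice of +1 for fraud are assumptions made for this example.

```python
import csv
import os


def to_signed_label(raw_class: str) -> int:
    """Map the dataset's 0/1 class value to -1 (legitimate) or +1 (fraud)."""
    return 1 if int(float(raw_class)) == 1 else -1


def load_creditcard_rows(limit=None):
    """Yield (features, label) pairs from the Kaggle credit card dataset.

    Hypothetical helper: assumes the download directory contains
    creditcard.csv with a `Class` column holding 0/1 labels.
    """
    import kagglehub  # imported lazily so to_signed_label works without it

    dataset_dir = kagglehub.dataset_download("mlg-ulb/creditcardfraud")
    csv_path = os.path.join(dataset_dir, "creditcard.csv")
    with open(csv_path, newline="") as handle:
        for index, row in enumerate(csv.DictReader(handle)):
            if limit is not None and index >= limit:
                break
            label = to_signed_label(row.pop("Class"))
            features = [float(value) for value in row.values()]
            yield features, label
```

Calling list(load_creditcard_rows(limit=5)) would fetch (or reuse the cached copy of) the dataset and return the first five rows, which is a quick way to confirm that Kaggle access works before running the notebook.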

Repository Structure

main.ipynb
Main experimental notebook (data prep, training, threshold tuning, evaluation, plots).
src/preprocessing.py
Data loading, train/validation/test splitting, scaling, IQR filtering, SMOTE, plotting, and metrics helpers.
src/sdt.py
Soft Decision Tree implementation:
- inner-node MLP routing,
- soft/hard forward passes,
- path probability computation,
- Xavier initialization.
src/dbdt.py
DBDT-SGD implementation:
- ensemble of SDTs,
- exponential-loss residual fitting,
- local + global objective accumulation,
- regularization terms.
src/pdsca.py
Experimental DBDT-Com trainer scaffold:
- ensemble score aggregation,
- minibatch training wrapper,
- parameter updates and test-time scoring.
Note: this is a simplified prototype, not a full implementation of the paper's Algorithm 2 (PDSCA).
src/baselines.py
Baseline model definitions and unified score extraction:
- logistic regression,
- random forest,
- MLP,
- optional XGBoost / LightGBM.
src/evaluation.py
Metric and validation utilities:
- stratified 10-fold CV for baseline comparisons,
- AUC / H-measure / F1 / precision / recall summaries.
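
To make two of the ideas listed above concrete (path probabilities in a soft decision tree and exponential-loss residuals in boosting), here is a minimal pure-Python sketch. It is not the code in src/sdt.py or src/dbdt.py. It assumes a complete binary tree stored in breadth-first order, takes the routing probabilities as given (in the repository they would come from the inner-node MLPs), and uses the standard exponential loss exp(-y * F(x)) with labels in {-1, +1}. All function names are hypothetical.

```python
import math


def leaf_path_probabilities(right_probs, depth):
    """Probability of reaching each leaf of a complete binary tree.

    right_probs[i] is the probability of branching right at inner node i,
    with nodes in breadth-first order (children of i are 2i+1 and 2i+2).
    """
    if len(right_probs) != 2 ** depth - 1:
        raise ValueError("expected one routing probability per inner node")
    reach = [1.0]  # probability of reaching each node on the current level
    first = 0      # breadth-first index of the first node on this level
    for _ in range(depth):
        next_reach = []
        for offset, p_reach in enumerate(reach):
            p_right = right_probs[first + offset]
            next_reach.append(p_reach * (1.0 - p_right))  # left child
            next_reach.append(p_reach * p_right)          # right child
        first += len(reach)
        reach = next_reach
    return reach


def soft_prediction(right_probs, leaf_values, depth):
    """Soft forward pass: leaf values weighted by their path probabilities."""
    probs = leaf_path_probabilities(right_probs, depth)
    return sum(p * v for p, v in zip(probs, leaf_values))


def hard_prediction(right_probs, leaf_values, depth):
    """Hard forward pass: follow the more likely branch at every inner node."""
    node = 0
    for _ in range(depth):
        node = 2 * node + (2 if right_probs[node] >= 0.5 else 1)
    return leaf_values[node - (2 ** depth - 1)]


def exp_loss_pseudo_residuals(labels, scores):
    """Negative gradient of exp(-y * F) with respect to the ensemble score F."""
    return [y * math.exp(-y * f) for y, f in zip(labels, scores)]
```

In a generic boosting loop, each new tree would be fitted to these pseudo-residuals, so samples the current ensemble misclassifies (large exp(-y * F)) receive more weight. Whether the repository weights samples in exactly this way is not stated here, so treat the last function as a textbook illustration only.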

Environment Setup

Python 3.10+ is recommended.

Install required dependencies:

pip install torch numpy scikit-learn imbalanced-learn matplotlib tqdm kagglehub hmeasure

If running on GPU, install a CUDA-compatible PyTorch build matching your system.

How to Run

1. Open main.ipynb.
2. Run cells top to bottom.
3. Confirm dataset loads successfully from Kaggle.
4. Train and evaluate DBDT_SGD (primary recreated pipeline).
5. Run exploratory DBDT-Com prototype cells if needed.
6. Run baseline CV summary for contextual comparison.

Practical Compute Note (Important)

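The full dataset is highly imbalanced (the Kaggle release contains 492 frauds among 284,807 transactions, about 0.17%). SMOTE oversamples the minority class, by default up to the size of the majority class, so applying it can greatly increase training size and compute time.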

To keep experiments feasible and reproducible, this project supports using a stratified subset of the data before training:

preserve class ratio with stratified sampling,
apply SMOTE only on training data,
keep validation/test distribution realistic,
report sizes before and after SMOTE.

This is an intentional experimental constraint and should be documented in the methodology section of any report or presentation.
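
The stratified subsetting described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the repository's implementation: the function name stratified_subset and its parameters are hypothetical, and in practice scikit-learn's train_test_split with the stratify argument does the same job.

```python
import random
from collections import defaultdict


def stratified_subset(labels, fraction, seed=0):
    """Return sorted indices of a subset that keeps roughly the class ratio."""
    rng = random.Random(seed)  # local generator, so the result is reproducible
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for indices in by_class.values():
        # Sample the same fraction from every class; never drop a class entirely.
        keep = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, keep))
    return sorted(chosen)


# Example: 1% positives before and after taking a 10% subset.
labels = [-1] * 990 + [1] * 10
subset = stratified_subset(labels, fraction=0.1, seed=42)
print(len(labels), "->", len(subset), "rows;",
      sum(labels[i] == 1 for i in subset), "positive")
```

SMOTE would then be applied only to the training portion of this subset, leaving validation and test data at their original class ratio.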

The paper's reported setup uses larger dedicated hardware and full-data cross-validation. In this repository:

DBDT experiments are typically run with a stratified holdout protocol due to compute limits.
Baseline models are compared using stratified 10-fold CV.

Current Pipeline Summary

1. Load data and convert labels to {-1, +1}
2. Train/test split (stratified)
3. Train/validation split (stratified)
4. Standardize selected numeric features
5. Remove outliers with IQR (training set)
6. Apply SMOTE (training set only)
7. Train SDT/DBDT-SGD
8. Tune decision threshold on validation set
9. Evaluate on test set with metrics and plots
10. (Optional) run DBDT-Com prototype and baseline CV for comparison
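
The threshold-tuning step can be illustrated with a small sketch that sweeps candidate thresholds over validation scores and keeps the one with the best F1. The use of F1 as the selection criterion, the function names, and the toy data are assumptions for this example. The notebook may use a different criterion.

```python
def f1_at_threshold(labels, scores, threshold):
    """F1 for the positive class (+1) when predicting +1 if score >= threshold."""
    tp = fp = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == -1:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
    if tp == 0:
        return 0.0  # no true positives: define F1 as 0 instead of dividing by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(val_labels, val_scores):
    """Return (threshold, f1) maximizing F1 over the observed validation scores."""
    candidates = sorted(set(val_scores))
    # On ties, max() keeps the first (lowest) candidate threshold.
    best = max(candidates, key=lambda t: f1_at_threshold(val_labels, val_scores, t))
    return best, f1_at_threshold(val_labels, val_scores, best)


# Toy validation data with labels in {-1, +1}.
val_labels = [-1, -1, 1, 1, -1, 1]
val_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
threshold, f1 = tune_threshold(val_labels, val_scores)
```

The chosen threshold is then frozen and applied unchanged to the test set, so that test metrics are not tuned on test data.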

Metrics Used

The notebook reports multiple metrics suitable for imbalanced classification:

AUC
F1-score
Precision
Recall
H-measure
Confusion matrix
ROC curve
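
For readers who want to see how some of these metrics are defined, the following sketch computes confusion-matrix counts, precision, recall, F1, and a rank-based AUC for labels in {-1, +1}. It is a teaching example with hypothetical function names, not the code in src/evaluation.py. The pairwise AUC is quadratic in the number of samples, so a library routine is the better choice on the full dataset. H-measure is not reproduced here.

```python
def confusion_counts(labels, predictions):
    """Return (tp, fp, tn, fn) with +1 as the positive (fraud) class."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, predictions):
        if y == 1 and p == 1:
            tp += 1
        elif y == -1 and p == 1:
            fp += 1
        elif y == -1 and p == -1:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == -1]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))
```

Because accuracy is dominated by the majority class on data this imbalanced, these class-sensitive metrics give a more honest picture of fraud detection quality.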

Reproducibility Tips

Fix random seeds for:
- sampling,
- splits,
- SMOTE,
- PyTorch initialization.
Log dataset size at each stage:
- full/subset,
- train/val/test,
- post-IQR,
- post-SMOTE.
Record training hyperparameters (T, depth, hidden size, learning rate, epochs, batch size).
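
A seed-fixing helper along these lines is one way to follow the first tip. It is a sketch, and the name set_seed is hypothetical. NumPy and PyTorch are seeded only if they are installed.

```python
import random


def set_seed(seed: int) -> None:
    """Seed Python's RNG and, when available, NumPy and PyTorch."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # seeds PyTorch's default generators
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


set_seed(42)
```

Splitting and resampling utilities take their own seeds, so the same value should also be passed as random_state to scikit-learn's train_test_split and to imbalanced-learn's SMOTE.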

Acknowledgment

This work is an educational implementation for a course project based on the referenced paper and public dataset.
