A comprehensive machine learning project for detecting and preventing health insurance fraud using advanced classification models and domain-specific heuristic triggers.
This project develops a fraud detection system that uses machine learning models to identify potentially fraudulent health insurance claims. The system combines traditional ML algorithms with domain-specific fraud detection triggers to provide a robust, end-to-end fraud detection solution.
- Flag suspicious claims for possible fraud and further investigation
- Reduce losses and increase profitability for insurers/reinsurers
- Reduce insurance premiums for customers
- Provide competitive advantage for insurers/reinsurers
- Multiple ML models: Decision Tree, Random Forest, XGBoost, GLM, Naive Bayes, GBM
- Imbalanced dataset handling: ADASYN, SMOTE, MWMOTE, ROSE
- Domain-specific fraud triggers (8 types)
- Comprehensive performance evaluation with ROC-AUC, PR-AUC, and classification metrics
CAS-Project-Health-Insurance-Fraud-Detection/
├── data/
│ ├── raw/ # Original, unprocessed data
│ ├── interim/ # Intermediate transformed data
│ └── external/ # External data sources (triggers, hospital lists)
├── src/
│ ├── features/ # Feature engineering & trigger functions
├── results/ # Model outputs and results
├── docs/ # Additional documentation
├── main.Rmd # Main R Markdown file containing the analysis code
└── README.md
R Version: 4.0 or higher
Required R Packages:
# Data manipulation
library(readxl)
library(janitor)
library(dplyr)
library(tidyverse)
library(lubridate)
# Machine Learning
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
library(xgboost)
library(gbm)
library(e1071)
library(naivebayes)
library(glmnet)
# Imbalanced data handling
library(smotefamily)
library(imbalance)
library(ROSE)
# Evaluation & Visualization
library(pROC)
library(PRROC)
library(ROCR)
library(ggplot2)
library(kableExtra)
# Utilities
library(psych)
library(bruceR)
library(caTools)
library(mlbench)

- Clone the repository:
git clone https://github.com/RohanYashraj/CAS-Project-Health-Insurance-Fraud-Detection.git
cd CAS-Project-Health-Insurance-Fraud-Detection

- Install R dependencies:
# Run this in R console
install.packages(c("readxl", "janitor", "dplyr", "tidyverse", "lubridate",
"caret", "rpart", "rpart.plot", "randomForest", "xgboost",
"gbm", "e1071", "naivebayes", "glmnet", "smotefamily",
"imbalance", "ROSE", "pROC", "PRROC", "ROCR", "ggplot2",
"kableExtra", "psych", "bruceR", "caTools", "mlbench"))

- Prepare Data: Place your cleaned input data in data/raw/cleaned_input_data.rds
- Run Main Analysis: Open and knit notebooks/02-modeling/final_code.Rmd
- Review Results: Check the results/ folder for outputs
Main analysis workflow:
- Data preprocessing and feature engineering
- Application of fraud detection triggers
- Dataset balancing using multiple techniques
- Model training across 6 algorithms
- Performance evaluation and comparison
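As a minimal sketch of the data-splitting step that precedes model training, a 70/30 train/test split can be done in base R. The toy data frame and column names below (e.g. `fraud_flag`) are illustrative assumptions, not the project's actual schema:

```r
set.seed(42)  # reproducible split

# Toy stand-in for the cleaned claims data; real column names may differ
claims <- data.frame(
  claim_amount  = runif(200, 1000, 50000),
  hospital_days = sample(1:30, 200, replace = TRUE),
  fraud_flag    = factor(sample(c("no", "yes"), 200,
                                replace = TRUE, prob = c(0.9, 0.1)))
)

# 70/30 split by random row indices
train_idx <- sample(nrow(claims), size = floor(0.7 * nrow(claims)))
train_set <- claims[train_idx, ]
test_set  <- claims[-train_idx, ]
```

In the project itself, `caret::createDataPartition()` can produce a stratified split that preserves the fraud/non-fraud ratio in both sets, which matters for heavily imbalanced data.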
The project implements 8 domain-specific fraud triggers:
- Claim Amount Trigger: Flags claims exceeding expected package amounts
- Hospital Days Trigger: Identifies excessive length of stay
- Age Trigger: Flags procedures inappropriate for patient's age group
- Gender Trigger: Detects gender-specific procedure mismatches
- Claim Count Trigger: Identifies excessive claim frequency
- Close Proximity Trigger: Flags claims filed too soon after policy commencement
- Treatment Date Validity: Ensures treatments within policy coverage periods
- Claim Reported Delay: Identifies unusual reporting delays
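To illustrate the trigger pattern, here is a hedged sketch of the Close Proximity trigger in base R. The function name, column layout, and the 90-day window are assumptions for illustration, not the project's actual thresholds:

```r
# Sketch of the "Close Proximity" trigger: flag claims filed within a
# waiting-period window after policy commencement.
# The 90-day default is an assumed threshold, not a project constant.
close_proximity_trigger <- function(policy_start, claim_date, window_days = 90) {
  as.numeric(claim_date - policy_start) < window_days
}

policy_start <- as.Date(c("2023-01-01", "2023-01-01", "2022-06-15"))
claim_date   <- as.Date(c("2023-02-15", "2023-09-01", "2022-07-01"))

# TRUE where the claim falls inside the suspicious early window
flags <- close_proximity_trigger(policy_start, claim_date)
```

Each trigger in the project follows this shape: a vectorized predicate over claim fields whose output becomes a binary feature for the models.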
- Decision Tree: Interpretable rule-based classifier
- Random Forest: Ensemble of decision trees
- XGBoost: Gradient boosting framework
- GLM: Generalized Linear Model (Logistic Regression)
- Naive Bayes: Probabilistic classifier
- GBM: Generalized Boosted Regression Model
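As a small sketch of fitting the most interpretable of the six models, a classification tree can be trained with `rpart` (one of the packages listed above). The built-in `iris` data stands in for the claims data purely for illustration:

```r
library(rpart)

# Fit a classification tree on a stand-in dataset
tree_fit <- rpart(Species ~ ., data = iris, method = "class")

# Predict classes and inspect training accuracy
pred <- predict(tree_fit, iris, type = "class")
train_acc <- mean(pred == iris$Species)

# rpart.plot::rpart.plot(tree_fit) would draw the fitted tree,
# which is how the project visualizes decision rules
```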
Each model is evaluated both on the original (imbalanced) data and after applying each of four balancing techniques:
- ADASYN: Adaptive Synthetic Sampling
- SMOTE: Synthetic Minority Oversampling Technique
- MWMOTE: Majority Weighted Minority Oversampling Technique
- ROSE: Random Over-Sampling Examples
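All four techniques resample or synthesize minority-class (fraud) records until the classes are better balanced. As a conceptual stand-in only — naive random oversampling with replacement, not the interpolation used by SMOTE/ADASYN/MWMOTE — the idea can be sketched in base R:

```r
set.seed(7)

# Imbalanced toy data: ~10% fraud
dat <- data.frame(
  x     = rnorm(100),
  fraud = factor(c(rep("yes", 10), rep("no", 90)))
)

# Naive random oversampling: duplicate minority rows until class counts match.
# This is a simplified illustration, not the actual SMOTE algorithm.
oversample <- function(df, class_col) {
  counts   <- table(df[[class_col]])
  minority <- names(which.min(counts))
  n_extra  <- max(counts) - min(counts)
  extra    <- df[sample(which(df[[class_col]] == minority),
                        n_extra, replace = TRUE), ]
  rbind(df, extra)
}

balanced <- oversample(dat, "fraud")
```

In the project, `smotefamily::SMOTE()`, `smotefamily::ADAS()`, `imbalance::mwmote()`, and `ROSE::ROSE()` replace this naive step with synthetic minority examples rather than exact duplicates.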
Models are evaluated using:
- Accuracy: Overall classification accuracy
- Sensitivity (Recall): True positive rate
- Specificity: True negative rate
- Precision: Positive predictive value
- F1-Score: Harmonic mean of precision and recall
- F2-Score: Weighted F-score prioritizing recall
- ROC-AUC: Area under ROC curve
- PR-AUC: Area under Precision-Recall curve
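The threshold-based metrics above all derive from the confusion matrix; a base-R sketch (with made-up predictions, where "yes" denotes fraud) shows how each is computed, including the recall-weighted F2 score:

```r
# Toy predictions; 'truth' and 'pred' are factors with "yes" = fraud
truth <- factor(c("yes", "yes", "no", "no", "no", "yes", "no", "no"),
                levels = c("no", "yes"))
pred  <- factor(c("yes", "no",  "no", "yes", "no", "yes", "no", "no"),
                levels = c("no", "yes"))

tp <- sum(pred == "yes" & truth == "yes")
fp <- sum(pred == "yes" & truth == "no")
tn <- sum(pred == "no"  & truth == "no")
fn <- sum(pred == "no"  & truth == "yes")

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)              # recall / true positive rate
specificity <- tn / (tn + fp)              # true negative rate
precision   <- tp / (tp + fp)              # positive predictive value
f1 <- 2 * precision * sensitivity / (precision + sensitivity)
f2 <- 5 * precision * sensitivity / (4 * precision + sensitivity)  # beta = 2

# ROC-AUC and PR-AUC instead need predicted probabilities, e.g. via
# pROC::roc() and PRROC::pr.curve() as listed in the dependencies
```

The F2 score weights recall four times as heavily as precision (beta = 2), which suits fraud detection, where missing a fraudulent claim is usually costlier than investigating a legitimate one.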
Comprehensive results comparing all model configurations are available in:
- results/final_output.xlsx: Detailed performance metrics
- results/results.rds: Results in R data format
SSSIHL-CAS Project Team
This project is proprietary. To request access, please contact the authors of the article.
- SSSIHL for institutional support
- Insurance industry partners for data and domain expertise
For questions or access requests, please contact the project authors.
Note: Due to the sensitive nature of fraud detection data, the data has been masked.