A comprehensive machine learning project for detecting and preventing health insurance fraud using advanced classification models and domain-specific heuristic triggers.
This project develops a fraud detection system that uses machine learning models to identify potentially fraudulent health insurance claims. The system combines traditional ML algorithms with domain-specific fraud detection triggers to provide a robust, end-to-end fraud detection solution.
- Flag suspicious claims for possible fraud and further investigation
- Reduce losses and increase profitability for insurers/reinsurers
- Reduce insurance premiums for customers
- Provide competitive advantage for insurers/reinsurers
- Multiple ML models: Decision Tree, Random Forest, XGBoost, GLM, Naive Bayes, GBM
- Imbalanced dataset handling: ADASYN, SMOTE, MWMOTE, ROSE
- Domain-specific fraud triggers (8 types)
- Comprehensive performance evaluation with ROC-AUC, PR-AUC, and classification metrics
CAS-Project-Health-Insurance-Fraud-Detection/
├── data/
│ ├── raw/ # Original, unprocessed data
│ ├── interim/ # Intermediate transformed data
│ └── external/ # External data sources (triggers, hospital lists)
├── src/
│ ├── features/ # Feature engineering & trigger functions
├── results/ # Model outputs and results
├── docs/ # Additional documentation
├── main.Rmd # Main R Markdown file containing the analysis code
└── README.md
R Version: 4.0 or higher
Required R Packages:
# Data manipulation
library(readxl)
library(janitor)
library(dplyr)
library(tidyverse)
library(lubridate)
# Machine Learning
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
library(xgboost)
library(gbm)
library(e1071)
library(naivebayes)
library(glmnet)
# Imbalanced data handling
library(smotefamily)
library(imbalance)
library(ROSE)
# Evaluation & Visualization
library(pROC)
library(PRROC)
library(ROCR)
library(ggplot2)
library(kableExtra)
# Utilities
library(psych)
library(bruceR)
library(caTools)
library(mlbench)

- Clone the repository:
git clone https://github.com/RohanYashraj/CAS-Project-Health-Insurance-Fraud-Detection.git
cd CAS-Project-Health-Insurance-Fraud-Detection

- Install R dependencies:
# Run this in R console
install.packages(c("readxl", "janitor", "dplyr", "tidyverse", "lubridate",
"caret", "rpart", "rpart.plot", "randomForest", "xgboost",
"gbm", "e1071", "naivebayes", "glmnet", "smotefamily",
"imbalance", "ROSE", "pROC", "PRROC", "ROCR", "ggplot2",
"kableExtra", "psych", "bruceR", "caTools", "mlbench"))

- Prepare Data: Place your cleaned input data in data/raw/cleaned_input_data.rds
- Run Main Analysis: Open and knit notebooks/02-modeling/final_code.Rmd
- Review Results: Check the results/ folder for outputs
Main analysis workflow:
- Data preprocessing and feature engineering
- Application of fraud detection triggers
- Dataset balancing using multiple techniques
- Model training across 6 algorithms
- Performance evaluation and comparison
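As a minimal sketch of the data-splitting step that precedes model training, a 70/30 train/test split can be done in base R. The toy data frame and column names below (e.g. `fraud_flag`) are illustrative assumptions, not the project's actual schema:

```r
set.seed(42)  # reproducible split

# Toy stand-in for the cleaned claims data; real column names may differ
claims <- data.frame(
  claim_amount  = runif(200, 1000, 50000),
  hospital_days = sample(1:30, 200, replace = TRUE),
  fraud_flag    = factor(sample(c("no", "yes"), 200,
                                replace = TRUE, prob = c(0.9, 0.1)))
)

# 70/30 split by random row indices
train_idx <- sample(nrow(claims), size = floor(0.7 * nrow(claims)))
train_set <- claims[train_idx, ]
test_set  <- claims[-train_idx, ]
```

In the project itself, `caret::createDataPartition()` can produce a stratified split that preserves the fraud/non-fraud ratio in both sets, which matters for heavily imbalanced data.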
The project implements 8 domain-specific fraud triggers:
- Claim Amount Trigger: Flags claims exceeding expected package amounts
- Hospital Days Trigger: Identifies excessive length of stay
- Age Trigger: Flags procedures inappropriate for patient's age group
- Gender Trigger: Detects gender-specific procedure mismatches
- Claim Count Trigger: Identifies excessive claim frequency
- Close Proximity Trigger: Flags claims filed too soon after policy commencement
- Treatment Date Validity: Ensures treatments within policy coverage periods
- Claim Reported Delay: Identifies unusual reporting delays
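To illustrate the trigger pattern, here is a hedged sketch of the Close Proximity trigger in base R. The function name, column layout, and the 90-day window are assumptions for illustration, not the project's actual thresholds:

```r
# Sketch of the "Close Proximity" trigger: flag claims filed within a
# waiting-period window after policy commencement.
# The 90-day default is an assumed threshold, not a project constant.
close_proximity_trigger <- function(policy_start, claim_date, window_days = 90) {
  as.numeric(claim_date - policy_start) < window_days
}

policy_start <- as.Date(c("2023-01-01", "2023-01-01", "2022-06-15"))
claim_date   <- as.Date(c("2023-02-15", "2023-09-01", "2022-07-01"))

# TRUE where the claim falls inside the suspicious early window
flags <- close_proximity_trigger(policy_start, claim_date)
```

Each trigger in the project follows this shape: a vectorized predicate over claim fields whose output becomes a binary feature for the models.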
- Decision Tree: Interpretable rule-based classifier
- Random Forest: Ensemble of decision trees
- XGBoost: Gradient boosting framework
- GLM: Generalized Linear Model (Logistic Regression)
- Naive Bayes: Probabilistic classifier
- GBM: Generalized Boosted Regression Model
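As a small sketch of fitting the most interpretable of the six models, a classification tree can be trained with `rpart` (one of the packages listed above). The built-in `iris` data stands in for the claims data purely for illustration:

```r
library(rpart)

# Fit a classification tree on a stand-in dataset
tree_fit <- rpart(Species ~ ., data = iris, method = "class")

# Predict classes and inspect training accuracy
pred <- predict(tree_fit, iris, type = "class")
train_acc <- mean(pred == iris$Species)

# rpart.plot::rpart.plot(tree_fit) would draw the fitted tree,
# which is how the project visualizes decision rules
```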
Each model is evaluated both on the original (imbalanced) data and after applying each of four balancing techniques:
- ADASYN: Adaptive Synthetic Sampling
- SMOTE: Synthetic Minority Oversampling Technique
- MWMOTE: Majority Weighted Minority Oversampling Technique
- ROSE: Random Over-Sampling Examples
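All four techniques resample or synthesize minority-class (fraud) records until the classes are better balanced. As a conceptual stand-in only — naive random oversampling with replacement, not the interpolation used by SMOTE/ADASYN/MWMOTE — the idea can be sketched in base R:

```r
set.seed(7)

# Imbalanced toy data: ~10% fraud
dat <- data.frame(
  x     = rnorm(100),
  fraud = factor(c(rep("yes", 10), rep("no", 90)))
)

# Naive random oversampling: duplicate minority rows until class counts match.
# This is a simplified illustration, not the actual SMOTE algorithm.
oversample <- function(df, class_col) {
  counts   <- table(df[[class_col]])
  minority <- names(which.min(counts))
  n_extra  <- max(counts) - min(counts)
  extra    <- df[sample(which(df[[class_col]] == minority),
                        n_extra, replace = TRUE), ]
  rbind(df, extra)
}

balanced <- oversample(dat, "fraud")
```

In the project, `smotefamily::SMOTE()`, `smotefamily::ADAS()`, `imbalance::mwmote()`, and `ROSE::ROSE()` replace this naive step with synthetic minority examples rather than exact duplicates.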
Models are evaluated using:
- Accuracy: Overall classification accuracy
- Sensitivity (Recall): True positive rate
- Specificity: True negative rate
- Precision: Positive predictive value
- F1-Score: Harmonic mean of precision and recall
- F2-Score: Weighted F-score prioritizing recall
- ROC-AUC: Area under ROC curve
- PR-AUC: Area under Precision-Recall curve
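The threshold-based metrics above all derive from the confusion matrix; a base-R sketch (with made-up predictions, where "yes" denotes fraud) shows how each is computed, including the recall-weighted F2 score:

```r
# Toy predictions; 'truth' and 'pred' are factors with "yes" = fraud
truth <- factor(c("yes", "yes", "no", "no", "no", "yes", "no", "no"),
                levels = c("no", "yes"))
pred  <- factor(c("yes", "no",  "no", "yes", "no", "yes", "no", "no"),
                levels = c("no", "yes"))

tp <- sum(pred == "yes" & truth == "yes")
fp <- sum(pred == "yes" & truth == "no")
tn <- sum(pred == "no"  & truth == "no")
fn <- sum(pred == "no"  & truth == "yes")

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)              # recall / true positive rate
specificity <- tn / (tn + fp)              # true negative rate
precision   <- tp / (tp + fp)              # positive predictive value
f1 <- 2 * precision * sensitivity / (precision + sensitivity)
f2 <- 5 * precision * sensitivity / (4 * precision + sensitivity)  # beta = 2

# ROC-AUC and PR-AUC instead need predicted probabilities, e.g. via
# pROC::roc() and PRROC::pr.curve() as listed in the dependencies
```

The F2 score weights recall four times as heavily as precision (beta = 2), which suits fraud detection, where missing a fraudulent claim is usually costlier than investigating a legitimate one.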
Comprehensive results comparing all model configurations are available in:
- results/final_output.xlsx: Detailed performance metrics
- results/results.rds: Results in R data format
SSSIHL-CAS Project Team
This project is proprietary. To request access, please contact the authors of the article.
- SSSIHL for institutional support
- Insurance industry partners for data and domain expertise
For questions or access requests, please contact the project authors.
Note: Due to the sensitive nature of fraud detection data, the data has been masked.