Defect Predictor using RepoMiner

This project implements a defect predictor that uses RepoMiner and PyDriller to analyze commit history and predict the risk associated with files in a given revision. It extends the RepoMiner example with a risk calculation based on commit history.

Setup Instructions

Installation

Follow the steps below to set up and run the project.

1. Clone the repo

First, clone the repo, preferably into the directory where you want to work.

git clone https://github.com/muksaw/ML-Based-Defect-Predictor.git

2. Configuration Options

Open config.json. It lets you customize the following options:

{
    "url_to_repo": "Repository URL to analyze",
    "clone_repo_to": "Local path to clone the repository",
    "branch": "Branch to analyze",
    "from_date": "Start date for analysis (YYYY-MM-DD)",
    "to_date": "End date for analysis (YYYY-MM-DD)",
    "confidence_threshold": "Base confidence threshold (0.0-1.0)",
    "model_path": "Path to save/load the trained model",
    "file_extensions": ["Extensions to include in analysis"],
    "max_commits": "Maximum number of commits to analyze",
    "time_decay_factor": "Half-life for time weighting in days"
}
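As a rough illustration of how these options fit together, the sketch below loads a config file and checks it against the keys listed above. The helper names (load_config, validate_config) and the example values are assumptions made for illustration; they are not taken from the project's code.

```python
import json
from datetime import datetime

# Keys copied from the config.json template above.
REQUIRED_KEYS = {
    "url_to_repo", "clone_repo_to", "branch", "from_date", "to_date",
    "confidence_threshold", "model_path", "file_extensions",
    "max_commits", "time_decay_factor",
}


def load_config(path="config.json"):
    """Read the JSON config from disk (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def validate_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(config))]

    # Dates are documented as YYYY-MM-DD.
    parsed = {}
    for key in ("from_date", "to_date"):
        if key in config:
            try:
                parsed[key] = datetime.strptime(config[key], "%Y-%m-%d")
            except (TypeError, ValueError):
                problems.append(f"{key} must use the YYYY-MM-DD format")
    if len(parsed) == 2 and parsed["from_date"] > parsed["to_date"]:
        problems.append("from_date must not be later than to_date")

    # The threshold is documented as a value between 0.0 and 1.0.
    threshold = config.get("confidence_threshold")
    if isinstance(threshold, (int, float)) and not 0.0 <= threshold <= 1.0:
        problems.append("confidence_threshold must be between 0.0 and 1.0")
    return problems


# Example values are placeholders chosen for illustration only.
example_config = {
    "url_to_repo": "https://github.com/muksaw/ML-Based-Defect-Predictor.git",
    "clone_repo_to": "./cloned_repo",
    "branch": "main",
    "from_date": "2024-01-01",
    "to_date": "2024-06-30",
    "confidence_threshold": 0.7,
    "model_path": "./model.pkl",
    "file_extensions": [".py"],
    "max_commits": 500,
    "time_decay_factor": 30,
}
print(validate_config(example_config))
```

Running a check like this before building the image can catch a mistyped date or a missing key early.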

Fill in these parameters for the repository you want to analyze. Then make sure the ground truth lists files you know to be buggy, following this CSV format (columns below):

risky_files,from_date,to_date,github_url,branch
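To make the expected layout concrete, here is a minimal sketch that reads a ground-truth file and rejects one whose header differs from the columns above. The function name read_ground_truth and the sample row are assumptions for illustration; the sketch also assumes one file path per row, which the project may or may not require.

```python
import csv
import io

# Column order copied from the header line above.
EXPECTED_COLUMNS = ["risky_files", "from_date", "to_date", "github_url", "branch"]


def read_ground_truth(handle):
    """Parse ground-truth rows from an open text file (hypothetical helper)."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return list(reader)


# The row below is made-up sample data, not taken from the project.
sample = io.StringIO(
    "risky_files,from_date,to_date,github_url,branch\n"
    "src/example.py,2024-01-01,2024-06-30,https://github.com/example/project,main\n"
)
rows = read_ground_truth(sample)
print(rows[0]["risky_files"])
```

For a real file, pass open("ground_truth.csv", newline="", encoding="utf-8") instead of the in-memory sample.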

3. Build the Docker image

Create a Docker image (see Dockerfile) for the project:

docker build -t defect-predictor .

4. Run the project

Run a container from the image. To keep results on the host, also mount the outputs directory with Docker's -v option; the command below does not include a mount:

docker run --rm -it defect-predictor

Understanding Results and Risk Scores

The model uses two primary metrics to evaluate file risk:

Confidence Score (0-1)

The machine learning model's confidence that a file contains bugs:

0.5-0.7: Low confidence
0.7-0.85: Medium confidence
0.85-1.0: High confidence
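The bands above can be expressed as a small lookup. This is a sketch, not the project's code: the function name confidence_label is hypothetical, scores exactly on a boundary (0.7, 0.85) are assigned to the higher band as an assumption, and scores below 0.5 get a placeholder label because the list above does not cover them.

```python
def confidence_label(score):
    """Map a confidence score in [0, 1] to the bands listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence score must be between 0 and 1")
    if score >= 0.85:
        return "High confidence"
    if score >= 0.7:  # assumption: a boundary value falls into the higher band
        return "Medium confidence"
    if score >= 0.5:
        return "Low confidence"
    # Not covered by the documented bands; placeholder label.
    return "Below reported range"


for value in (0.55, 0.7, 0.9):
    print(value, confidence_label(value))
```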

Relative Risk Score

A normalized metric that compares each file's risk against repository averages:

<0.5: Low Risk - significantly safer than average
0.5-1.0: Medium-Low Risk - somewhat safer than average
1.0-1.5: Medium Risk - around average
1.5-3.0: Medium-High Risk - higher risk than average
>3.0: High Risk - significantly higher risk than average
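One plausible reading of "compares each file's risk against repository averages" is a ratio of the file's raw score to the repository mean, so that 1.0 means exactly average. The sketch below follows that reading and then applies the categories above. The function names, the ratio formula, the handling of boundary values, and the treatment of values above 3.0 as High Risk are all assumptions; the project's actual normalization may differ.

```python
def relative_risk(file_score, all_scores):
    """Ratio of one file's raw risk score to the repository mean (assumed formula)."""
    if not all_scores:
        raise ValueError("all_scores must not be empty")
    mean_score = sum(all_scores) / len(all_scores)
    if mean_score == 0:
        return 0.0  # no risk signal anywhere in the repository
    return file_score / mean_score


def risk_category(relative):
    """Map a relative risk value to the categories listed above."""
    if relative < 0.5:
        return "Low Risk"
    if relative < 1.0:
        return "Medium-Low Risk"
    if relative < 1.5:
        return "Medium Risk"
    if relative <= 3.0:
        return "Medium-High Risk"
    return "High Risk"


# Made-up raw scores for three files.
scores = [1.0, 2.0, 3.0]
ratio = relative_risk(3.0, scores)
print(ratio, risk_category(ratio))
```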

Key Features

Machine Learning-Based Defect Prediction: Uses a Random Forest classifier to identify potentially buggy files
Time-Weighted Analysis: Gives more weight to recent commits and bug fixes
Relative Risk Scoring: Compares each file against repository averages
Adaptive Confidence Threshold: Automatically adjusts prediction sensitivity based on time span
Risk Categorization: Categorizes files into risk levels (Low to High)
Test File Exclusion: Excludes test files from analysis
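The time-weighted analysis can be pictured as exponential decay with a half-life, which matches the time_decay_factor option described as a half-life in days. The sketch below is an assumption about how such weighting could work, not the project's implementation; the function names and the sample dates are made up.

```python
from datetime import date


def time_weight(commit_date, reference_date, half_life_days):
    """Weight that halves every half_life_days of commit age."""
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    age_days = max((reference_date - commit_date).days, 0)
    return 0.5 ** (age_days / half_life_days)


def weighted_fix_count(fix_dates, reference_date, half_life_days):
    """Sum of decayed weights, so recent bug fixes count more than old ones."""
    return sum(time_weight(d, reference_date, half_life_days) for d in fix_dates)


# Made-up bug-fix dates for one file, measured against the end of the window.
end_of_window = date(2024, 1, 31)
fixes = [date(2024, 1, 31), date(2024, 1, 1), date(2023, 12, 2)]
print(weighted_fix_count(fixes, end_of_window, half_life_days=30))
```

With a 30-day half-life, a fix made on the reference date counts fully, one made 30 days earlier counts half, and one made 60 days earlier counts a quarter.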

Overview of files

ml_defect_predictor.py

Main defect prediction algorithm with machine learning and risk scoring

ml_harness.py

Script to run the defect predictor and display results

requirements.txt

Lists the dependencies required to run the project.

ground_truth.csv

Provides a reference dataset containing start and end dates along with the modified files. It is used to compare expected results with actual outputs.

config.json

Allows you to specify the repository to be tested, the local path where it should be cloned, the branch to analyze, and the start and end dates.

Dockerfile

Sets up the environment (Python, Git, dependencies, and the project files) needed to run the defect predictor inside a container.
