This repository contains my solution for Task 5 of the AI & ML Internship. The focus of this task is to explore tree-based models—specifically Decision Trees and Random Forests—using the Heart Disease dataset. The goal is to understand model interpretability, overfitting control, ensemble methods, and feature importance.
**Objective:** Learn tree-based models for classification and regression.
| File Name | Description |
|---|---|
| `heart.csv` | Dataset used for classification |
| `tree_based_models.ipynb` | Jupyter Notebook with all steps: training, tuning, visualization |
| `screenshots/` | Folder containing plots of trees and feature importances |
| `README.md` | Project documentation |
- **Data Exploration**
  - Performed basic EDA using pandas to understand structure, check for null values, and examine the target distribution.
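The EDA step can be sketched as below. A tiny synthetic DataFrame stands in for `heart.csv` so the snippet runs standalone; the column names here are illustrative, and in the notebook the same calls operate on the real dataset.

```python
import pandas as pd

# Stand-in for heart.csv; "age", "chol", and "target" are illustrative
# column names, not the full 13-feature schema of the real dataset.
df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
    "target": [1, 1, 0, 1],
})

print(df.shape)                      # structure: (rows, columns)
print(df.isnull().sum())             # null-value check per column
print(df["target"].value_counts())  # class balance of the target
```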
- **Train/Test Split**
  - Divided data into features (X) and labels (y), with an 80/20 split for training and testing.
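A minimal sketch of the split, using synthetic data in place of `heart.csv`; assuming the label column is named `target`, as in the standard Heart Disease dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for heart.csv with a balanced binary target
df = pd.DataFrame({"age": range(20, 120), "target": [0, 1] * 50})

X = df.drop(columns=["target"])  # features
y = df["target"]                 # labels

# 80/20 split; stratify keeps the class proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 1) (20, 1)
```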
- **Decision Tree Classifier**
  - Trained a base decision tree using `DecisionTreeClassifier`.
  - Visualized the tree using `plot_tree` to interpret splits and predictions.
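The training and visualization steps look roughly like this; synthetic features stand in for the heart-disease columns, and the `Agg` backend is set so the script runs headless:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render to file, not a window
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in for the heart-disease features
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# plot_tree draws each split: feature threshold, impurity, sample counts
fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(tree, filled=True, feature_names=[f"f{i}" for i in range(5)], ax=ax)
fig.savefig("tree.png")
```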
- **Controlling Overfitting**
  - Tuned `max_depth` to limit the depth of the tree and avoid overfitting.
  - Compared train and test accuracy across different depths.
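The depth comparison can be sketched as a simple loop (synthetic data again; the depth values are illustrative). An unconstrained tree memorizes the training set, so the train/test gap widens as depth grows:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Shallow trees underfit; deep trees overfit. Watching train vs. test
# accuracy across depths shows where the gap starts to widen.
for depth in [2, 4, 6, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, round(clf.score(X_tr, y_tr), 3), round(clf.score(X_te, y_te), 3))
```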
- **Random Forest Classifier**
  - Trained a `RandomForestClassifier` to improve accuracy and reduce overfitting.
  - Compared performance with the decision tree model.
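A sketch of the head-to-head comparison on synthetic data; `n_estimators=200` is an illustrative choice, not necessarily what the notebook uses:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Bagging plus random feature subsets average out single-tree variance
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("tree  :", tree.score(X_te, y_te))
print("forest:", forest.score(X_te, y_te))
```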
- **Feature Importance**
  - Extracted feature importances using `.feature_importances_`.
  - Visualized top contributing features using bar plots.
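The extraction step can be sketched as below; `f0`…`f5` are placeholder feature names standing in for the heart-disease columns:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
names = [f"f{i}" for i in range(6)]  # placeholder feature names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; sorting surfaces the top contributors
imp = pd.Series(forest.feature_importances_, index=names).sort_values(ascending=False)
print(imp)
# imp.plot.barh() would draw the kind of bar plot used in the notebook
```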
- **Model Evaluation**
  - Used accuracy scores, confusion matrices, and cross-validation (`cross_val_score`) to assess model performance and robustness.
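The cross-validation call is a one-liner; this sketch uses synthetic data and 5 folds (the fold count in the notebook may differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# 5-fold CV: each fold serves once as the held-out test set
scores = cross_val_score(RandomForestClassifier(random_state=1), X, y, cv=5)
print(scores.mean(), scores.std())
```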
- Python 3.12
- `pandas`, `numpy` for data handling
- `matplotlib`, `seaborn` for plotting
- `scikit-learn` for model building and evaluation
This task deepened my understanding of tree-based models. I learned:
- How decision trees make splits based on feature thresholds.
- How to visualize trees for better interpretability.
- Why random forests (via bagging) outperform single trees.
- How feature importance can guide insights in medical data.
- The trade-off between bias and variance when tuning tree depth.
1. Clone the repository:

   ```bash
   git clone https://github.com/anmolthakur74/task-5-tree-models.git
   cd task-5-tree-models
   ```

2. Install dependencies:

   ```bash
   pip install pandas numpy matplotlib seaborn scikit-learn
   ```

3. Open the notebook:

   ```bash
   jupyter notebook tree_based_models.ipynb
   ```

**Author:** Anmol Thakur
GitHub: [anmolthakur74](https://github.com/anmolthakur74)