Cars price prediction
This repo holds a coding task from a job application. The task was as follows:
You need to build a model that predicts the price of the car using Python. Please make sure to use at least 3 different methods/algorithms to predict the price and report which one will be most accurate.
The dataset contains 8 features and 301 rows:
- Car Name: Nominal
| Missing | Unique |
|---|---|
| 1 | 47 |
- Year: Numeric
| Min | Max | Mean | StdDev |
|---|---|---|---|
| 2003 | 2018 | 2013 | 2.891 |
- Mileage: Numeric
| Min | Max | Mean | StdDev |
|---|---|---|---|
| 500 | 500000 | 36947.206 | 38886.884 |
- Fuel type: Nominal
| Label | Count |
|---|---|
| Petrol | 239 |
| Diesel | 60 |
| CNG | 2 |
- Seller Type: Nominal
| Label | Count |
|---|---|
| Dealer | 195 |
| Individual | 106 |
- Transmission type: Nominal
| Label | Count |
|---|---|
| Manual | 261 |
| Automatic | 40 |
- Previous owners: Numeric
| Min | Max | Mean | StdDev |
|---|---|---|---|
| 0 | 1 | 0.043 | 0.248 |
- Selling price: Numeric
| Min | Max | Mean | StdDev |
|---|---|---|---|
| 0.1 | 35 | 4.643 | 5.081 |
Below is a histogram of all features.
- Remove Outliers
- Fill the Missing value
- Convert to numeric representation
- Add new features
- Handle car names
- Feature selection
- Shallow learning (Machine Learning)
- Linear regression
- Random forest regressor
- Support vector regression
- Deep learning
- CNN
- Baseline model
- Standardizing Dataset
- Tuning the network
- CNN
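The preprocessing and shallow-learning steps listed above can be sketched end to end as follows. This is a minimal illustration, not the repository's actual code: it runs on a small synthetic table, and the column names (`car_name`, `year`, `mileage`, `fuel_type`, `seller_type`, `transmission`, `owners`, `selling_price`), the derived `age` feature, the mode imputation, and the 99th-percentile mileage cutoff are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the real dataset (all values are made up).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "car_name": rng.choice(["city", "swift", "i20", "fortuner"], n),
    "year": rng.integers(2003, 2019, n),
    "mileage": rng.integers(500, 120_000, n).astype(float),
    "fuel_type": rng.choice(["Petrol", "Diesel", "CNG"], n, p=[0.79, 0.2, 0.01]),
    "seller_type": rng.choice(["Dealer", "Individual"], n),
    "transmission": rng.choice(["Manual", "Automatic"], n),
    "owners": rng.integers(0, 2, n),
})
df["selling_price"] = (
    0.5 * (df["year"] - 2003) - 2e-5 * df["mileage"]
    + 3.0 * (df["fuel_type"] == "Diesel") + rng.normal(0, 0.5, n) + 1.0
)
df.loc[0, "car_name"] = np.nan  # one missing name, as in the summary above

# Remove outliers: drop rows above the 99th mileage percentile (assumed rule).
df = df[df["mileage"] <= df["mileage"].quantile(0.99)].copy()
# Fill the missing value with the most frequent name (assumed strategy).
df["car_name"] = df["car_name"].fillna(df["car_name"].mode()[0])
# Add a new feature: car age relative to the newest model year.
df["age"] = df["year"].max() - df["year"]

categorical = ["car_name", "fuel_type", "seller_type", "transmission"]
numeric = ["age", "mileage", "owners"]
X = df[categorical + numeric]
y = df["selling_price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

def make_model(estimator):
    """One-hot encode nominal columns, scale numeric ones, then fit."""
    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", StandardScaler(), numeric),
    ])
    return make_pipeline(pre, estimator)

models = {
    "LR": LinearRegression(),
    "RFR": RandomForestRegressor(n_estimators=100, random_state=0),
    "SVR": SVR(kernel="rbf"),
}
results = {}
for name, estimator in models.items():
    pred = make_model(estimator).fit(X_train, y_train).predict(X_test)
    mse = mean_squared_error(y_test, pred)
    results[name] = {"MSE": mse, "RMSE": mse ** 0.5, "R2": r2_score(y_test, pred)}
    print(name, results[name])
```

Wrapping the encoder and scaler in a pipeline means they are fitted on the training split only, so the test scores are not inflated by information from the held-out rows.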
| Preprocessing | Models | MSE | R2 | RMSE | Mean |
|---|---|---|---|---|---|
| Without Names | LR | 4.096 | 0.668 | 4.096 | |
| One-Hot-Encoding | LR | 4.105 | 0.731 | 4.105 | |
| Without Names | RFR | 4.205 | 0.660 | 4.205 | |
| One-Hot-Encoding | RFR | 2.723 | 0.821 | 2.723 | |
| Without Names | SVM | 12.037 | -0.026 | 0.026 | |
| One-Hot-Encoding | SVM | 15.883 | -0.042 | 15.883 | |
| One-Hot-Encoding | CNN | 22.26 | -27.59 | | |
| One-Hot-Encoding | CNN | 4.91 | -4.05 | | |
| One-Hot-Encoding | CNN | 5.26 | -4.19 | | |
As the table shows, the Random Forest Regressor with one-hot-encoded car names gives the highest R2 score (0.821, about 82%) and the lowest mean squared error (2.723), meaning its predictions lie closest to the line of best fit.
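For reference, the three metrics in the table are related as follows: MSE is the mean of the squared errors, RMSE is its square root, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. A negative R2 means the model does worse than always predicting the mean of the test targets. The sketch below computes all three on made-up numbers (not values from this project); the function name `regression_metrics` is hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R2) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared gaps between actual and predicted.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared gaps between actual and their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return mse, math.sqrt(mse), 1.0 - ss_res / ss_tot

# Made-up prices for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
mse, rmse, r2 = regression_metrics(actual, predicted)
print(mse, rmse, r2)
```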
Plot of the Random Forest Regressor model, which achieved the highest accuracy.
The graph shows predicted versus actual values; the predictions miss the actual values in some places.
R-Squared & RMSE by number of features
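A curve like this can be produced by running recursive feature elimination (RFE) for each candidate number of features and scoring every subset with K-fold cross-validation. The sketch below is one way to do it with scikit-learn on synthetic data; the estimator, the 5 folds, and the data itself are assumptions, not the notebook's actual settings.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline

# Synthetic regression problem: 8 features, only 3 of them informative.
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

scores = {}
for k in range(1, X.shape[1] + 1):
    # RFE sits inside the pipeline so features are re-selected in every fold,
    # which avoids leaking information from the validation split.
    model = make_pipeline(RFE(LinearRegression(), n_features_to_select=k),
                          LinearRegression())
    cv_out = cross_validate(model, X, y, cv=cv,
                            scoring=("r2", "neg_root_mean_squared_error"))
    scores[k] = {
        "R2": cv_out["test_r2"].mean(),
        "RMSE": -cv_out["test_neg_root_mean_squared_error"].mean(),
    }
    print(k, scores[k])

best_k = max(scores, key=lambda k: scores[k]["R2"])
print("best number of features by mean R2:", best_k)
```

Plotting `scores[k]["R2"]` and `scores[k]["RMSE"]` against `k` gives the two curves; the same loop works with a random forest in place of `LinearRegression`, since RFE only needs an estimator that exposes `coef_` or `feature_importances_`.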
- Libraries
- Load dataset
- Exploratory Data Analysis
- Selling price
- Convert to numeric representation
- Missing values
- Year
- Mileage
- Car Name
- Feature Extraction
- Correlation matrix
- Split the data (training/testing)
- Models
- Shallow Learning (ML)
- Linear regression
- Feature Selection using RFE & K-Fold Cross Validation
- Random forest regressor
- Feature Selection using RFE & K-Fold Cross Validation
- Support vector regression
- Deep learning
- CNN
- Conda setup


