To run the notebooks in this repository, install the following libraries:
- pandas 0.24.2
- NumPy 1.17.4
- seaborn 0.9.0
- scikit-learn 0.21.2
- TensorFlow 2.6.0
- Flask 2.1.3
- Plotly 5.11
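As a quick way to compare an environment against the versions listed above, the following minimal sketch uses the standard-library `importlib.metadata` module (Python 3.8 or later). The helper names `check_versions` and `status_of` are illustrative and are not part of this repository; the package names are assumed to be the PyPI distribution names.

```python
from importlib import metadata

# Versions as listed in this README, keyed by (assumed) PyPI distribution name.
REQUIRED = {
    "pandas": "0.24.2",
    "numpy": "1.17.4",
    "seaborn": "0.9.0",
    "scikit-learn": "0.21.2",
    "tensorflow": "2.6.0",
    "Flask": "2.1.3",
    "plotly": "5.11",
}


def check_versions(required):
    """Return {package: (expected, installed_or_None)} for every entry."""
    report = {}
    for name, expected in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (expected, installed)
    return report


def status_of(expected, installed):
    """Compare versions component-wise, so '5.11' matches '5.11.0' but not '5.110'."""
    if installed is None:
        return "missing"
    wanted = expected.split(".")
    found = installed.split(".")[: len(wanted)]
    return "ok" if found == wanted else "differs"


# Print one line per library so mismatches are easy to spot.
for name, (expected, installed) in check_versions(REQUIRED).items():
    print(f"{name}: expected {expected}, found {installed} ({status_of(expected, installed)})")
```

A "differs" status does not necessarily mean the notebooks will fail; it only flags that the installed version is not the one the project was developed with.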
The purpose of this project is to cluster the input data, which describes a particular commodity. Several clustering methods were compared, and a suitable clustering approach was selected based on the comparison results.
This project demonstrates:
- Extensive exploratory data analysis
- Use of tensorflow.keras to build an autoencoder
- Use of evaluation metrics to compare clustering approaches
- Use of CRISP-DM steps
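To illustrate what comparing clustering approaches with evaluation metrics can look like, here is a minimal sketch using scikit-learn. It is not the code from the notebooks: the data is synthetic, and the two algorithms and two metrics shown are assumptions chosen for illustration, since this README does not name the methods or metrics that were actually used.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in data; the project's real input is the CSV in the Data folder.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Hypothetical candidates, chosen only for illustration.
candidates = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
}


def score_clusterings(data, models):
    """Fit each model and return its internal validation scores."""
    results = {}
    for name, model in models.items():
        labels = model.fit_predict(data)
        results[name] = {
            # Silhouette lies in [-1, 1]; higher means better-separated clusters.
            "silhouette": silhouette_score(data, labels),
            # Davies-Bouldin is non-negative; lower means better-separated clusters.
            "davies_bouldin": davies_bouldin_score(data, labels),
        }
    return results


scores = score_clusterings(X, candidates)
for name, metrics in scores.items():
    print(f"{name}: silhouette={metrics['silhouette']:.3f}, "
          f"davies_bouldin={metrics['davies_bouldin']:.3f}")
```

Both metrics are internal measures that need no ground-truth labels, which is why they are a common choice when comparing clusterings of unlabeled data.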
The repository consists of two main folders: Data and Code.
The Data folder has:
- cmd_attributes_v3_upload.csv: This file has the pre-processed input data
The Code folder has:
- CMD_Clustering_Steps1_2.ipynb: A Jupyter notebook covering the first 2 steps of CRISP-DM (Business & data understanding)
- CMD_Clustering_Steps3_5.ipynb: A Jupyter notebook covering steps 3, 4 and 5 of CRISP-DM (Data preparation, modeling & evaluation)
- cmd_data_output.csv: A CSV file with cluster information, used by the Python script
- CMD_Clustering_Deploy_Step6.py: A Python script for a web app showing details about the clusters
- index.html & chart_4.html: HTML templates used in the web app
- Clustering_Project_v4.pdf: Document describing the overall approach for clustering
Steps for running the Python script:
- cd SKU_Clusters ## go to the location of the repository
- python ./Code/CMD_Clustering_Deploy_Step6.py
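Before starting the script, it can help to confirm that the files it relies on are present. The sketch below is an optional pre-flight check and is not part of the repository; the function name `missing_files` is hypothetical, and the paths are taken from the folder listing in this README, relative to the repository root.

```python
from pathlib import Path

# Files named in this README, relative to the repository root.
EXPECTED_FILES = [
    "Code/CMD_Clustering_Deploy_Step6.py",
    "Code/cmd_data_output.csv",
]


def missing_files(repo_root, expected=EXPECTED_FILES):
    """Return the expected relative paths that are not files under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_file()]


# Run from the repository root (the directory entered with `cd` above).
missing = missing_files(".")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All expected files found; the script can be started.")
```

The list can be extended with any other files the web app needs in a given setup.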
Thanks to the Python open-source community for creating the valuable libraries used in this project.