In order to run the notebooks in this repository, the following libraries have to be installed:
- Pandas 0.24.2
- Numpy 1.17.4
- pickle
- sqlalchemy
- seaborn 0.9.0
- scikit-learn 0.21.2
- nltk
The purpose of this project is to build a classifier for messages received in disaster zones. The classifier assigns each incoming message to a category, and the predicted category can then be used to route the message to the appropriate agency. The intended benefit is a prompt response to incoming messages.
This project demonstrates:
- Use of Pipeline to execute ML workflow
- The workflow steps consist of a) reading data and cleaning data b) training a classifier c) evaluating the trained classifier
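As a rough illustration of the workflow steps listed above, the following is a minimal sketch of a scikit-learn Pipeline that is trained and then evaluated. The toy messages, the two category names and the choice of CountVectorizer with its default tokenizer are assumptions made for this example; the notebooks in this repository tokenize with nltk and use the real disaster data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in data (hypothetical, not taken from disaster_messages.csv)
messages = [
    "we need clean drinking water",
    "no water in the village since monday",
    "water supply is contaminated",
    "please send bottled water",
    "my father needs a doctor",
    "we need medicine and bandages",
    "injured people need medical help",
    "the clinic has no medicine left",
]
labels = ["water"] * 4 + ["medical"] * 4

# a) read and clean data is replaced here by the toy lists above
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=0
)

# b) train a classifier: each step of the pipeline feeds the next one
pipeline = Pipeline([
    ("vect", CountVectorizer()),      # message text -> token counts
    ("tfidf", TfidfTransformer()),    # token counts -> TF-IDF weights
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(X_train, y_train)

# c) evaluate the trained classifier on the held-out messages
predictions = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("held-out accuracy:", accuracy)
```

Keeping the vectorizer and the classifier in one Pipeline object means the same text transformation is applied at training and at prediction time.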
The repository consists of 2 main folders: Data and Code
The Data folder has:
- 2 CSV files: disaster_messages.csv and disaster_categories.csv are the input data files. These files contain the messages received from disaster regions and the corresponding categories of the messages, respectively
- Database file: DisasterResponse.db is a SQLite database. This database has a main table (Message_Category). This table contains the clean data [X: Tokenized message, y: Categories] used to train and evaluate the classifier
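To show how the Message_Category table could be read back into X and y, here is a small sketch using the sqlite3 module from the Python standard library. It builds an in-memory database as a stand-in for DisasterResponse.db, and the column names (message, water, medical) are hypothetical; the real table has the columns produced by process_data.py.

```python
import sqlite3

# In-memory stand-in for ./Data/DisasterResponse.db (assumption for this sketch)
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE Message_Category (message TEXT, water INTEGER, medical INTEGER)"
)
connection.executemany(
    "INSERT INTO Message_Category VALUES (?, ?, ?)",
    [
        ("we need clean drinking water", 1, 0),
        ("my father needs a doctor", 0, 1),
        ("no water and no medicine", 1, 1),
    ],
)

# Read the whole table and split it into messages (X) and categories (y)
cursor = connection.execute("SELECT * FROM Message_Category")
column_names = [description[0] for description in cursor.description]
records = cursor.fetchall()
connection.close()

X = [record[0] for record in records]    # first column: message text
y = [record[1:] for record in records]   # remaining columns: category flags
category_names = column_names[1:]
```

With the real database file, the path passed to sqlite3.connect would be the database_filepath given to the scripts.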
The Code folder has:
- Data_ETL.ipynb & process_data.py: These are the Jupyter notebook and the corresponding python script for reading, cleaning and loading of the input data into a database
- ML_NLP_Workflow.ipynb and train_classifier.py: These are the Jupyter notebook and the corresponding python script for training the classifier. The script uses a grid search to choose between RandomForest and KNeighbors classifiers
- Model_Evaluation.ipynb: This notebook analyzes the performance of the classifier. The output categories are separated into 2 sets [prominent and other] based on the frequency of their occurrence in the dataset
- run.py: This script loads the trained model and presents the model as a web app. The location of the trained model is set inside the script
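One way a grid search can choose between RandomForest and KNeighbors classifiers is to treat the classifier step of a Pipeline as a searchable parameter. The sketch below uses scikit-learn's GridSearchCV for this; the toy data, the parameter values and the use of TfidfVectorizer are assumptions for illustration and may differ from what train_classifier.py actually does.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy stand-in data (hypothetical)
messages = [
    "we need clean drinking water",
    "no water in the village since monday",
    "water supply is contaminated",
    "please send bottled water",
    "my father needs a doctor",
    "we need medicine and bandages",
    "injured people need medical help",
    "the clinic has no medicine left",
]
labels = ["water"] * 4 + ["medical"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=0)),  # placeholder, replaced by the grid
])

# Each dictionary is searched separately, so every classifier only sees
# the hyperparameters that belong to it.
param_grid = [
    {"clf": [RandomForestClassifier(random_state=0)], "clf__n_estimators": [10, 20]},
    {"clf": [KNeighborsClassifier()], "clf__n_neighbors": [1, 3]},
]

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(messages, labels)

best_classifier = search.best_estimator_.named_steps["clf"]
print("selected classifier:", type(best_classifier).__name__)
```

The fitted search.best_estimator_ is a complete pipeline, so it can be saved and later used for prediction without repeating the search.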
Instructions:
- process_data.py: This script accepts 3 input parameters -
a) messages_filepath (str): Location of the csv file containing the disaster messages
b) categories_filepath (str): Location of the csv file containing the categories for the disaster messages
c) database_filepath (str): String containing the location and name of the database. The pandas Dataframe with the transformed data will be saved as a table in this database
d) This script is run from the terminal:
cd NLP_Project ## go to the location of the repository
python ./Code/process_data.py ./Data/disaster_messages.csv ./Data/disaster_categories.csv ./Data/DisasterResponse1.db
- train_classifier.py: This script accepts 2 input parameters -
a) database_filepath (str): String containing the location and name of the database. This database has the input data for training (as a table)
b) model_filepath (str): String containing the location where the trained model should be stored (as a pickle file)
c) This script is run from the terminal:
python ./Code/train_classifier.py ./Data/DisasterResponse1.db ./Code/cv_model1.sav
- run.py: This script does not have any input parameters. Before running it, verify that the location of the saved model set in run.py is correct
python ./Code/run.py
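The hand-off between train_classifier.py and run.py described above can be sketched as follows: the training script receives the two file paths on the command line and pickles the model, and the web app later unpickles it from the same location. The helper names parse_train_args, save_model and load_model are hypothetical, and a plain dictionary stands in for the trained model; the actual scripts may be organized differently.

```python
import os
import pickle
import tempfile


def parse_train_args(argv):
    """Return (database_filepath, model_filepath) from a sys.argv-style list."""
    if len(argv) != 3:
        raise SystemExit(
            "usage: python train_classifier.py <database_filepath> <model_filepath>"
        )
    return argv[1], argv[2]


def save_model(model, model_filepath):
    """Store the model as a pickle file at model_filepath."""
    with open(model_filepath, "wb") as model_file:
        pickle.dump(model, model_file)


def load_model(model_filepath):
    """Load a pickled model, as run.py would before serving predictions."""
    with open(model_filepath, "rb") as model_file:
        return pickle.load(model_file)


# Demonstration with a temporary directory instead of ./Code
model_path = os.path.join(tempfile.mkdtemp(), "cv_model1.sav")
database_filepath, model_filepath = parse_train_args(
    ["train_classifier.py", "./Data/DisasterResponse1.db", model_path]
)

stand_in_model = {"classifier": "placeholder"}  # a fitted pipeline in practice
save_model(stand_in_model, model_filepath)
restored_model = load_model(model_filepath)
```

Because run.py takes no parameters, the path it loads from has to match the model_filepath that was passed to train_classifier.py.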
Thanks to the Python open source community for creating the valuable libraries used in this project.
This project uses normalized dataset of truckload shipments
Apache license