I began by reviewing the full data table and removing variables that would not be useful for modeling. Unique identifiers such as ‘company id’ were dropped, along with variables that would only be available after a lead converted to a client, such as ‘converted date’. ‘Created date’ was also removed; the month and year could have been extracted from it to look for seasonal trends, but given the time constraint and the other variables available I chose to pass on that approach for now. Most variables contained plenty of null values, so the numerical ones were imputed with the mean and the categorical ones with an ‘other’ category. In the past I have also used a KNN classifier to fill missing values for categorical variables highly correlated with the target, but it did not seem worth the extra time and computation for this dataset. I then examined the value ranges and averages; some max values were much larger than their means, suggesting outliers, and those were addressed later. The categorical variables were label encoded. I chose this over one-hot encoding because some variables, such as country and state, have many unique values, so label encoding saved computation time. Visualizing the target variable showed the dataset was slightly imbalanced, with fewer positive values for successful leads. A correlation matrix showed some features were correlated with each other, and I considered PCA for dimensionality reduction, but with only ~25 variables and categorical data (which does not work well with PCA) I let the models filter out the less important variables. The last step before testing was splitting the data into features and response, and then into train and test sets.
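A minimal sketch of this preprocessing in pandas/scikit-learn is shown below. The column names and the toy DataFrame are hypothetical stand-ins for the real lead table, which is not shown in the report:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Toy stand-in for the lead table; all column names here are hypothetical.
df = pd.DataFrame({
    "company_id": range(10),
    "converted_date": [None] * 10,
    "created_date": ["2021-01-01"] * 10,
    "employees": [5, None, 20, 8, None, 50, 3, 12, 7, 40],
    "country": ["US", "CA", None, "US", "MX", "US", None, "CA", "US", "MX"],
    "converted": [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

# Drop unique identifiers and post-conversion fields that would leak the target.
df = df.drop(columns=["company_id", "converted_date", "created_date"])

# Impute: column mean for numeric features, the category 'other' for categoricals.
num_cols = df.drop(columns="converted").select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
df[cat_cols] = df[cat_cols].fillna("other")

# Label encode categoricals instead of one-hot encoding to keep the feature
# count low for high-cardinality variables such as country or state.
for col in cat_cols:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# Split into features/response, then train/test (stratified to preserve the
# class imbalance in both splits).
X = df.drop(columns="converted")
y = df["converted"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

One caveat worth noting: fitting the mean imputation on the full table before the train/test split technically lets test-set statistics influence the training data; moving the imputer inside a pipeline would avoid that.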
To prevent data leakage, I implemented each model through a pipeline. Each pipeline started with SMOTE to oversample the positive class and balance the data, followed by a robust scaler to reduce the influence of outliers, a standard scaler to normalize the features, and finally the model. The pipelines were fit to the training data using a cross-validated grid search to find the best parameters, then used to predict the test set. I chose recall as the scoring metric because it is important for the company to identify as many positive leads as possible, even at the cost of some time wasted on false positives, rather than miss conversions through false negatives. The three models used were logistic regression, a gradient boosting classifier, and a random forest classifier. I chose the latter two because ensembles that combine many weak learners into a strong one often perform well. Both greatly outperformed the logistic model, and either would be a good choice for production. The gradient boosting and random forest models had accuracy, precision, and recall scores around 0.95, while logistic regression scored 0.65, 0.8, and 0.65 respectively. The gradient boosting model was slightly faster than the random forest, which gives it an advantage, and had a higher validation score of 0.96 compared to 0.91. Its performance was confirmed when the confusion matrix was visualized: it identified 97% of all true conversions compared to 91% for the random forest.
The clear winner was the gradient boosting classifier, which was able to identify 97% of all true positives that would turn into clients.