This project focuses on analyzing and predicting customer churn for a telecommunications company using machine learning techniques. The analysis involves data preprocessing, exploratory data analysis (EDA), and building a predictive model using PySpark and scikit-learn.
The dataset, Telco-Customer-Churn.csv, is loaded into a Pandas DataFrame for initial exploration.
- The `customerID` column is dropped as it is not relevant for analysis.
- The `TotalCharges` column, initially of type `object`, is converted to a numeric type to facilitate analysis. Missing values in this column are imputed using the mean value.
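The preprocessing steps above can be sketched in Pandas roughly as follows. This is a minimal illustration, not the project's actual code: the helper name `clean_telco` and the three-row sample frame are made up for the example, and it assumes that non-numeric `TotalCharges` entries (such as blank strings) are the source of the missing values.

```python
import pandas as pd


def clean_telco(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the ID column and make TotalCharges numeric with mean imputation."""
    out = df.drop(columns=["customerID"])
    # errors="coerce" turns entries that cannot be parsed as numbers into NaN.
    out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")
    # Replace the resulting NaN values with the column mean.
    out["TotalCharges"] = out["TotalCharges"].fillna(out["TotalCharges"].mean())
    return out


# Made-up sample for illustration only; the real data would come from
# pd.read_csv("Telco-Customer-Churn.csv").
sample = pd.DataFrame(
    {
        "customerID": ["a", "b", "c"],
        "TotalCharges": ["10.0", " ", "30.0"],
    }
)
cleaned = clean_telco(sample)
print(cleaned["TotalCharges"].tolist())
```

Working on a copy (the result of `drop`) leaves the originally loaded DataFrame untouched, which makes it easier to re-run the cleaning step during exploration.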
Features are grouped into categorical and continuous types. Categorical features include `gender`, `Partner`, `Dependents`, and others, while the features treated as continuous are `SeniorCitizen` (stored as a 0/1 integer flag), `tenure`, `MonthlyCharges`, and `TotalCharges`.
A count plot is used to visualize the distribution of churn, showing that 73% of customers did not churn, while 27% did.
Bar charts are used to visualize the frequency of categorical features relative to churn. For example, the analysis shows higher churn rates for customers with "Fiber optic" internet service compared to those with "DSL".
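The numbers behind such a bar chart are per-category churn rates. A minimal sketch of that calculation is shown below; the helper `churn_rate_by` and the eight-row sample are hypothetical, and the code assumes the `Churn` column is coded as "Yes"/"No".

```python
import pandas as pd


def churn_rate_by(df: pd.DataFrame, feature: str, target: str = "Churn") -> pd.Series:
    """Share of churned customers within each level of a categorical feature."""
    churned = df[target].eq("Yes")  # assumption: churn is coded "Yes"/"No"
    # The mean of a boolean series is the fraction of True values per group.
    return churned.groupby(df[feature]).mean().sort_values(ascending=False)


# Made-up rows for illustration only, not figures from the real dataset.
sample = pd.DataFrame(
    {
        "InternetService": ["Fiber optic"] * 4 + ["DSL"] * 4,
        "Churn": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
    }
)
rates = churn_rate_by(sample, "InternetService")
print(rates)
```

Calling `rates.plot(kind="bar")` on the result would give a chart comparable to the ones described above.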
Histograms are used to visualize the distribution of continuous features, such as tenure and MonthlyCharges, for churned and non-churned customers.
- The dataset is split into training (80%) and testing (20%) sets.
- `StringIndexer` is used to convert categorical features into a format suitable for machine learning algorithms.
- `VectorAssembler` is used to assemble all feature columns into a single vector.
A RandomForestClassifier is used to build the predictive model due to its ability to handle non-linear relationships and interactions between features. The model is trained using the training dataset.
The model's performance is evaluated using the Area Under the ROC Curve (AUC-ROC) metric, which is found to be 0.66913. The model is saved for future use, and the predictions are exported to CSV files for further analysis.
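To make the metric concrete: AUC-ROC equals the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, so 0.5 corresponds to random ranking and 1.0 to perfect ranking. The pure-Python sketch below computes it from that definition on four made-up predictions. It is an illustration only, not the evaluator used in the project, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = churned) and predicted churn probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75: three of the four positive/negative pairs are ordered correctly
```

The pairwise loop is quadratic in the number of rows, so it is suited to small illustrations; library evaluators compute the same quantity more efficiently.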
This project demonstrates a comprehensive approach to customer churn analysis and prediction using machine learning. By preprocessing the data, performing EDA, and building a predictive model, valuable insights are gained into customer behavior and churn patterns. The model can be used to identify at-risk customers and inform retention strategies.