This project applies Genetic Algorithms (GA) to optimize customer classification by selecting the most relevant features. The dataset undergoes preprocessing, and various GA strategies (selection, crossover, mutation) are tested to enhance classification accuracy.
This project leverages the following Python libraries:
- NumPy: Fast numerical calculations and matrix operations.
- Pandas: Data manipulation and preprocessing.
- Seaborn: Advanced data visualization.
- Matplotlib: Data visualization and graph plotting.
- Scikit-Learn: Machine learning utilities for classification and model evaluation.
- Imbalanced-Learn: Handling imbalanced datasets.
- Itertools: Built-in Python module for working with iterators and combinatorial operations.
- Random: Built-in Python library for randomization in GA processes.
- Tabulate: Creating formatted tables for result presentation.
The dataset used in this project focuses on customer segmentation classification and includes various features related to demographic and behavioral aspects of customers. The goal is to classify customers into different segments based on their attributes.
The dataset consists of the following key input features:
Customer Demographic Information:
- Gender: Categorical feature indicating "Male" or "Female".
- Ever Married: Binary feature ("Yes" or "No") indicating marital status.
- Age: Numerical feature representing customer age.
- Graduated: Binary feature ("Yes" or "No") indicating if the customer has completed higher education.
- Profession: Categorical feature representing the customer's profession (e.g., "Engineer", "Artist", "Doctor", etc.).
- Work Experience: Numerical feature indicating the number of years the customer has worked.
- Family Size: Numerical feature representing the total number of family members.
Customer Behavioral Information:
- Spending Score: Categorical feature indicating the spending behavior of the customer ("Low", "Medium", "High").
- Var_1: A categorical feature representing an unknown segmentation attribute, which might indicate population grouping.
Target Variable:
- Segmentation: The class label categorizing customers into four segments ("A", "B", "C", "D").
The dataset is preprocessed to handle missing values, remove outliers, encode categorical features, and normalize numerical features before applying the Genetic Algorithm for optimized feature selection and classification.
Data preprocessing ensures that the dataset is clean and suitable for feature selection and model training. The following steps are performed:
Handling Missing Values: Missing values are imputed using appropriate strategies:
- Numerical Features: Imputed using the RandomForestRegressor or RandomForestClassifier model to predict missing values based on existing data patterns.
- Categorical Features: Filled using the most frequent category (mode) or a separate "Unknown" category.
Removing Outliers: Outliers are detected and removed to enhance model stability:
- Z-score Method: Identifies values beyond a specified standard deviation threshold.
- Interquartile Range (IQR): Removes values outside the acceptable range.
Feature Encoding: Categorical variables are transformed for machine learning algorithms:
- One-Hot Encoding: Converts categorical features into binary vectors for nominal categories.
- Label Encoding: Assigns numerical values to categorical classes where order matters.
Scaling & Normalization: Ensures numerical features have consistent ranges:
- Min-Max Scaling: Scales values between 0 and 1 to preserve relationships.
- Standardization (Z-score Normalization): Centers data with mean 0 and variance 1.
These preprocessing steps enhance feature representation, reduce bias, and improve the performance of the Genetic Algorithm in selecting optimal features for customer segmentation.
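As a rough illustration of the steps above, the sketch below chains mode/median imputation, IQR-based outlier removal, one-hot encoding, and min-max scaling with pandas. The helper name `preprocess`, the 1.5×IQR threshold, and the median fill (standing in for the project's RandomForest-based imputation) are assumptions, not the project's exact configuration.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative preprocessing sketch; thresholds and strategies are assumptions."""
    df = df.copy()
    # 1. Missing values: mode for categoricals; median for numericals
    #    (the project uses RandomForest-based imputation, simplified here).
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].fillna(df[col].mode()[0])
    for col in df.select_dtypes(include="number"):
        df[col] = df[col].fillna(df[col].median())
    # 2. Outlier removal via the 1.5*IQR rule on numerical columns.
    for col in df.select_dtypes(include="number"):
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df = df[(df[col] >= q1 - 1.5 * iqr) & (df[col] <= q3 + 1.5 * iqr)]
    # 3. One-hot encode nominal categorical columns.
    df = pd.get_dummies(df, columns=df.select_dtypes(include="object").columns)
    # 4. Min-max scale numerical columns into [0, 1].
    for col in df.select_dtypes(include="number").columns:
        rng = df[col].max() - df[col].min()
        if rng > 0:
            df[col] = (df[col] - df[col].min()) / rng
    return df
```

In practice the model-based imputation would replace the median fill, but the overall order (impute, drop outliers, encode, scale) is the same.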
A Genetic Algorithm (GA) is an evolutionary optimization technique inspired by natural selection. In this project it selects the most relevant features for customer classification, improving model efficiency and accuracy.
Chromosome Representation: Each chromosome represents a potential feature subset, where each gene is a binary value (1 = selected, 0 = not selected).
Example:

    Chromosome: [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    Features:    A  B  C  D  E  F  G  H  I  J

Here, features A, C, D, F, I, and J are selected.
Population Initialization: A set of random chromosomes (feature subsets) is generated to form the initial population.
Example:

    [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
    [1, 1, 0, 0, 1, 1, 1, 0, 0, 1]
Fitness Function: Evaluates the performance of each chromosome using a classification model. The F1-score and accuracy are used as fitness metrics.
Example:
    Chromosome: [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    Fitness Score: 0.82
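A fitness function of this shape might look like the sketch below, built on scikit-learn. The classifier choice (RandomForest), `cv=3`, and macro-averaged F1 are assumptions standing in for the project's exact setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def fitness(chromosome, X, y, cv=3):
    """Mean cross-validated macro F1 of the feature subset encoded by `chromosome`.
    Classifier and CV settings are illustrative assumptions."""
    mask = np.asarray(chromosome, dtype=bool)
    if not mask.any():  # an empty feature subset is invalid
        return 0.0
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    scores = cross_val_score(clf, X[:, mask], y, cv=cv, scoring="f1_macro")
    return scores.mean()
```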
Selection strategies determine which chromosomes will be carried forward to the next generation based on their fitness scores:
Tournament Selection: Randomly selects a group of individuals; the best among them is chosen for reproduction.
Example:

    Candidates: [0.72, 0.65, 0.82]
    Selected:   0.82 (best fitness)
Roulette Wheel Selection: Assigns each chromosome a selection probability proportional to its fitness, so fitter solutions are more likely to reproduce.
Example:

    Chromosome 1 (fitness 0.82) -> 37%
    Chromosome 2 (fitness 0.72) -> 33%
    Chromosome 3 (fitness 0.65) -> 30%
    Selected: Chromosome 1
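Both strategies can be sketched in a few lines of Python; the tournament size `k=3` is an assumed default.

```python
import random

def tournament_select(population, fitnesses, k=3):
    # Draw k random candidates; the fittest of them wins.
    idx = random.sample(range(len(population)), k)
    best = max(idx, key=lambda i: fitnesses[i])
    return population[best]

def roulette_select(population, fitnesses):
    # Selection probability proportional to each chromosome's fitness share.
    total = sum(fitnesses)
    pick = random.uniform(0, total)
    cumulative = 0.0
    for individual, fit in zip(population, fitnesses):
        cumulative += fit
        if cumulative >= pick:
            return individual
    return population[-1]  # guard against floating-point rounding
```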
Crossover is the genetic operation that combines two parent chromosomes to create offspring, encouraging diversity in the population:
Multi-point Crossover: Multiple crossover points are chosen at random, and the segments between them are exchanged between parents.
Example:

    Parent 1:  [1, 0, 1, | 1, 0, 1, | 0, 0, 1, 1]
    Parent 2:  [0, 1, 0, | 0, 1, 0, | 1, 1, 0, 0]
    Offspring: [1, 0, 1, | 0, 1, 0, | 0, 0, 1, 1]
Uniform Crossover: Each gene is independently inherited from one of the two parents with equal probability.
Example:

    Parent 1:  [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    Parent 2:  [0, 1, 0, 0, 1, 0, 1, 1, 0, 0]
    Offspring: [1, 1, 1, 0, 0, 1, 1, 0, 1, 0]
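Both operators can be sketched as follows; the two-point default in `multipoint_crossover` is an assumption.

```python
import random

def multipoint_crossover(p1, p2, n_points=2):
    # Cut at n_points random loci and alternate segments between the parents.
    points = sorted(random.sample(range(1, len(p1)), n_points))
    child, source, prev = [], 0, 0
    for cut in points + [len(p1)]:
        parent = p1 if source == 0 else p2
        child.extend(parent[prev:cut])
        source ^= 1  # switch parent at each crossover point
        prev = cut
    return child

def uniform_crossover(p1, p2):
    # Each gene comes from either parent with probability 0.5.
    return [g1 if random.random() < 0.5 else g2 for g1, g2 in zip(p1, p2)]
```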
Mutation introduces randomness to avoid premature convergence to local optima. It works by flipping the bit values of selected genes (changing 0 to 1 or vice versa), ensuring genetic diversity.
Example:
Before Mutation: [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
After Mutation: [1, 1, 1, 0, 0, 1, 0, 0, 1, 1]
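Bit-flip mutation is a one-liner; the per-gene rate of 0.1 below is an assumed default, not the project's tuned value.

```python
import random

def mutate(chromosome, rate=0.1):
    # Flip each bit independently with probability `rate`.
    return [1 - g if random.random() < rate else g for g in chromosome]
```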
The algorithm continues evolving until one of the following conditions is met:
- The population converges, meaning there is minimal improvement in the fitness score over multiple generations.
- A predefined maximum number of generations is reached.
This Genetic Algorithm framework efficiently identifies the most relevant features, optimizing classification performance while reducing the dataset's dimensionality.
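Putting the pieces together, the evolutionary loop with both stopping conditions can be sketched as below. The population size, size-2 tournament, mutation rate, and `patience` threshold are illustrative assumptions, and the test's toy fitness (agreement with a hidden target mask) stands in for the classifier-based score.

```python
import random

def run_ga(fitness, n_genes, pop_size=20, max_gen=100, patience=10, mut_rate=0.05):
    # Random initial population of binary chromosomes.
    pop = [[random.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best, best_fit, stale = None, float("-inf"), 0
    for _ in range(max_gen):
        fits = [fitness(c) for c in pop]
        top = max(range(pop_size), key=fits.__getitem__)
        if fits[top] > best_fit:
            best, best_fit, stale = pop[top][:], fits[top], 0
        else:
            stale += 1                # no improvement this generation
        if stale >= patience:         # convergence: fitness has plateaued
            break
        def select():                 # size-2 tournament selection
            i, j = random.randrange(pop_size), random.randrange(pop_size)
            return pop[i] if fits[i] >= fits[j] else pop[j]
        next_pop = []
        for _ in range(pop_size):
            a, b = select(), select()
            # Uniform crossover followed by bit-flip mutation.
            child = [ga if random.random() < 0.5 else gb for ga, gb in zip(a, b)]
            next_pop.append([1 - g if random.random() < mut_rate else g for g in child])
        pop = next_pop
    return best, best_fit
```

The loop tracks the best chromosome seen so far and stops early once `patience` generations pass without improvement, mirroring the two termination conditions above.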
Feature selection plays a crucial role in improving classification accuracy and reducing computational complexity. This project utilizes a Genetic Algorithm to identify the most informative features, removing irrelevant or redundant ones.
Feature Importance Analysis: The selected features are analyzed to determine their impact on classification performance using metrics such as information gain and correlation analysis.
Distribution Charts for Key Features: Visual representations such as histograms and boxplots are used to explore the statistical distribution of important features, ensuring they contribute effectively to segmentation.
Performance Comparison: Different feature subsets are evaluated using cross-validation to compare classification accuracy, F1-score, and other metrics. The GA-selected features are compared against full feature sets and randomly selected subsets to demonstrate the effectiveness of the approach.
Optimal Feature Subset Selection: After multiple iterations, the best-performing feature subset is identified, balancing model complexity and prediction accuracy.
These evaluation methods ensure that the final selected features provide the highest predictive power while reducing noise in the dataset.
- Feature Importance Analysis: Identifies the most influential features contributing to classification.
- Distribution Charts for Key Features: Visualizes feature distributions to understand their impact.
- Performance Comparison of different selection & crossover combinations to determine the optimal approach.
The Genetic Algorithm-based feature selection was tested on different feature subsets to evaluate its impact on classification performance. The results below highlight the effectiveness of selecting optimal features versus using all features.
| Number of Features | Avg F1 (CV, %) | Avg Acc (CV, %) | Test F1 (%) | Test Acc (%) |
|---|---|---|---|---|
| Top Three Features | 50.26 | 49.95 | 31.63 | 31.37 |
| Top Five Features | 49.74 | 49.95 | 33.18 | 33.16 |
| Top Seven Features | 49.86 | 49.97 | 32.57 | 32.70 |
| All Features | 49.05 | 49.11 | 31.66 | 31.45 |
- Top Five Features achieved the best Test Accuracy (33.16%) and Test F1 Score (33.18%).
- Using all features does not necessarily lead to the best results, highlighting the importance of feature selection.
- Tournament Selection + Uniform Crossover provided competitive results in feature selection.
- Sayyed Hossein Hosseini DolatAbadi
- Mohammad Amin Nasiri
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or collaborations, reach out via:
- Email: s.hossein.hosseini1381@gmail.com
- GitHub Issues: Open an issue in the repository.