
πŸš€ Genetic Algorithm for Optimized Customer Classification

πŸ“Œ Project Overview

This project applies Genetic Algorithms (GA) to optimize customer classification by selecting the most relevant features. The dataset undergoes preprocessing, and various GA strategies (selection, crossover, mutation) are tested to enhance classification accuracy.

πŸ“š Libraries Used

This project leverages the following Python libraries:

  • NumPy: Fast numerical calculations and matrix operations.
  • Pandas: Data manipulation and preprocessing.
  • Seaborn: Advanced data visualization.
  • Matplotlib: Data visualization and graph plotting.
  • Scikit-Learn: Machine learning utilities for classification and model evaluation.
  • Imbalanced-Learn: Handling imbalanced datasets.
  • Itertools: Built-in Python module for iterator and combinatorial operations.
  • Random: Built-in Python module supplying the randomness used in the GA's selection, crossover, and mutation operators.
  • Tabulate: Creating formatted tables for result presentation.

πŸ“‚ Datasets Introduction

The dataset used in this project focuses on customer segmentation classification and includes various features related to demographic and behavioral aspects of customers. The goal is to classify customers into different segments based on their attributes.

πŸ”Ή Input Data

The dataset consists of the following key input features:

  1. Customer Demographic Information:

    • Gender: Categorical feature indicating "Male" or "Female".
    • Ever Married: Binary feature ("Yes" or "No") indicating marital status.
    • Age: Numerical feature representing customer age.
    • Graduated: Binary feature ("Yes" or "No") indicating if the customer has completed higher education.
    • Profession: Categorical feature representing the customer's profession (e.g., "Engineer", "Artist", "Doctor", etc.).
    • Work Experience: Numerical feature indicating the number of years the customer has worked.
    • Family Size: Numerical feature representing the total number of family members.
  2. Customer Behavioral Information:

    • Spending Score: Categorical feature indicating the spending behavior of the customer ("Low", "Medium", "High").
    • Var_1: An anonymized categorical feature representing an unknown segmentation attribute, possibly a population grouping.
  3. Target Variable:

    • Segmentation: The class label categorizing customers into four segments ("A", "B", "C", "D").

The dataset is preprocessed to handle missing values, remove outliers, encode categorical features, and normalize numerical features before applying the Genetic Algorithm for optimized feature selection and classification.

πŸ›  Data Preprocessing

Data preprocessing ensures that the dataset is clean and suitable for feature selection and model training. The following steps are performed:

  • Handling Missing Values: Missing values are imputed using appropriate strategies:

    • Numerical Features: Imputed by training a RandomForestRegressor or RandomForestClassifier (depending on the feature's type) to predict the missing values from patterns in the observed data.
    • Categorical Features: Filled using the most frequent category (mode) or a separate "Unknown" category.
  • Removing Outliers: Outliers are detected and removed to enhance model stability:

    • Z-score Method: Identifies values beyond a specified standard deviation threshold.
    • Interquartile Range (IQR): Removes values outside the acceptable range.
  • Feature Encoding: Categorical variables are transformed for machine learning algorithms:

    • One-Hot Encoding: Converts categorical features into binary vectors for nominal categories.
    • Label Encoding: Assigns numerical values to categorical classes where order matters.
  • Scaling & Normalization: Ensures numerical features have consistent ranges:

    • Min-Max Scaling: Scales values between 0 and 1 to preserve relationships.
    • Standardization (Z-score Normalization): Centers data with mean 0 and variance 1.

These preprocessing steps enhance feature representation, reduce bias, and improve the performance of the Genetic Algorithm in selecting optimal features for customer segmentation.
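The two scaling schemes listed above can be sketched in a few lines of plain Python. This is an illustrative toy (a hypothetical "Age" column), not the repository's actual preprocessing pipeline, which may use Scikit-Learn's scalers instead.

```python
import statistics

# Toy "Age" column standing in for a numerical feature from the dataset.
ages = [22.0, 35.0, 58.0, 41.0, 29.0]

# Min-Max scaling: map values into [0, 1] while preserving their order.
lo, hi = min(ages), max(ages)
minmax = [(a - lo) / (hi - lo) for a in ages]

# Standardization (z-score): center at mean 0 with unit variance.
mu = statistics.fmean(ages)
sigma = statistics.pstdev(ages)  # population standard deviation
zscores = [(a - mu) / sigma for a in ages]

print(min(minmax), max(minmax))  # 0.0 1.0
```

In a real pipeline the same transformation fitted on the training split must be reused on the test split to avoid leakage.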

🧬 Genetic Algorithm Implementation

A Genetic Algorithm (GA) is an evolutionary optimization technique inspired by natural selection. In this project, a GA selects the most relevant features for customer classification, improving both model efficiency and accuracy.

πŸ”Ή Algorithm Components

  • Chromosome Representation: Each chromosome represents a potential feature subset, where each gene is a binary value (1 for selected, 0 for not selected).

    Example:

    Chromosome: [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    Features:    A  B  C  D  E  F  G  H  I  J
    

    Here, features A, C, D, F, I, and J are selected.

  • Population Initialization: A set of random chromosomes (feature subsets) is generated to form the initial population.

    Example:

    Population:
    [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
    [1, 1, 0, 0, 1, 1, 1, 0, 0, 1]
    
  • Fitness Function: Evaluates the performance of each chromosome using a classification model. The F1-score and accuracy are used as fitness metrics.

    Example:

    Chromosome: [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    Fitness Score: 0.82
    

πŸ”Ή Selection Strategies

Selection strategies determine which chromosomes will be carried forward to the next generation based on their fitness scores:

  • Tournament Selection: Randomly selects a group of individuals, and the best among them is chosen for reproduction.

    Example:

    Tournament Selection:
    Candidates: 
    [0.72, 0.65, 0.82]
    Selected: 0.82 (Best fitness)
    
  • Roulette Wheel Selection: Assigns a probability of selection to each chromosome based on its fitness, ensuring better solutions have a higher chance of reproduction.

    Example:

    Roulette Wheel Probabilities (fitness / total fitness):
    Chromosome 1 (0.82) -> 37%
    Chromosome 2 (0.72) -> 33%
    Chromosome 3 (0.65) -> 30%
    Selected: Chromosome 1
    

πŸ”Ή Crossover Strategies

Crossover is the genetic operation that combines two parent chromosomes to create offspring, encouraging diversity in the population:

  • Multi-point Crossover: Multiple points are randomly chosen to exchange genetic material between parents.

    Example:

    Parent 1: [1, 0, 1, | 1, 0, 1, | 0, 0, 1, 1]
    Parent 2: [0, 1, 0, | 0, 1, 0, | 1, 1, 0, 0]
    Offspring: [1, 0, 1, | 0, 1, 0, | 0, 0, 1, 1]
    
  • Uniform Crossover: Each gene is independently inherited from one of the parents with equal probability.

    Example:

    Parent 1:  [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    Parent 2:  [0, 1, 0, 0, 1, 0, 1, 1, 0, 0]
    Offspring: [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
    

πŸ”Ή Mutation Strategy

Mutation introduces randomness to avoid premature convergence to local optima. It works by flipping the bit values of selected genes (changing 0 to 1 or vice versa), ensuring genetic diversity.

Example:

Before Mutation: [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
After Mutation:  [1, 1, 1, 0, 0, 1, 0, 0, 1, 1]
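Bit-flip mutation is a one-liner; the mutation rate below (0.2) is an illustrative value, not necessarily the one used in the project.

```python
import random

def mutate(chromosome, rate, rng):
    """Flip each gene (0 <-> 1) independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chromosome]

rng = random.Random(7)
before = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
after = mutate(before, rate=0.2, rng=rng)
print(len(after))  # same length as before; only some bits flipped
```

A small rate keeps mutation a background source of diversity; a rate near 0.5 would degenerate into random search.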

πŸ”Ή Stopping Criteria

The algorithm continues evolving until one of the following conditions is met:

  • The population converges, meaning there is minimal improvement in the fitness score over multiple generations.
  • A predefined maximum number of generations is reached.

This Genetic Algorithm framework efficiently identifies the most relevant features, optimizing the classification performance while reducing the dataset’s dimensionality.
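Putting the pieces together, the evolution loop with both stopping criteria might look like the sketch below. All parameter values (population size, patience, mutation rate) are illustrative assumptions, and the toy OneMax fitness stands in for the project's classifier-based fitness.

```python
import random

def evolve(fitness, n_genes, pop_size=20, max_generations=50,
           patience=10, tol=1e-6, seed=0):
    """Minimal GA loop: tournament selection, uniform crossover,
    bit-flip mutation; stops on convergence or at the generation cap."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best, best_fit, stale = None, float("-inf"), 0
    for _ in range(max_generations):
        fits = [fitness(c) for c in pop]
        gen_best = max(fits)
        if gen_best > best_fit + tol:
            best, best_fit, stale = pop[fits.index(gen_best)][:], gen_best, 0
        else:
            stale += 1
            if stale >= patience:   # converged: no meaningful improvement
                break
        next_pop = []
        while len(next_pop) < pop_size:
            p1 = max(rng.sample(range(pop_size), 3), key=lambda i: fits[i])
            p2 = max(rng.sample(range(pop_size), 3), key=lambda i: fits[i])
            child = [a if rng.random() < 0.5 else b
                     for a, b in zip(pop[p1], pop[p2])]
            child = [1 - g if rng.random() < 0.05 else g for g in child]
            next_pop.append(child)
        pop = next_pop
    return best, best_fit

# Toy fitness: reward chromosomes with many 1-bits (OneMax).
best, score = evolve(lambda c: sum(c), n_genes=10)
print(score)
```

Swapping the toy fitness for "train a classifier on the selected features and return its F1-score" turns this loop into the feature-selection procedure described above.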

πŸ“ˆ Feature Selection & Model Evaluation

Feature selection plays a crucial role in improving classification accuracy and reducing computational complexity. This project utilizes a Genetic Algorithm to identify the most informative features, removing irrelevant or redundant ones.

  • Feature Importance Analysis: The selected features are analyzed to determine their impact on classification performance using metrics such as information gain and correlation analysis.

  • Distribution Charts for Key Features: Visual representations such as histograms and boxplots are used to explore the statistical distribution of important features, ensuring they contribute effectively to segmentation.

  • Performance Comparison: Different feature subsets are evaluated using cross-validation to compare classification accuracy, F1-score, and other metrics. The GA-selected features are compared against full feature sets and randomly selected subsets to demonstrate the effectiveness of the approach.

  • Optimal Feature Subset Selection: After multiple iterations, the best-performing feature subset is identified, balancing model complexity and prediction accuracy.

These evaluation methods ensure that the final selected features provide the highest predictive power while reducing noise in the dataset.


πŸ“Š Results & Insights

The Genetic Algorithm-based feature selection was tested on different feature subsets to evaluate its impact on classification performance. The results below highlight the effectiveness of selecting optimal features versus using all features.

πŸ”₯ Final Results Table

Feature Subset     | Avg F1 (CV) | Avg Acc (CV) | Test F1 | Test Acc
-------------------|-------------|--------------|---------|---------
Top Three Features | 50.26       | 49.95        | 31.63   | 31.37
Top Five Features  | 49.74       | 49.95        | 33.18   | 33.16
Top Seven Features | 49.86       | 49.97        | 32.57   | 32.70
All Features       | 49.05       | 49.11        | 31.66   | 31.45

All values are percentages.

πŸ” Key Insights

  • Top Five Features achieved the best Test Accuracy (33.16%) and Test F1 Score (33.18%).
  • Using all features does not necessarily lead to the best results, highlighting the importance of feature selection.
  • Tournament Selection + Uniform Crossover provided competitive results in feature selection.

🀝 Contributors

  • Sayyed Hossein Hosseini DolatAbadi
  • Mohammad Amin Nasiri

πŸ›‘ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ“¬ Contact

For questions or collaborations, reach out via:
