Day 24 - K-Means Clustering to Segment Customers Based on Behavior

Today’s challenge was to use K-Means clustering to segment customers based on features such as age, annual income, and spending score. This technique is widely used in customer segmentation to create distinct groups of customers with similar behavior, helping businesses tailor their marketing strategies and improve the customer experience.

If you want to see the code, you can find it here: GIT REPO.

Dataset:

For this project, I used the Mall Customer Segmentation Dataset from Kaggle. The dataset contains the following columns:

  • CustomerID: Unique identifier for each customer.
  • Gender: Gender of the customer (Male/Female).
  • Age: Customer’s age.
  • Annual Income (k$): The customer’s annual income in thousands.
  • Spending Score (1-100): A score assigned based on the customer’s spending behavior.

Steps for Implementation:

  1. Data Preprocessing:

    • Since K-Means relies on numerical features, drop the CustomerID column and either drop or encode the Gender column (the implementation below encodes it).
    • Scale the features, as K-Means clustering is sensitive to the scale of data.
  2. Applying K-Means Clustering:

    • Use the Elbow Method to find the optimal number of clusters.
    • Perform K-Means clustering on the features Gender, Age, Annual Income, and Spending Score.
    • Visualize the clusters.
  3. Interpret the Clusters:

    • Understand what each cluster represents (e.g., high spenders vs. low spenders).

Here’s how you can implement it:


import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns


# Step 1: Load the data
data = pd.read_csv('dataset/Mall_Customers.csv')
# print(data.info())
# print(data.head())
# Output
#  #   Column                  Non-Null Count  Dtype
# ---  ------                  --------------  -----
#  0   CustomerID              200 non-null    int64
#  1   Gender                  200 non-null    object
#  2   Age                     200 non-null    int64
#  3   Annual Income (k$)      200 non-null    int64
#  4   Spending Score (1-100)  200 non-null    int64
# dtypes: int64(4), object(1)
# memory usage: 7.9+ KB
# None
#    CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
# 0           1    Male   19                  15                      39
# 1           2    Male   21                  15                      81
# 2           3  Female   20                  16                       6
# 3           4  Female   23                  16                      77
# 4           5  Female   31                  17                      40


# Step 2: Data Preprocessing
# CustomerID is just a row identifier, so it is left out of the feature set in Step 3.
# (Note: data.drop('CustomerID', axis=1) without reassignment would be a no-op.)

# Encode Gender, male - 0, female - 1
data['Gender'] = data['Gender'].map({'Male': 0, 'Female': 1})

# Check for missing values.
# print(data.isnull().sum())  # This dataset has no missing values.

# Step 3: Select the features used for clustering
X = data[['Gender', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

# Step 4: Apply Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 5: Apply Elbow method to find the optimal number of clusters (K)
inertia = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42, n_init='auto')
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)
    # Note: n_init='auto' requires scikit-learn >= 1.2.
    # What k-means++ does: instead of picking all initial centroids uniformly at
    # random, k-means++ spreads them out. It first selects one centroid at random
    # from the data points; each subsequent centroid is then sampled with
    # probability proportional to the squared distance between a point and its
    # nearest already-chosen centroid, so far-away points are favoured.
    # (A small NumPy sketch of this seeding rule follows after the elbow plot.)

# Plot the Elbow curve
plt.figure(figsize=(7, 5))
plt.plot(range(1, 11), inertia, marker='o', linestyle='--')
plt.title('Elbow Curve')
plt.xlabel('K value')
plt.ylabel('Inertia')
plt.show()
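
# For intuition, here is a minimal NumPy sketch of the k-means++ seeding rule
# described in Step 5 (an illustrative addition; the pipeline itself uses
# scikit-learn's built-in init='k-means++'):
import numpy as np

def kmeans_pp_init(X, k, seed=42):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform random point
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid.
        d2 = ((X[:, None, :] - np.asarray(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centroids)

# Example: kmeans_pp_init(X_scaled, 4) returns 4 well-spread starting centroids.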

# Step 6: Apply K-Means with the optimal number of clusters (K = 4)
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, init='k-means++', random_state=42, n_init='auto')
y_kmeans = kmeans.fit_predict(X_scaled)

# Step 7: Visualize the clusters
plt.figure(figsize=(10, 7))
# Plot the standardized Annual Income (x-axis) against the standardized Spending Score (y-axis).
sns.scatterplot(x=X_scaled[:, 2], y=X_scaled[:, 3], hue=y_kmeans, palette="viridis", s=100)
plt.title('Customer Segments (K-Means Clusters)')
plt.xlabel('Annual Income (standardized)')
plt.ylabel('Spending Score (standardized)')
plt.show()

# Step 8: Add the cluster label for each row to the original data
data['Clusters'] = y_kmeans
print(data)

# Output
#      CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)  Clusters
# 0             1       0   19                  15                      39         2
# 1             2       0   21                  15                      81         2
# 2             3       1   20                  16                       6         3
# 3             4       1   23                  16                      77         3
# 4             5       1   31                  17                      40         3
# ..          ...     ...  ...                 ...                     ...       ...
# 195         196       1   35                 120                      79         3
# 196         197       1   45                 126                      28         1
# 197         198       0   32                 126                      74         2
# 198         199       0   32                 137                      18         1
# 199         200       0   30                 137                      83         2
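
To make Step 3 (“Interpret the Clusters”) concrete, a quick per-cluster summary of the features is a natural starting point. A minimal sketch, reusing the data frame built above:

# Average profile per segment, e.g. to spot high-income/high-spending customers.
profile = data.groupby('Clusters')[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].mean()
print(profile.round(1))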

Explanation:

  1. Feature Scaling:
    • We scale the features because K-Means uses Euclidean distance, and differences in the scale of features can impact clustering results.
  2. Elbow Method:
    • We use the Elbow Method to determine the optimal number of clusters by plotting the within-cluster sum of squares (WCSS) and looking for the “elbow” point, where adding more clusters doesn’t significantly improve the fit.
  3. K-Means Clustering:
    • We apply K-Means clustering with the number of clusters determined from the elbow plot (in this case, K = 4).
  4. Visualization:
    • A scatter plot is used to visualize the clusters formed by K-Means, plotting the standardized Annual Income against the standardized Spending Score, with different colors representing different customer segments.
  5. Cluster Assignment:
    • Finally, the cluster labels are added to the original dataset, allowing you to analyze and interpret the customer segments; the sketch below maps the fitted centroids back to original units to help with that interpretation.
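
Because the clustering ran on standardized features, the fitted centroids live in standardized space. A small sketch (assuming the scaler, kmeans, and X objects from the code above) that maps them back to original units:

# Convert each cluster centre back to original units (age, income in k$, score).
centers = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns=X.columns)
print(centers.round(1))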

Here’s how the graphs look:

[Figure: Day 24 Elbow Graph]
[Figure: Day 24 Scatter Plot]

Gratitude

This is the second problem involving K-Means. I felt confident with the topic, and I’m really excited about my progress. Looking forward to the next day!

Stay Tuned!
