Day 14 - 30 Days 30 ML Projects: Cluster Grocery Store Customers With K-Means

The task for Day 14 is to use K-Means clustering to segment grocery store customers based on their purchasing history. Clustering helps businesses identify customer groups with similar buying habits, making it easier to create targeted marketing strategies and personalized customer experiences.

If you want to see the code, you can find it here: GIT REPO.

Understanding the Data

We used the Wholesale Customers Dataset from UCI’s Machine Learning Repository. The dataset includes features like:

  • Fresh: Annual spending on fresh products (fruit, vegetables, etc.)
  • Milk: Annual spending on milk products
  • Grocery: Annual spending on groceries
  • Frozen: Annual spending on frozen products
  • Detergents_Paper: Annual spending on detergents and paper
  • Delicatessen: Annual spending on delicatessen products

We used these features to cluster customers into different segments based on their purchasing behavior.

Download the dataset and place it in the dataset directory of your project.

Code Workflow

Here’s the step-by-step process we followed:

  • Load the Data
  • Preprocess the Data
  • Use the Elbow Method to Find the Optimal K
  • Train K-Means with the Optimal K
  • Visualize the Clusters with PCA

Step 1: Load the Data

We started by loading the Wholesale Customers Dataset into a pandas DataFrame:

import pandas as pd

data = pd.read_csv('dataset/wholesale_customers_data.csv')
print(data.head())

Step 2: Preprocess the Data

We checked for missing values and then scaled the data so that all features contribute equally. K-Means relies on Euclidean distances, so features with larger variances would otherwise dominate the clustering; we applied StandardScaler to standardize each feature:

from sklearn.preprocessing import StandardScaler

# Verify there are no missing values before scaling
print(data.isnull().sum())

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

Step 3: Find the Optimal K with the Elbow Method

To determine the optimal number of clusters (K), we used the Elbow Method. It computes the sum of squared distances from each point to its assigned cluster center (inertia) and plots it against a range of K values. The point where adding more clusters stops reducing inertia substantially, the "elbow" of the curve, suggests the optimal K.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

sum_sq_dist_pt = [] # Sum of squared distances (inertia) for each K
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
    kmeans.fit(data_scaled)
    sum_sq_dist_pt.append(kmeans.inertia_)

# Plot the Elbow curve
plt.figure(figsize=(7, 5))
plt.plot(range(1, 11), sum_sq_dist_pt, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()
Day 14 Elbow Curve

From the elbow plot, we observed that K=3 was a good choice.

Step 4: Train the Model

We selected K=3 based on the Elbow Method and retrained the K-Means algorithm on the scaled data:

k_optimal = 3
kmeans = KMeans(n_clusters=k_optimal, random_state=42, n_init='auto')
kmeans.fit(data_scaled)

# Add the cluster label to the original dataset
data['cluster'] = kmeans.labels_
print(data.head())

Step 5: Visualize the Clusters Using PCA

Using Principal Component Analysis (PCA), we reduced the high-dimensional data to two principal components for visualization:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
data_pca = pca.fit_transform(data_scaled)

plt.figure(figsize=(10,7))
plt.scatter(data_pca[:, 0], data_pca[:, 1], c=kmeans.labels_, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-Means Clustering with PCA')
plt.show()
Day 14 Cluster Plot Chart
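One caveat with this plot: two principal components only capture part of the information in six features, and pca.explained_variance_ratio_ reports exactly how much. A quick sketch on synthetic stand-in data (random in place of the real scaled matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 6-feature matrix standing in for the scaled wholesale data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)
print(f"The 2D plot retains {pca.explained_variance_ratio_.sum():.0%} of the total variance")
```

If the retained fraction is low, clusters that look overlapped in the 2D plot may still be well separated in the full feature space.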

Model Performance

Using K-Means Clustering and PCA visualization, we successfully segmented grocery store customers into 3 distinct clusters based on their purchase behavior. Each cluster represents a unique group of customers with similar spending patterns, which can be useful for targeted marketing or customer service strategies.

Gratitude

This project was a great introduction to unsupervised learning with K-Means and using the Elbow Method to find the optimal number of clusters. Learning how to visualize high-dimensional data with PCA also deepened my understanding of data representation. Looking forward to Day 15!

Stay tuned!
