Day 14 - 30 Days 30 ML Projects: Cluster Grocery Store Customers With K-Means
The task for Day 14 is to use K-Means clustering to segment grocery store customers based on their purchasing history. Clustering helps businesses identify customer groups with similar buying habits, making it easier to create targeted marketing strategies and personalized customer experiences.
If you want to see the code, you can find it here: GIT REPO.
Understanding the Data
We used the Wholesale Customers Dataset from UCI’s Machine Learning Repository. The dataset includes features like:
- Fresh: Annual spending on fresh products (fruit, vegetables, etc.)
- Milk: Annual spending on milk products
- Grocery: Annual spending on groceries
- Frozen: Annual spending on frozen products
- Detergents_Paper: Annual spending on detergents and paper
- Delicatessen: Annual spending on delicatessen products
We used these features to cluster customers into different segments based on their purchasing behavior.
Download the CSV and place it in the dataset directory of your project.
Code Workflow
Here’s the step-by-step process we followed:
- Load the Data
- Preprocess the Data
- Use the Elbow Method to Find the Optimal K
- Train the K-Means Model with the Optimal K
- Visualize the Clusters with PCA
Step 1: Load the Data
We started by loading the Wholesale Customers Dataset into a pandas DataFrame:
import pandas as pd
data = pd.read_csv('dataset/wholesale_customers_data.csv')
print(data.head())
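Note: the copy of this dataset on the UCI repository also includes Channel and Region columns. If your CSV has them, an optional cleanup step (an assumption about your copy of the file, not part of the original workflow) is to drop them so that only the six spending features are clustered:
# Optional: drop identifier columns if they exist in your copy of the CSV,
# so clustering uses only the spending features (adjust as needed for your file).
data = data.drop(columns=['Channel', 'Region'], errors='ignore')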
Step 2: Preprocess the Data
We checked for missing values and then scaled the data so that all features are on a comparable scale. K-Means is distance-based, so features with large variances would dominate the clustering; we applied StandardScaler to standardize the data:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
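The missing-value check mentioned above is not shown in the snippet; a minimal sketch of it, run before scaling, could be:
# Quick missing-value check referenced in the text (run before fit_transform)
print(data.isnull().sum())  # expect all zeros for a complete copy of the file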
Step 3: Find the Optimal K with the Elbow Method
To determine the optimal number of clusters (K), we used the Elbow Method. This method computes the sum of squared distances from each point to its assigned cluster center (inertia) and plots it against increasing values of K. The “elbow” of the curve, where inertia stops dropping sharply, marks a good choice for K.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
sum_sq_dist_pt = []  # Inertia (sum of squared distances) for each K
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
    kmeans.fit(data_scaled)
    sum_sq_dist_pt.append(kmeans.inertia_)
# Plot the Elbow curve
plt.figure(figsize=(7, 5))
plt.plot(range(1, 11), sum_sq_dist_pt, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()
From the elbow plot, we observed that K=3 was a good choice.
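The elbow is ultimately a visual judgment call. As an optional numeric sanity check that is not part of the original workflow, the silhouette score from scikit-learn can be compared across a few candidate values of K (higher means better-separated clusters):
from sklearn.metrics import silhouette_score
# Optional check of the visually chosen K; higher silhouette = better-separated clusters
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=42, n_init='auto').fit_predict(data_scaled)
    print(k, round(silhouette_score(data_scaled, labels), 3))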
Step 4: Train the Model
We selected K=3 based on the Elbow Method and retrained the K-Means algorithm on the scaled data:
k_optimal = 3
kmeans = KMeans(n_clusters=k_optimal, random_state=42, n_init='auto')
kmeans.fit(data_scaled)
# Add the cluster label to the original dataset
data['cluster'] = kmeans.labels_
print(data.head())
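To get a feel for what each segment buys, one simple option (not shown in the original post) is to average the spending features per cluster with pandas:
# Average annual spend per cluster; each row profiles one customer segment
print(data.groupby('cluster').mean().round(1))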
Step 5: Visualize the Clusters Using PCA
Using Principal Component Analysis (PCA), we reduced the high-dimensional data to two principal components for visualization:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
data_pca = pca.fit_transform(data_scaled)
plt.figure(figsize=(10,7))
plt.scatter(data_pca[:, 0], data_pca[:, 1], c=kmeans.labels_, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-Means Clustering with PCA')
plt.show()
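Because PCA compresses six spending features into two components, it is worth checking how much of the total variance the 2D plot actually preserves; a quick look, reusing the pca object fitted above, is:
# Fraction of the total variance captured by each of the two principal components
print(pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum())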
Model Performance
Using K-Means Clustering and PCA visualization, we successfully segmented grocery store customers into 3 distinct clusters based on their purchase behavior. Each cluster represents a unique group of customers with similar spending patterns, which can be useful for targeted marketing or customer service strategies.
Gratitude
This project was a great introduction to unsupervised learning with K-Means and using the Elbow Method to find the optimal number of clusters. Learning how to visualize high-dimensional data with PCA also deepened my understanding of data representation. Looking forward to Day 15!
Stay tuned!
Posts in this series
- Day 26- Time Series Forecasting of Electricity Consumption Using LSTM (Intro to Deep Learning)
- Day 25 - Sentiment Analysis of Customer Reviews Using Traditional NLP Techniques
- Day 24 - K-Means Clustering to Segment Customers Based on Behavior
- Day 23 - Fraud Detection in Financial Transactions Using Logistic Regression and Random Forest
- Day 22 - Recommender System With Matrix Factorization
- Day 21 - Deploy a Machine Learning Model Using FastAPI and Heroku for Real-Time Predictions
- Day 20 - 30 Days 30 ML Projects: Create a Topic Model Using Latent Dirichlet Allocation (LDA)
- Day 19 - 30 Days 30 ML Projects: Customer Churn Prediction With XGBoost
- Day 18 - 30 Days 30 ML Projects: Time Series Forecasting of Stock Prices With ARIMA Model
- Day 17 - 30 Days 30 ML Projects: Predict Diabetes Onset Using Decision Trees and Random Forests
- Day 16 - 30 Days 30 ML Projects: Real-Time Face Detection in a Webcam Feed Using OpenCV
- Day 15 - 30 Days 30 ML Projects: Predict House Prices With XGBoost
- Day 14 - 30 Days 30 ML Projects: Cluster Grocery Store Customers With K-Means
- Day 13 - 30 Days 30 ML Projects: Build a Music Genre Classifier Using Audio Features Extraction
- Day 12 - 30 Days 30 Machine Learning Projects Challenge
- Day 11 - 30 Days 30 Machine Learning Projects: Anomaly Detection With Isolation Forest
- Day 10 - 30 Days 30 Machine Learning Projects: Recommender System Using Collaborative Filtering
- Day 9 - 30 Days 30 Machine Learning Projects
- Day 8 - 30 Days 30 Machine Learning Projects
- Day 7 - 30 Days 30 Machine Learning Projects
- Day 6 - 30 Days 30 Machine Learning Projects
- Day 5 - 30 Days 30 Machine Learning Projects
- Day 4 - 30 Days 30 Machine Learning Projects
- Day 3 - 30 Days 30 Machine Learning Projects
- Day 2 - 30 Days 30 Machine Learning Projects
- Day 1 - 30 Days 30 Machine Learning Projects
- 30 Days 30 Machine Learning Projects Challenge