Day 19 - 30 Days 30 ML Projects: Customer Churn Prediction With XGBoost
Today’s challenge was about predicting customer churn, a crucial task for businesses that want to identify customers likely to cancel their subscriptions. We used an XGBoost classifier, an ensemble of gradient-boosted decision trees that is a strong default for tabular data like this.
If you want to see the code, you can find it here: GIT REPO.
Understanding the Data
We used a Telco Customer Churn dataset from Kaggle, which contains customer information such as gender, tenure, contract type, and monthly charges. The target variable is Churn, indicating whether the customer churned (Yes/No). Here’s a quick look at the data:
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | Contract | MonthlyCharges | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7590-VHVEG | Female | 0 | Yes | No | 1 | Month-to-month | 29.85 | No |
Code Walkthrough
Let’s import the libraries first:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
Step 1: Load the Data
data = pd.read_csv('dataset/telco_customer_churn.csv')
We loaded the data and took a look at its structure.
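If you want to take the same look, here’s a minimal inspection sketch (column names follow the Kaggle Telco file):
# Inspect shape, dtypes, and the first few rows
print(data.shape)
print(data.dtypes)   # TotalCharges comes in as object (text), not float
print(data.head())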
Step 2: Handle Missing Values
In the raw CSV, the TotalCharges column is stored as text and contains a few blank entries. Coercing it to numeric turns those blanks into NaN, which we then fill with the column median for simplicity.
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].median())
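To see how many rows the coercion actually affected, you can count the NaNs before filling. A small sanity-check sketch; in the original Kaggle file only a handful of rows are blank:
# Run this between the coercion and the fill to count failed conversions
print(data['TotalCharges'].isna().sum())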
Step 3: Encode the Categorical Data
To prepare the categorical data, we used a hybrid encoding approach:
- Label Encoding for binary features (like gender, Partner, Dependents).
- One-Hot Encoding for multi-category features (like Contract, PaymentMethod).
# Label Encoding for binary columns
label_enc = LabelEncoder()
binary_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn']
for col in binary_cols:
    data[col] = label_enc.fit_transform(data[col])
# One-Hot Encoding for multi-category columns
multi_cat_cols = ['Contract', 'PaymentMethod', 'InternetService']
data_encoded = pd.get_dummies(data, columns=multi_cat_cols, drop_first=True)
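The ColumnTransformer and OneHotEncoder we imported earlier can do the same job in a more pipeline-friendly way. Here’s a minimal alternative sketch (not what we used above; it assumes the binary columns are already label-encoded and that non-numeric identifiers like customerID are excluded):
# Alternative: one-hot encode the multi-category columns inside a ColumnTransformer,
# passing the remaining (numeric) columns through untouched
preprocessor = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(drop='first'), multi_cat_cols)],
    remainder='passthrough'
)
X_alt = preprocessor.fit_transform(data.drop(['Churn', 'customerID'], axis=1))
One reason to prefer pd.get_dummies for a quick project is that it keeps the column names, which makes the encoded DataFrame easier to inspect.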
Step 4: Split the Data into Features and Target
# XGBoost needs numeric input: drop the target and the customerID identifier,
# which is a string with no predictive value
X = data_encoded.drop(['Churn', 'customerID'], axis=1)
y = data_encoded['Churn']
We used Churn as the target and dropped it from the feature set, along with customerID, since XGBoost accepts only numeric features and an ID column carries no signal anyway.
Step 5: Split the Data into Training and Validation Sets
We split the data into 80% training and 20% validation sets.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
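Churn data is imbalanced (far more non-churners than churners), so if you want both splits to preserve the class ratio, train_test_split accepts a stratify argument. A small variant, not the split used above:
# Stratified variant: keeps the churn/no-churn ratio identical in both splits
X_train_s, X_val_s, y_train_s, y_val_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)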
Step 6: Build and Train the XGBoost Model
We built an XGBoost classifier with 100 boosting rounds and a learning rate of 0.1, then trained it on the training set.
model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)
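If you’d rather let the model decide when to stop adding trees, recent xgboost versions (1.6+) accept early_stopping_rounds in the constructor. A sketch, assuming a recent xgboost; note that whatever set drives early stopping influences training, so a separate test set is ideal for the final evaluation:
# Sketch: stop boosting once validation loss hasn't improved for 10 rounds
model_es = XGBClassifier(n_estimators=500, learning_rate=0.1,
                         early_stopping_rounds=10, random_state=42)
model_es.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)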
Step 7: Evaluate the Model
After training, we made predictions on the validation set and evaluated the model’s performance using accuracy, a confusion matrix, and a classification report.
predictions = model.predict(X_val)
accuracy = accuracy_score(y_val, predictions)
conf_matrix = confusion_matrix(y_val, predictions)
class_report = classification_report(y_val, predictions)
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")
Model Performance
The model reported an accuracy of 100% on the validation set. A perfect score on a churn dataset is unusual; in practice it almost always points to data leakage (a feature that directly encodes the target), so the feature set is worth auditing before trusting the result.
Accuracy Score: 1.0

Confusion Matrix:
[[1036    0]
 [   0  373]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1036
           1       1.00      1.00      1.00       373

    accuracy                           1.00      1409
   macro avg       1.00      1.00      1.00      1409
weighted avg       1.00      1.00      1.00      1409
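One quick way to run that audit is to check which features the model leans on; a single feature dwarfing all the others is a classic leakage red flag. A minimal sketch using XGBoost’s built-in importances:
# Rank features by importance; one dominant feature suggests leakage
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))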
Gratitude
This challenge deepened my understanding of handling customer churn data and using XGBoost for classification tasks. It was great exploring different encoding techniques to preprocess categorical data efficiently.
Stay Tuned!
Posts in this series
- Day 26- Time Series Forecasting of Electricity Consumption Using LSTM (Intro to Deep Learning)
- Day 25 - Sentiment Analysis of Customer Reviews Using Traditional NLP Techniques
- Day 24 - K-Means Clustering to Segment Customers Based on Behavior
- Day 23 - Fraud Detection in Financial Transactions Using Logistic Regression and Random Forest
- Day 22 - Recommender System With Matrix Factorization
- Day 21 - Deploy a Machine Learning Model Using FastAPI and Heroku for Real-Time Predictions
- Day 20 - 30 Days 30 ML Projects: Create a Topic Model Using Latent Dirichlet Allocation (LDA)
- Day 19 - 30 Days 30 ML Projects: Customer Churn Prediction With XGBoost
- Day 18 - 30 Days 30 ML Projects: Time Series Forecasting of Stock Prices With ARIMA Model
- Day 17 - 30 Days 30 ML Projects: Predict Diabetes Onset Using Decision Trees and Random Forests
- Day 16 - 30 Days 30 ML Projects: Real-Time Face Detection in a Webcam Feed Using OpenCV
- Day 15 - 30 Days 30 ML Projects: Predict House Prices With XGBoost
- Day 14 - 30 Days 30 ML Projects: Cluster Grocery Store Customers With K-Means
- Day 13 - 30 Days 30 ML Projects: Build a Music Genre Classifier Using Audio Features Extraction
- Day 12 - 30 Days 30 Machine Learning Projects Challenge
- Day 11 - 30 Days 30 Machine Learning Projects: Anomaly Detection With Isolation Forest
- Day 10 - 30 Days 30 Machine Learning Projects: Recommender System Using Collaborative Filtering
- Day 9 - 30 Days 30 Machine Learning Projects
- Day 8 - 30 Days 30 Machine Learning Projects
- Day 7 - 30 Days 30 Machine Learning Projects
- Day 6 - 30 Days 30 Machine Learning Projects
- Day 5 - 30 Days 30 Machine Learning Projects
- Day 4 - 30 Days 30 Machine Learning Projects
- Day 3 - 30 Days 30 Machine Learning Projects
- Day 2 - 30 Days 30 Machine Learning Projects
- Day 1 - 30 Days 30 Machine Learning Projects
- 30 Days 30 Machine Learning Projects Challenge