Day 19 - 30 Days 30 ML Projects: Customer Churn Prediction With XGBoost
Today’s challenge was predicting customer churn, a crucial task for businesses that want to identify customers who are likely to cancel their subscriptions. We used an XGBoost model, a gradient-boosted decision tree classifier, for this task.
If you want to see the code, you can find it here: GIT REPO.
Understanding the Data
We used a Telco Customer Churn dataset from Kaggle, which contains customer information such as gender, tenure, contract type, and monthly charges. The target variable is Churn, indicating whether the customer churned (Yes/No). Here’s a quick look at the data:
customerID | gender | SeniorCitizen | Partner | Dependents | tenure | Contract | MonthlyCharges | Churn
7590-VHVEG | Female | 0 | Yes | No | 1 | Month-to-month | 29.85 | No
Code Walkthrough
Let’s import the libraries first
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
Step 1: Load the Data
data = pd.read_csv('dataset/telco_customer_churn.csv')
We loaded the data and took a look at its structure.
Step 2: Handle Missing Values
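One of the columns, TotalCharges, is read in as text and contains some blank entries. Converting it to numeric with errors='coerce' turns those blanks into NaN, which we then filled with the column median for simplicity.
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
# Assign the result back instead of using inplace=True on a column selection
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].median())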
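To make the coercion step concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from the dataset; the sketch assigns the filled column back rather than using inplace=True, which newer pandas versions warn about.

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", "1889.5"]})

# Blank strings cannot be parsed, so errors="coerce" turns them into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# median() skips NaN by default, so it is computed from the parsed values only.
median_value = toy["TotalCharges"].median()
toy["TotalCharges"] = toy["TotalCharges"].fillna(median_value)

print(f"coerced to NaN: {n_missing}, filled with median: {median_value}")
```

Counting the NaN values before filling is a cheap way to confirm that only a handful of rows are affected.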
Step 3: Encode the Categorical Data
To prepare the categorical data, we used a hybrid encoding approach:
- Label Encoding for binary features (like gender, Partner, Dependents).
- One-Hot Encoding for multi-category features (like Contract, PaymentMethod).
# Label Encoding for binary columns
label_enc = LabelEncoder()
binary_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn']
for col in binary_cols:
    data[col] = label_enc.fit_transform(data[col])
# One-Hot Encoding for multi-category columns
multi_cat_cols = ['Contract', 'PaymentMethod', 'InternetService']
data_encoded = pd.get_dummies(data, columns=multi_cat_cols, drop_first=True)
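The hybrid encoding can be tried on a small example. The sketch below uses a made-up four-row frame, drops the identifier column, and then lists any columns that are still non-numeric. The identifier drop and the leftover check are additions for illustration, not steps from the walkthrough. If the full Kaggle file has text columns beyond the ones listed above (an assumption, since only part of the schema is shown here), a check like this would surface them before training.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with made-up rows; column names mirror the dataset preview.
toy = pd.DataFrame({
    "customerID": ["a-1", "b-2", "c-3", "d-4"],
    "Partner": ["Yes", "No", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})

# An identifier is unique per row and carries no general signal, so drop it.
toy = toy.drop(columns=["customerID"])

# Binary column: LabelEncoder sorts classes, so "No" -> 0 and "Yes" -> 1.
toy["Partner"] = LabelEncoder().fit_transform(toy["Partner"])

# Multi-category column: one-hot encode and drop the first level.
toy_encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

# Columns that are neither numeric nor boolean still need encoding.
leftover = toy_encoded.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print(toy_encoded.columns.tolist())
print("still non-numeric:", leftover)
```

With drop_first=True the "Month-to-month" level becomes the implicit reference category, so three contract types produce two indicator columns.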
Step 4: Split the Data into Features and Target
X = data_encoded.drop('Churn', axis=1)
y = data_encoded['Churn']
We used Churn as the target and dropped it from the feature set.
Step 5: Split the Data into Training and Validation Sets
We split the data into 80% training and 20% validation sets.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
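Churn labels are usually imbalanced, so one common variation is a stratified split, which keeps the class ratio the same in both sets. This is a sketch of that option on synthetic arrays, not the split used in this walkthrough; the 75/25 class mix is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 25% positive class (illustrative only).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 75 + [1] * 25)

# stratify=y preserves the 75/25 ratio in both the training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"train churn rate: {y_train.mean():.2f}, validation churn rate: {y_val.mean():.2f}")
```

Without stratify, the validation churn rate can drift away from the overall rate purely by chance, which makes metrics harder to compare across runs.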
Step 6: Build and Train the XGBoost Model
We built an XGBoost model and trained it on the data.
model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)
Step 7: Evaluate the Model
After training, we made predictions on the validation set and evaluated the model’s performance using accuracy, a confusion matrix, and a classification report.
predictions = model.predict(X_val)
accuracy = accuracy_score(y_val, predictions)
conf_matrix = confusion_matrix(y_val, predictions)
class_report = classification_report(y_val, predictions)
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")
Model Performance
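The run reported 100% accuracy on the validation set. A perfect score is not a realistic result for churn prediction; it more likely indicates target leakage or an evaluation error than a genuinely perfect classifier, so the preprocessing and split are worth re-checking before relying on it.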
Accuracy: 1.0
Confusion Matrix:
[[1036    0]
 [   0  373]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1036
           1       1.00      1.00      1.00       373

    accuracy                           1.00      1409
   macro avg       1.00      1.00      1.00      1409
weighted avg       1.00      1.00      1.00      1409
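To put an accuracy figure in context, it helps to compare it with the majority-class baseline, meaning the score you would get by always predicting "no churn". The sketch below derives that baseline and the churn-class metrics directly from the confusion matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix as printed above: rows are true classes, columns are predictions.
conf_matrix = np.array([[1036, 0],
                        [0, 373]])
tn, fp, fn, tp = conf_matrix.ravel()
total = conf_matrix.sum()

# Accuracy is the share of correct predictions on the diagonal.
accuracy = (tn + tp) / total

# Always predicting the larger class gives the majority-class baseline.
baseline_accuracy = max(tn + fp, fn + tp) / total

# Metrics for the churn class (label 1), which is usually the class of interest.
churn_recall = tp / (tp + fn)
churn_precision = tp / (tp + fp)

print(f"accuracy: {accuracy:.4f}, baseline: {baseline_accuracy:.4f}")
print(f"churn precision: {churn_precision:.4f}, churn recall: {churn_recall:.4f}")
```

With 1036 of 1409 validation customers not churning, the baseline is already about 73.5%, so recall and precision on the churn class say more about the model than accuracy alone.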
Gratitude
This challenge deepened my understanding of handling customer churn data and using XGBoost for classification tasks. It was great exploring different encoding techniques to preprocess categorical data efficiently.
Stay Tuned!