Day 19 - 30 Days 30 ML Projects: Customer Churn Prediction With XGBoost

Today’s challenge was about predicting customer churn, a crucial task for businesses that want to identify customers who are likely to cancel their subscriptions. We used an XGBoost classifier, leveraging its gradient-boosted tree ensemble to build a strong model.

If you want to see the code, you can find it here: GIT REPO.

Understanding the Data

We used a Telco Customer Churn dataset from Kaggle, which contains customer information such as gender, tenure, contract type, and monthly charges. The target variable is Churn, indicating whether the customer churned (Yes/No). Here’s a quick look at the data:

customerID | gender | SeniorCitizen | Partner | Dependents | tenure | Contract       | MonthlyCharges | Churn
7590-VHVEG | Female | 0             | Yes     | No         | 1      | Month-to-month | 29.85          | No

Code Walkthrough

Let’s import the libraries first

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

Step 1: Load the Data

data = pd.read_csv('dataset/telco_customer_churn.csv')

We loaded the data and took a look at its structure.
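The original snippet skips the inspection itself, but two pandas one-liners cover that first look:

data.info()           # column dtypes and non-null counts
print(data.head())    # first few rows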

Step 2: Handle Missing Values

One of the columns, TotalCharges, is read in as text and contains blank entries. Coercing it to numeric turns those blanks into NaN, which we then filled with the median for simplicity.

data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].median())
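A quick sanity check (my addition, not part of the original walkthrough) confirms the fill worked:

assert data['TotalCharges'].isna().sum() == 0  # no missing values remain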

Step 3: Encode the Categorical Data

To prepare the categorical data, we used a hybrid encoding approach:

  • Label Encoding for binary features (like gender, Partner, Dependents).
  • One-Hot Encoding for multi-category features (like Contract, PaymentMethod).

# Label Encoding for binary columns
label_enc = LabelEncoder()
binary_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn']
for col in binary_cols:
    data[col] = label_enc.fit_transform(data[col])

# One-Hot Encoding for every remaining multi-category text column,
# so that XGBoost receives only numeric features
multi_cat_cols = ['MultipleLines', 'InternetService', 'OnlineSecurity',
                  'OnlineBackup', 'DeviceProtection', 'TechSupport',
                  'StreamingTV', 'StreamingMovies', 'Contract', 'PaymentMethod']
data_encoded = pd.get_dummies(data, columns=multi_cat_cols, drop_first=True)
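Before moving on, it is worth verifying that no text columns slipped through, since XGBoost raises an error on string-typed features. This check is my addition:

# Should list only customerID, which we drop in the next step
print(data_encoded.select_dtypes(include='object').columns)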

Step 4: Split the Data into Features and Target

X = data_encoded.drop(['customerID', 'Churn'], axis=1)
y = data_encoded['Churn']

We used Churn as the target and dropped it from the feature set, along with customerID, which is just an identifier and carries no predictive signal.

Step 5: Split the Data into Training and Validation Sets

We split the data into 80% training and 20% validation sets.

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
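Churn datasets are imbalanced (only about a quarter of Telco customers churn), so a stratified split keeps the churn rate consistent between training and validation. This variant is a suggestion of mine, not what the run below used:

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)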

Step 6: Build and Train the XGBoost Model

We built an XGBoost model and trained it on the data.

model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)
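For the same imbalance reason, XGBoost's scale_pos_weight parameter can upweight the minority class. A sketch of that variant (untested here, and not reflected in the results below):

# Weight positives by the negative-to-positive ratio in the training set
ratio = (y_train == 0).sum() / (y_train == 1).sum()
model = XGBClassifier(n_estimators=100, learning_rate=0.1,
                      scale_pos_weight=ratio, random_state=42)
model.fit(X_train, y_train)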

Step 7: Evaluate the Model

After training, we made predictions on the validation set and evaluated the model’s performance using accuracy, a confusion matrix, and a classification report.

predictions = model.predict(X_val)
accuracy = accuracy_score(y_val, predictions)
conf_matrix = confusion_matrix(y_val, predictions)
class_report = classification_report(y_val, predictions)

print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")

Model Performance

The model achieved an accuracy of 100%. That is not cause for celebration: a perfect score on real-world churn data almost always signals target leakage, meaning some feature directly or indirectly encodes the label. The feature set deserves an audit before this result is trusted.

Accuracy Score: 1.0
Confusion Matrix:
[[1036    0]
 [   0  373]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1036
           1       1.00      1.00      1.00       373

    accuracy                           1.00      1409
   macro avg       1.00      1.00      1.00      1409
weighted avg       1.00      1.00      1.00      1409
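A quick way to hunt for a leak (my addition) is to check which features the model leans on; a single dominant feature usually gives it away:

# Rank features by importance; a leaky feature tends to dominate this list
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))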

Gratitude

This challenge deepened my understanding of handling customer churn data and using XGBoost for classification tasks. It was great exploring different encoding techniques to preprocess categorical data efficiently.

Stay Tuned!
