Day 29 - Credit Risk Prediction With Logistic Regression and SVM

On Day 29, I focused on building models to predict credit risk using both Logistic Regression and Support Vector Machines (SVM). The dataset contains various financial and personal details of individuals applying for credit, and the goal was to predict whether an applicant poses a credit risk.

If you want to see the code, you can find it here: GIT REPO.

Dataset:

I used the German Credit Risk dataset from Kaggle, which includes features like Age, Sex, Job, Housing, Saving accounts, Checking account, Credit amount, Duration, and Purpose. The task was to predict whether a loan applicant is high risk (risk = 1) or not (risk = 0); since this copy of the dataset ships without a ready-made label, the target is derived from a simple rule described in the code below.
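
Before modeling, it helps to glance at the raw file. A minimal inspection sketch, assuming the CSV sits at the same dataset/german_credit_data.csv path used in the code below:

import pandas as pd

# Peek at the raw csv: shape, column names, and missing-value counts.
raw = pd.read_csv('dataset/german_credit_data.csv')
print(raw.shape)            # 1000 rows; the first column is an unnamed row index
print(raw.columns.tolist())
print(raw.isnull().sum())   # NaNs show up in 'Saving accounts' and 'Checking account'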

Code

# Problem: Credit risk prediction with Logistic Regression and SVM
# Dataset: https://www.kaggle.com/datasets/uciml/german-credit

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


# Step 1: Load the data
data = pd.read_csv('dataset/german_credit_data.csv')

# Step 2: Preprocess the data
# Drop the first column: it is an unnamed row index saved into the csv.
data = data.iloc[:, 1:]

# print(data.isnull().sum())  # Check the count of missing values in dataset
# Saving accounts has 183 missing values.
# Checking account has 394 missing values.

data['Saving accounts'] = data['Saving accounts'].fillna('unknown')
data['Checking account'] = data['Checking account'].fillna('unknown')

# Convert the categorical columns into numeric using One-Hot Encoding
data = pd.get_dummies(data, columns=['Sex', 'Housing', 'Saving accounts', 'Checking account', 'Purpose'], drop_first=True)

# Step 3: Create Feature and Target datasets
# Define a simple rule for generating the risk column:
# if the Credit amount is greater than 5000 and the Duration is more than 24 months, the applicant is considered high risk.
data['risk'] = ((data['Credit amount'] > 5000) & (data['Duration'] > 24)).astype(int)

X = data.drop('risk', axis=1) # Features
y = data['risk'] # Target

# Step 4: Split the data into training and validation datasets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Step 6: Build and Train the Logistic Regression Model
model_log_reg = LogisticRegression()
model_log_reg.fit(X_train_scaled, y_train)

# Step 7: Build and Train the SVM Model
model_svm = SVC()
model_svm.fit(X_train_scaled, y_train)

# Step 8: Make Prediction and Evaluate Logistic Regression Model
log_reg_predictions = model_log_reg.predict(X_val_scaled)
# Note: sklearn metrics take (y_true, y_pred) in that order
accuracy_score_lg = accuracy_score(y_val, log_reg_predictions)
confusion_matrix_lg = confusion_matrix(y_val, log_reg_predictions)
classification_report_log_reg = classification_report(y_val, log_reg_predictions)

print("Logistic Regression: ")
print(f"Accuracy Score: {accuracy_score_lg}")
print(f"Confusion Matrix: \n {confusion_matrix_lg}")
print(f"Classification Report: \n {classification_report_log_reg}")

# Step 9: Make Prediction and Evaluate the SVM Model
svm_predictions = model_svm.predict(X_val_scaled)
accuracy_score_svm = accuracy_score(y_val, svm_predictions)
confusion_matrix_svm = confusion_matrix(y_val, svm_predictions)
classification_report_svm = classification_report(y_val, svm_predictions)

print("SVM Model: ")
print(f"Accuracy Score: {accuracy_score_svm}")
print(f"Confusion Matrix: \n {confusion_matrix_svm}")
print(f"Classification Report: \n {classification_report_svm}")

Understanding the code:

  1. Step 1: Load the Data

    • I loaded the dataset and removed the first unnamed index column, as it wasn’t needed for modeling.
  2. Step 2: Data Preprocessing

    • Handling Missing Values:
      • The Saving accounts column had 183 missing values and the Checking account column had 394; I filled both with ‘unknown’ so that “missing” becomes its own category instead of dropping those rows.
    • One-Hot Encoding:
      • I converted the categorical columns Sex, Housing, Saving accounts, Checking account, and Purpose into numeric indicator columns with One-Hot Encoding; drop_first=True drops one baseline category per column to avoid redundant features. A toy example follows this list.
  3. Step 3: Create Feature and Target Datasets

    • I created a simple risk rule where:
      • If a Credit amount is greater than 5000 and the Duration is more than 24 months, the applicant is considered a high risk.
    • The target variable is labeled as risk, and all other columns are used as features.
  4. Step 4: Split the Data

    • I split the dataset into training and validation sets, with 80% of the data used for training and 20% for validation.
  5. Step 5: Feature Scaling

    • I applied StandardScaler to standardize every feature to zero mean and unit variance; both Logistic Regression and SVM are sensitive to feature scale, so large-valued columns like Credit amount would otherwise dominate. A small worked example also follows this list.
  6. Steps 6 & 7: Build and Train the Models

    • I trained both the Logistic Regression and SVM models using the scaled training data.
  7. Steps 8 & 9: Evaluate the Models

    Logistic Regression:
    Accuracy Score: 0.965
    Confusion Matrix:
    [[175   6]
     [  1  18]]
    Classification Report:
                  precision    recall  f1-score   support
    
              0       0.99      0.97      0.98       181
              1       0.75      0.95      0.84        19
    
        accuracy                           0.96       200
      macro avg       0.87      0.96      0.91       200
    weighted avg       0.97      0.96      0.97       200
    
    SVM Model:
    Accuracy Score: 0.96
    Confusion Matrix:
    [[176   8]
     [  0  16]]
    Classification Report:
                  precision    recall  f1-score   support
    
              0       1.00      0.96      0.98       184
              1       0.67      1.00      0.80        16
    
        accuracy                           0.96       200
      macro avg       0.83      0.98      0.89       200
    weighted avg       0.97      0.96      0.96       200
    
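As mentioned in Step 2, here is a toy illustration of what pd.get_dummies with drop_first=True actually produces; the Housing values are made up for the example:

import pandas as pd

toy = pd.DataFrame({'Housing': ['own', 'rent', 'free', 'own']})
print(pd.get_dummies(toy, columns=['Housing'], drop_first=True))
# Categories are encoded in alphabetical order, so 'free' becomes the dropped
# baseline: a row with Housing_own=0 and Housing_rent=0 means 'free'.
# (Recent pandas versions print True/False instead of 1/0.)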

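And the worked example for Step 5: StandardScaler standardizes each column to z = (x - mean) / std, with the mean and standard deviation learned from the training data. A small sketch with made-up credit amounts:

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1000.0], [2000.0], [6000.0]])  # a 'Credit amount'-like column
scaler = StandardScaler().fit(x)
print(scaler.mean_, scaler.scale_)    # mean = 3000, std ~= 2160.25
print(scaler.transform(x).ravel())    # [-0.926, -0.463, 1.389]
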
Key Insights:

  • Logistic Regression achieved high precision (0.99) and recall (0.97) for predicting non-risky loans (class 0). It also performed well in detecting risky loans (class 1) with a recall of 0.95.
  • SVM showed perfect recall (1.00) for identifying all risky loans (class 1), but had a slightly lower precision (0.67), meaning it was more likely to misclassify non-risky cases as risky.
  • Both models reached roughly 96% accuracy, but with only 19 risky cases among the 200 validation rows, accuracy alone flatters them; judged on the class-1 precision/recall trade-off, Logistic Regression had the better balance.
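
Given that imbalance, one standard follow-up is class_weight='balanced', which reweights the loss inversely to class frequency so the minority class counts for more. A hedged sketch reusing the scaled data from Step 5; whether it actually improves the class-1 trade-off here would need to be verified:

model_log_reg_bal = LogisticRegression(class_weight='balanced')
model_svm_bal = SVC(class_weight='balanced')

model_log_reg_bal.fit(X_train_scaled, y_train)
model_svm_bal.fit(X_train_scaled, y_train)

print(classification_report(y_val, model_log_reg_bal.predict(X_val_scaled)))
print(classification_report(y_val, model_svm_bal.predict(X_val_scaled)))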

Gratitude

What a blast revisiting the models on a new dataset! It’s Day 29 of the challenge, and guess what? Tomorrow is the BIG day! 🎉

Stay Tuned!