Day 17 - 30 Days 30 ML Projects: Predict Diabetes Onset Using Decision Trees and Random Forests
On Day 17 of the 30 Days 30 Machine Learning Projects Challenge, the task was to predict whether a person would develop diabetes based on various medical factors such as glucose levels, insulin levels, and age. This is a binary classification problem where the goal is to predict if a person is diabetic (1) or not (0).
If you want to see the code, you can find it here: GIT REPO.
Understanding the Data
We used the Pima Indians Diabetes Dataset, which includes medical records of women aged 21 and above. The dataset contains various features related to pregnancies, glucose levels, blood pressure, skin thickness, and more. Here’s a glimpse of the data:
|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI  | DiabetesPedigreeFunction | Age | Outcome |
|---|-------------|---------|---------------|---------------|---------|------|--------------------------|-----|---------|
| 0 | 6           | 148     | 72            | 35            | 0       | 33.6 | 0.627                    | 50  | 1       |
| 1 | 1           | 85      | 66            | 29            | 0       | 26.6 | 0.351                    | 31  | 0       |
| 2 | 8           | 183     | 64            | 0             | 0       | 23.3 | 0.672                    | 32  | 1       |
| 3 | 1           | 89      | 66            | 23            | 94      | 28.1 | 0.167                    | 21  | 0       |
| 4 | 0           | 137     | 40            | 35            | 168     | 43.1 | 2.288                    | 33  | 1       |
Outcome:
- 1 means the patient is diabetic.
- 0 means the patient is not diabetic.
Code Workflow
Here’s the step-by-step breakdown of how I approached this problem:
Step 1: Load the Data
First, I loaded the dataset using Pandas to explore and understand the data.
import pandas as pd
# Load the dataset
data = pd.read_csv('dataset/diabetes.csv')
print(data.head())
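One quirk of this dataset is worth flagging before modeling: several columns (Glucose, BloodPressure, SkinThickness, Insulin, BMI) use 0 as a stand-in for missing measurements, which you can already see in the Insulin column above. I kept the raw values for this run, but here is a quick sketch to count those zeros if you ever want to impute them instead:
# Columns where a value of 0 is physiologically implausible (0 = missing)
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print((data[zero_as_missing] == 0).sum())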
Step 2: Preprocess the Data
Next, I separated the features (X) from the target (y).
X = data.drop('Outcome', axis=1) # Features
y = data['Outcome'] # Target
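Because accuracy is the headline metric later, it's worth knowing how imbalanced the target is. A quick check (this dataset is roughly 65% non-diabetic, 35% diabetic):
# Distribution of the target classes
print(y.value_counts(normalize=True))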
Step 3: Split the Data
The data was then split into training and validation sets with an 80-20 split.
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
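One optional refinement, given the class imbalance: passing stratify=y makes the split preserve the same 0/1 ratio in both sets. I used the plain split above, but the variant looks like this:
# Stratified variant: train and validation keep the same class proportions
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)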
Step 4: Build and Train the Models
- Decision Tree: I trained a Decision Tree Classifier as the first model.
from sklearn.tree import DecisionTreeClassifier
# Build and train the Decision Tree model
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)
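An unconstrained Decision Tree keeps splitting until it nearly memorizes the training data, which often hurts validation accuracy. If you want to inspect or rein in the tree, a small sketch (max_depth=5 is just an illustrative value):
# Inspect how complex the unconstrained tree grew
print(f"Tree depth: {decision_tree.get_depth()}, leaves: {decision_tree.get_n_leaves()}")
# A shallower variant as a simple guard against overfitting
shallow_tree = DecisionTreeClassifier(max_depth=5, random_state=42)
shallow_tree.fit(X_train, y_train)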
- Random Forest: Next, I trained a Random Forest Classifier with 100 trees and specific parameters.
from sklearn.ensemble import RandomForestClassifier
# Build and train the Random Forest model
random_forest = RandomForestClassifier(n_estimators=100, min_samples_leaf=1, min_samples_split=5, random_state=42)
random_forest.fit(X_train, y_train)
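A nice side benefit of Random Forests is built-in feature importances, which hint at which medical factors drive the predictions (on this dataset, Glucose usually dominates). A short sketch:
# Impurity-based importances, one value per feature, highest first
importances = pd.Series(random_forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))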
Step 5: Make Predictions and Evaluate
After training, I made predictions on the validation set and evaluated both models using accuracy, confusion matrices, and classification reports.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Decision Tree
dt_predictions = decision_tree.predict(X_val)
dt_accuracy = accuracy_score(y_val, dt_predictions)
dt_confusion_matrix = confusion_matrix(y_val, dt_predictions)
dt_classification_report = classification_report(y_val, dt_predictions)
print(f"Decision Tree Accuracy Score: {dt_accuracy}")
print(f"Decision Tree Confusion Matrix: {dt_confusion_matrix}")
print(f"Decision Tree Classification Report: {dt_classification_report}")
# Random Forest
rf_predictions = random_forest.predict(X_val)
rf_accuracy = accuracy_score(y_val, rf_predictions)
rf_confusion_matrix = confusion_matrix(y_val, rf_predictions)
rf_classification_report = classification_report(y_val, rf_predictions)
print(f"Random Forest Accuracy Score: {rf_accuracy}")
print(f"Random Forest Confusion Matrix: {rf_confusion_matrix}")
print(f"Random Forest Classification Report: {rf_classification_report}")
Results:
- Decision Tree Accuracy: 74%
- Random Forest Accuracy: 73%
Step 6: Visualization
I visualized the confusion matrices for both models using Seaborn heatmaps.
import matplotlib.pyplot as plt
import seaborn as sns
# Decision Tree
plt.figure(figsize=(7, 5))
sns.heatmap(dt_confusion_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['No Diabetes', 'Diabetes'], yticklabels=['No Diabetes', 'Diabetes'])
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Decision Tree Confusion Matrix')
plt.show()
# Random Forest
plt.figure(figsize=(7, 5))
sns.heatmap(rf_confusion_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['No Diabetes', 'Diabetes'], yticklabels=['No Diabetes', 'Diabetes'])
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Random Forest Confusion Matrix')
plt.show()
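If you'd rather compare the two matrices in a single figure, the same heatmap calls accept an ax argument, so a subplot layout is a purely cosmetic variant of the code above:
# Both confusion matrices side by side in one figure
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
labels = ['No Diabetes', 'Diabetes']
for ax, cm, title in [(axes[0], dt_confusion_matrix, 'Decision Tree'),
                      (axes[1], rf_confusion_matrix, 'Random Forest')]:
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=labels, yticklabels=labels, ax=ax)
    ax.set_xlabel('Predicted Values')
    ax.set_ylabel('Actual Values')
    ax.set_title(f'{title} Confusion Matrix')
plt.tight_layout()
plt.show()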
Model Performance
Somewhat surprisingly, the Decision Tree came out slightly ahead with 74% accuracy versus 73% for the Random Forest; with a small dataset and a single fixed validation split, a lone tree can edge out the ensemble. I tried to improve the Random Forest with hyperparameter tuning via GridSearchCV.
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [100, 200],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5],
'min_samples_leaf': [1, 2]
}
grid_search = GridSearchCV(estimator=random_forest, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
# Display the best parameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)
After tuning, the best parameters were:
Best Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}
Best Score: 0.783
Note that best_score_ is the mean cross-validated accuracy on the training folds; on the held-out validation set, the tuned model's accuracy stayed at 73%.
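One way to reproduce that check: GridSearchCV refits the best parameter combination on the full training set and exposes it as best_estimator_, which can then be scored on the validation set (a minimal sketch):
# Evaluate the refit best model on the held-out validation set
best_rf = grid_search.best_estimator_
tuned_accuracy = accuracy_score(y_val, best_rf.predict(X_val))
print(f"Tuned Random Forest Accuracy: {tuned_accuracy}")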
Gratitude
This project was a deep dive into comparing Decision Trees and Random Forests. I learned a lot about tuning models and how important it is to understand the trade-offs between complexity and performance.
Stay tuned for Day 18!