Day 11 - 30 Days 30 Machine Learning Projects: Anomaly Detection With Isolation Forest
The Problem
On Day 11 of the 30 Days 30 Machine Learning Projects challenge, we focused on detecting credit card fraud with an Isolation Forest model. The goal was to identify anomalies in the transaction data and label them as potential fraud cases.
If you want to see the code, you can find it here: GIT REPO.
Understanding the Data
We used the Credit Card Fraud Detection Dataset from Kaggle. The dataset includes transactions labeled as either normal (0) or fraud (1). In this project, we used Isolation Forest to separate the normal and fraudulent transactions.
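Before modeling, it helps to see just how imbalanced the dataset is. A minimal sketch (assuming the CSV sits at dataset/creditcard.csv, as in Step 1 below):
import pandas as pd

data = pd.read_csv('dataset/creditcard.csv')
# Class 0 = normal, Class 1 = fraud; fraud is a tiny fraction of all rows
print(data['Class'].value_counts())
print(data['Class'].value_counts(normalize=True))
This imbalance is the reason plain accuracy looks flattering later on.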
Code Workflow
The steps involved were as follows:
- Load the Data
- Create Feature and Target datasets
- Split the Data
- Build and Train the Model
- Make Predictions and Evaluate
- Visualization
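All of the snippets below assume these imports:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report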
Step 1: Load the Data
Download the data from Kaggle and place it in a dataset directory at the root of your project.
data = pd.read_csv('dataset/creditcard.csv')
Step 2: Create Feature and Target Datasets
We separated the features (X) from the target labels (y) and remapped the labels to match Isolation Forest's output convention, where 1 represents normal transactions and -1 represents fraud (anomalies).
X = data.drop('Class', axis=1)
y = data['Class'].map({0: 1, 1: -1}) # 1: Normal, -1: Fraud/Anomaly
Step 3: Split the Data
We split the dataset into 80% training and 20% validation sets. To keep the proportion of normal and fraud transactions consistent across both sets, we used stratification:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
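As a quick sanity check on the stratification, the fraud fraction should be (nearly) identical in both splits:
# Both splits should show the same tiny fraction of -1 (fraud) labels
print(y_train.value_counts(normalize=True))
print(y_val.value_counts(normalize=True))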
Step 4: Build and Train the Isolation Forest Model
We used Isolation Forest, which is an unsupervised algorithm for anomaly detection. The contamination parameter was set to 0.01 (assuming 1% of the data are anomalies).
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(X_train)
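Under the hood, contamination only sets a score threshold: Isolation Forest assigns every sample an anomaly score and flags the most anomalous ~1% as outliers. A short sketch inspecting those scores with scikit-learn's score_samples and decision_function methods:
# score_samples: lower (more negative) means more anomalous
scores = model.score_samples(X_val)
print(scores[:5])
# decision_function subtracts the threshold implied by contamination,
# so values below 0 are exactly the points predict() labels as -1
decisions = model.decision_function(X_val)
print((decisions < 0).sum(), 'points flagged as anomalies')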
Step 5: Make Predictions and Evaluate
The model predicted whether each transaction was normal (1) or an anomaly (-1). We used a confusion matrix and other metrics to evaluate the model’s performance.
predictions = model.predict(X_val)  # 1 = normal, -1 = anomaly
X_val['anomaly'] = predictions  # keep predictions next to the features for inspection
accuracy = accuracy_score(y_val, predictions)
conf_matrix = confusion_matrix(y_val, predictions)
class_report = classification_report(y_val, predictions, zero_division=1)
print(f"Accuracy Score: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")
Key Metrics:
- True Positives (fraud correctly identified as fraud)
- False Positives (normal transactions mistakenly flagged as fraud)
- False Negatives (fraud missed by the model)
- True Negatives (normal transactions correctly identified)
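With the label order scikit-learn uses here ([-1, 1]), row 0 of the confusion matrix is true fraud and row 1 is true normal, so the four counts unpack like this (treating fraud, -1, as the positive class):
# Rows = true class, columns = predicted class, in sorted label order [-1, 1]
tp, fn, fp, tn = conf_matrix.ravel()
print(f'Fraud caught: {tp}, fraud missed: {fn}, false alarms: {fp}, correct normals: {tn}')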
Step 6: Visualization
We created a scatter plot to visualize the anomalies versus normal transactions based on two features (V1 and V2), and used a confusion matrix heatmap to show the model’s performance.
Scatter Plot:
plt.figure(figsize=(10, 6))
# Plot the two predicted groups separately so each gets its own legend entry
normal = X_val[predictions == 1]
anomalies = X_val[predictions == -1]
plt.scatter(normal['V1'], normal['V2'], c='steelblue', s=10, label='Normal')
plt.scatter(anomalies['V1'], anomalies['V2'], c='red', s=10, label='Anomaly')
plt.xlabel('V1')
plt.ylabel('V2')
plt.title('Isolation Forest: Anomalies vs Normal Transactions')
plt.legend()
plt.show()
Confusion Matrix Heatmap:
plt.figure(figsize=(10, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Anomaly', 'Normal'],
            yticklabels=['True Anomaly', 'True Normal'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix Heatmap')
plt.show()
Model Performance
With roughly 99% accuracy, the Isolation Forest model looks strong at first glance, but on a dataset this imbalanced the accuracy is driven almost entirely by the normal class. The model correctly identified the vast majority of normal transactions, yet it missed nearly half of the fraud cases and flagged several hundred normal transactions as fraud.
Accuracy Score:
0.989431550858467
Confusion Matrix:
[[   53    45]
 [  557 56307]]
Classification Report:
              precision    recall  f1-score   support

          -1       0.09      0.54      0.15        98
           1       1.00      0.99      0.99     56864

    accuracy                           0.99     56962
   macro avg       0.54      0.77      0.57     56962
weighted avg       1.00      0.99      0.99     56962
- 53 True Anomalies (Fraud) were correctly identified.
- 45 Fraud Cases were missed by the model.
- 557 False Positives were normal transactions mistakenly flagged as fraud.
- 56,307 True Normals were correctly identified as normal transactions.
Gratitude
Working with unsupervised learning and anomaly detection was a great learning experience. Stay tuned for Day 12!