Day 11 - 30 Days 30 Machine Learning Projects: Anomaly Detection With Isolation Forest
The Problem
On Day 11 of the 30 Days 30 Machine Learning Projects Challenge, we focused on detecting credit card fraud using an Isolation Forest model. The goal was to identify anomalies in the transaction data and flag them as potential fraud cases.
If you want to see the code, you can find it here: GIT REPO.
Understanding the Data
We used the Credit Card Fraud Detection Dataset from Kaggle. The dataset includes transactions labeled as either normal (0) or fraud (1). In this project, we used Isolation Forest to separate the normal and fraudulent transactions.
Code Workflow
The steps involved were as follows:
- Load the Data
- Create Feature and Target datasets
- Split the Data
- Build and Train the Model
- Make Predictions and Evaluate
- Visualization
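The snippets in this post are shown without their imports. A minimal set that covers everything below (assuming pandas, scikit-learn, matplotlib, and seaborn are installed) looks like this:
# Shared imports assumed by the snippets in this post.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report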
Step 1: Load the data
Download the dataset from Kaggle and place it in the dataset directory at the root of your project.
data = pd.read_csv('dataset/creditcard.csv')
Step 2: Create Feature and Target Datasets
We separated the features (X) from the target labels (y) and remapped the labels to match Isolation Forest's output convention, where 1 represents normal transactions and -1 represents anomalies (fraud).
X = data.drop('Class', axis=1)
y = data['Class'].map({0: 1, 1: -1}) # 1: Normal, -1: Fraud/Anomaly
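Since Isolation Forest treats anomalies as the rare class, it is worth checking just how rare fraud actually is before choosing a contamination value. A quick sanity check (not part of the original script) might be:
# Share of each mapped label; fraud (-1) is a tiny fraction of all transactions.
print(y.value_counts(normalize=True))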
Step 3: Split the Data
We split the dataset into 80% training and 20% validation sets. To ensure the distribution of normal and fraud transactions remains balanced across the training and validation sets, we used stratification:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
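If you want to confirm that stratification preserved the class balance, a quick check (an optional addition, not in the original workflow) could be:
# The fraud ratio should be nearly identical in both splits thanks to stratify=y.
print(y_train.value_counts(normalize=True))
print(y_val.value_counts(normalize=True))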
Step 4: Build and Train the Isolation Forest Model
We used Isolation Forest, an unsupervised anomaly detection algorithm that isolates points through random splits; anomalies tend to be isolated in fewer splits than normal points, which is what makes them stand out. The contamination parameter was set to 0.01, i.e., we assumed roughly 1% of the transactions are anomalies.
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(X_train)
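Beyond the hard 1/-1 labels, Isolation Forest also exposes a continuous anomaly score via decision_function, which is useful if you want to rank transactions or pick your own threshold. A small sketch (an optional extra, not part of the original steps):
# Lower scores mean "more anomalous"; negative scores correspond to predicted anomalies.
scores = model.decision_function(X_val)
print(scores[:5])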
Step 5: Make Predictions and Evaluate
The model predicted whether each transaction was normal (1) or an anomaly (-1). We used a confusion matrix and other metrics to evaluate the model’s performance.
predictions = model.predict(X_val)
X_val['anomaly'] = predictions  # attach the predictions to the validation set for later inspection
accuracy = accuracy_score(y_val, predictions)
conf_matrix = confusion_matrix(y_val, predictions)
class_report = classification_report(y_val, predictions, zero_division=1)
print(f"Accuracy Score: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")
Key Metrics (mapped to the confusion-matrix cells in the snippet after this list):
- True Positives (fraud correctly identified as fraud)
- False Positives (normal transactions mistakenly flagged as fraud)
- False Negatives (fraud missed by the model)
- True Negatives (normal transactions correctly identified)
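Because scikit-learn orders the labels ascending ([-1, 1]), these four counts can be read straight out of conf_matrix; a small sketch (the variable names here are illustrative, not from the original code):
# Rows are actual classes, columns are predicted classes, in label order [-1, 1].
tp_fraud, fn_fraud = conf_matrix[0]   # actual fraud: caught vs. missed
fp_fraud, tn_normal = conf_matrix[1]  # actual normal: wrongly flagged vs. correct
print(f"TP (fraud caught): {tp_fraud}, FN (fraud missed): {fn_fraud}")
print(f"FP (normal flagged): {fp_fraud}, TN (normal correct): {tn_normal}")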
Step 6: Visualization
We created a scatter plot to visualize the anomalies versus normal transactions based on two features (V1 and V2), and used a confusion matrix heatmap to show the model’s performance.
Scatter Plot:
plt.figure(figsize=(10, 6))
plt.scatter(X_val['V1'], X_val['V2'], c=predictions, cmap='coolwarm', label='Anomalies')
plt.xlabel('V1')
plt.ylabel('V2')
plt.title('Isolation Forest: Anomalies vs Normal Transactions')
plt.legend()
plt.show()
Confusion Matrix Heatmap:
plt.figure(figsize=(10, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Anomaly', 'Normal'], yticklabels=['True Anomaly', 'True Normal'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix Heatmap')
plt.show()
Model Performance
At roughly 99% accuracy the Isolation Forest model looks strong on paper, but that figure is dominated by the normal class: the model correctly identified the vast majority of normal transactions while still missing almost half of the fraud cases and flagging several hundred legitimate transactions, as the breakdown and quick arithmetic below show.
Accuracy Score:
0.989431550858467
Confusion Matrix:
[[   53    45]
 [  557 56307]]
Classification Report:
              precision    recall  f1-score   support
          -1       0.09      0.54      0.15        98
           1       1.00      0.99      0.99     56864
    accuracy                           0.99     56962
   macro avg       0.54      0.77      0.57     56962
weighted avg       1.00      0.99      0.99     56962
- 53 True Anomalies (Fraud) were correctly identified.
- 45 Fraud Cases were missed by the model.
- 557 False Positives were normal transactions mistakenly flagged as fraud.
- 56,307 True Normals were correctly identified as normal transactions.
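Those four counts also explain the fraud-class numbers in the classification report:
- Fraud precision = 53 / (53 + 557) ≈ 0.09 (most flagged transactions are actually normal)
- Fraud recall = 53 / (53 + 45) ≈ 0.54 (roughly half of the real fraud is caught)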
Gratitude
Working with unsupervised learning and anomaly detection was a great learning experience. Stay tuned for Day 12!
Posts in this series
- Day 26 - Time Series Forecasting of Electricity Consumption Using LSTM (Intro to Deep Learning)
- Day 25 - Sentiment Analysis of Customer Reviews Using Traditional NLP Techniques
- Day 24 - K-Means Clustering to Segment Customers Based on Behavior
- Day 23 - Fraud Detection in Financial Transactions Using Logistic Regression and Random Forest
- Day 22 - Recommender System With Matrix Factorization
- Day 21 - Deploy a Machine Learning Model Using FastAPI and Heroku for Real-Time Predictions
- Day 20 - 30 Days 30 ML Projects: Create a Topic Model Using Latent Dirichlet Allocation (LDA)
- Day 19 - 30 Days 30 ML Projects: Customer Churn Prediction With XGBoost
- Day 18 - 30 Days 30 ML Projects: Time Series Forecasting of Stock Prices With ARIMA Model
- Day 17 - 30 Days 30 ML Projects: Predict Diabetes Onset Using Decision Trees and Random Forests
- Day 16 - 30 Days 30 ML Projects: Real-Time Face Detection in a Webcam Feed Using OpenCV
- Day 15 - 30 Days 30 ML Projects: Predict House Prices With XGBoost
- Day 14 - 30 Days 30 ML Projects: Cluster Grocery Store Customers With K-Means
- Day 13 - 30 Days 30 ML Projects: Build a Music Genre Classifier Using Audio Features Extraction
- Day 12 - 30 Days 30 Machine Learning Projects Challenge
- Day 11 - 30 Days 30 Machine Learning Projects: Anomaly Detection With Isolation Forest
- Day 10 - 30 Days 30 Machine Learning Projects: Recommender System Using Collaborative Filtering
- Day 9 - 30 Days 30 Machine Learning Projects
- Day 8 - 30 Days 30 Machine Learning Projects
- Day 7 - 30 Days 30 Machine Learning Projects
- Day 6 - 30 Days 30 Machine Learning Projects
- Day 5 - 30 Days 30 Machine Learning Projects
- Day 4 - 30 Days 30 Machine Learning Projects
- Day 3 - 30 Days 30 Machine Learning Projects
- Day 2 - 30 Days 30 Machine Learning Projects
- Day 1 - 30 Days 30 Machine Learning Projects
- 30 Days 30 Machine Learning Projects Challenge