Day 20: Building an Autoencoder-Based Anomaly Detection System (Part 1: Data and Model Setup)

Today, we began building an autoencoder-based anomaly detection system using the Credit Card Fraud Detection Dataset from Kaggle. Our main goal for today was to set up the data and build an initial version of the autoencoder model to detect anomalies. Fine-tuning and evaluation will be covered tomorrow.

Step 1: Data Preparation

The first step was to prepare the data. Because the dataset deals with credit card fraud, it consists almost entirely of normal transactions; fraudulent transactions make up only about 0.17% of the records, so the dataset is highly imbalanced.

Import Necessary Libraries

We started by importing all the necessary libraries for data processing, model building, and visualization.

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import matplotlib.pyplot as plt
  • Pandas and NumPy: Used for handling and manipulating data.
  • TensorFlow/Keras: Used for building the autoencoder model.
  • Matplotlib: Used for visualizing the training progress.
  • StandardScaler: Used to standardize the features (zero mean, unit variance) for better model performance.

Load and Preprocess the Data

We loaded the credit card fraud dataset and normalized it for training the autoencoder.

# Load the dataset
data = pd.read_csv('creditcard.csv')

# Display basic information
print(data.head())
print(data.info())
  • Dataset Overview: The dataset contains the PCA-transformed features V1 through V28, along with Time, Amount, and the target label Class (0 for normal transactions, 1 for fraud). A quick check of the class distribution, shown below, confirms the imbalance.
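
As a sanity check of the imbalance mentioned above, we can print the class distribution directly. This is a minimal sketch; the counts in the comment are the published figures for the standard Kaggle version of the dataset.

# Check how imbalanced the dataset is
print(data['Class'].value_counts())
print(data['Class'].value_counts(normalize=True))
# Standard Kaggle version: 284,315 normal vs. 492 fraudulent transactions (~0.17% fraud)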

Data Processing Steps

The data needs to be properly processed before feeding it into the autoencoder.

# Extract features and labels
X = data.drop(columns=['Class', 'Time'])
y = data['Class']

# Standardize all features (the PCA components and 'Amount')
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Use only non-fraudulent data to train the autoencoder
X_train_normal = X_train[y_train == 0]
  • Feature Extraction: We used all features except Class and Time.
  • Scaling: We used StandardScaler to scale features to have a mean of 0 and a standard deviation of 1. This is important to help the model converge during training.
  • Data Splitting: We split the data into training (80%) and test (20%) sets.
  • Train on Normal Data Only: We kept only the normal transactions (y_train == 0) for training the autoencoder. The model needs to learn what normal data looks like so that anomalies later stand out through high reconstruction error; a quick sanity check of the result is shown below.
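
A minimal sanity check, using only the variables defined above, confirms that the training subset now contains no fraudulent transactions:

# Sanity-check the split and the normal-only training subset
print("Training set shape:      ", X_train.shape)
print("Normal-only subset shape:", X_train_normal.shape)
print("Frauds in y_train:       ", int((y_train == 1).sum()))  # excluded from autoencoder training
print("Frauds in y_test:        ", int((y_test == 1).sum()))   # kept for evaluation in Part 2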

Step 2: Set Up the Autoencoder Model

An autoencoder is made up of two main components:

  • Encoder: Compresses the data into a smaller representation.
  • Decoder: Attempts to reconstruct the original data from the compressed representation.

Build the Autoencoder Model

# Build the autoencoder model
model = Sequential()

# Encoder layers
model.add(Dense(14, activation='relu', input_shape=(X_train.shape[1],)))  # First layer with 14 neurons
model.add(Dense(7, activation='relu'))  # Reduced to 7 neurons

# Decoder layers
model.add(Dense(14, activation='relu'))  # Upsample back to 14 neurons
model.add(Dense(X_train.shape[1], activation='linear'))  # Final layer to reconstruct original input shape

# Compile the model
model.compile(optimizer='adam', loss='mse')
model.summary()

Explanation

  • Encoder Part: The encoder has two layers that reduce the original feature space from 29 features (all columns except Class and Time) to a 7-dimensional representation. This step captures the essential information while reducing noise.
  • Decoder Part: The decoder tries to reconstruct the original features from the smaller representation.
  • Linear Activation: The final layer uses a linear activation because the standardized inputs are continuous real values (including negatives), and the reconstruction must be able to match them.
  • Loss Function: We used Mean Squared Error (MSE) as the loss function, which measures how closely the reconstruction matches the input; the same per-sample error will later serve as the anomaly score (see the sketch below).
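
To make the role of the MSE loss concrete, here is a minimal sketch of how the per-sample reconstruction error could be computed once the model has been trained; the actual thresholding and classification are deferred to Part 2.

# Per-sample reconstruction error (the quantity the MSE loss averages over)
reconstructions = model.predict(X_test)
reconstruction_error = np.mean(np.square(X_test - reconstructions), axis=1)

# Fraudulent transactions are expected to have larger errors on average
print("Mean error (normal):", reconstruction_error[y_test == 0].mean())
print("Mean error (fraud): ", reconstruction_error[y_test == 1].mean())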

Model Summary

The model summary gives an overview of the number of parameters and the layers used in the autoencoder.
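
As a cross-check against the printed summary, the parameter count can be worked out by hand. The figures below assume 29 input features (all columns except Class and Time):

# Parameters per Dense layer = inputs * units + units (weights + biases)
# Dense(14): 29 * 14 + 14 = 420
# Dense(7):  14 * 7  +  7 = 105
# Dense(14):  7 * 14 + 14 = 112
# Dense(29): 14 * 29 + 29 = 435
# Total trainable parameters: 420 + 105 + 112 + 435 = 1,072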

Step 3: Train the Autoencoder

The next step is to train the autoencoder on normal data only. This way, the model learns to reconstruct typical transaction patterns.

# Train the autoencoder
history = model.fit(X_train_normal, X_train_normal,
                    epochs=50,
                    batch_size=256,
                    validation_split=0.2,
                    verbose=1)
  • Training on Normal Data: We train the autoencoder on only normal transactions to learn normal behavior patterns.
  • Epochs and Batch Size: We used 50 epochs and a batch size of 256. The number of epochs is the number of complete passes through the training data, and batch size determines how many samples are processed before the model is updated.
  • Validation Split: We used 20% of the training data for validation to monitor the model’s performance during training (an optional early-stopping variant is sketched below).
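
Although we trained for a fixed 50 epochs today, an early-stopping callback is one option worth keeping in mind for the fine-tuning in Part 2. This is a minimal sketch, not part of today's run:

# Optional: stop training once the validation loss stops improving
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

history = model.fit(X_train_normal, X_train_normal,
                    epochs=50,
                    batch_size=256,
                    validation_split=0.2,
                    callbacks=[early_stop],
                    verbose=1)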

Plot Training Loss

We used a loss plot to monitor how well the model is learning to reconstruct the normal transactions.

# Plot the training and validation loss
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.title('Training and Validation Loss for Autoencoder')
plt.show()
  • Loss Plot: The training loss should ideally decrease over time, indicating that the model is improving. The validation loss helps to check if the model is overfitting (performing well on training data but poorly on validation data).
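
A quick numeric comparison of the final training and validation losses, using the history object from the training run above, complements the plot when judging overfitting:

# Compare the last recorded training and validation losses
final_train_loss = history.history['loss'][-1]
final_val_loss = history.history['val_loss'][-1]
print(f"Final training loss:   {final_train_loss:.6f}")
print(f"Final validation loss: {final_val_loss:.6f}")
# A validation loss well above the training loss suggests overfitting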

Summary of Part 1: Data and Model Setup

  • We loaded and processed the credit card dataset, separating it into features and labels. The features were standardized (zero mean, unit variance) to help training converge.
  • We built an autoencoder model that consists of an encoder and decoder. The encoder reduces the input to a latent space, and the decoder reconstructs the original features.
  • The model was trained on only normal transactions to learn what normal patterns look like, which is key for detecting anomalies based on reconstruction errors.

Next Steps for Part 2 (Tomorrow)

  1. Fine-Tune the Model: Adjust hyperparameters and make improvements to the training process.
  2. Detect Anomalies: Use reconstruction error to classify transactions as either normal or fraudulent.
  3. Evaluate the Model: Assess model performance using metrics like Precision, Recall, and AUC to understand how well the model detects anomalies.

With today’s progress, we are ready to take on anomaly detection and fine-tuning tomorrow.
