Day 21: Fine-Tune and Evaluate Autoencoder Model for Anomaly Detection
Today, we continued with our autoencoder-based anomaly detection by fine-tuning the model and evaluating its performance on the Credit Card Fraud Detection Dataset. Our objectives were:
- Fine-Tune the Autoencoder Model: Improve model performance by adjusting hyperparameters.
- Determine Reconstruction Error Threshold: Use reconstruction error to classify normal vs. fraudulent transactions.
- Evaluate Performance: Utilize metrics like Precision, Recall, F1-Score, and AUC to understand the model’s effectiveness.
Step 1: Fine-Tuning the Autoencoder Model
In this step, we experimented with different model configurations and hyperparameters to try to reduce the reconstruction error.
Modify the Model Architecture
We experimented with the number of neurons and layers to see if a different configuration would yield better results.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Adjust the model complexity
model = Sequential()
# Encoder
model.add(Dense(20, activation='relu', input_shape=(X_train.shape[1],)))
model.add(Dense(10, activation='relu'))
# Latent representation
model.add(Dense(5, activation='relu'))  # Smaller latent space to capture key features
# Decoder
model.add(Dense(10, activation='relu'))
model.add(Dense(20, activation='relu'))
model.add(Dense(X_train.shape[1], activation='linear'))
# Compile with modified learning rate
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss='mse')
model.summary()
# Train on normal transactions only, so the model learns to reconstruct them well
history = model.fit(X_train_normal, X_train_normal,
                    epochs=100,
                    batch_size=128,
                    validation_split=0.2,
                    verbose=1)
- Encoder/Decoder Changes: We added more layers and neurons to increase model capacity, while the smaller latent space forced the model to focus on the key patterns.
- Learning Rate: The learning rate was set to 0.001 to ensure smoother convergence.
- Training Epochs: Increased the number of epochs to 100 for more training time.
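Rather than adjusting one knob at a time, a small grid search can make this fine-tuning less ad hoc. Below is a minimal sketch of such a sweep; the build_autoencoder helper, the candidate values, and the short 20-epoch budget are our own choices, and X_train_normal is the normal-only training data from the code above.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def build_autoencoder(n_features, latent_dim, learning_rate):
    # Same symmetric architecture as above, with a configurable bottleneck
    model = Sequential([
        Dense(20, activation='relu', input_shape=(n_features,)),
        Dense(10, activation='relu'),
        Dense(latent_dim, activation='relu'),
        Dense(10, activation='relu'),
        Dense(20, activation='relu'),
        Dense(n_features, activation='linear'),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate), loss='mse')
    return model

# Short training runs, just enough to compare configurations
results = {}
for latent_dim in (3, 5, 8):
    for lr in (1e-3, 1e-4):
        candidate = build_autoencoder(X_train_normal.shape[1], latent_dim, lr)
        hist = candidate.fit(X_train_normal, X_train_normal,
                             epochs=20, batch_size=128,
                             validation_split=0.2, verbose=0)
        results[(latent_dim, lr)] = min(hist.history['val_loss'])

best = min(results, key=results.get)
print(f"Best (latent_dim, learning_rate): {best}, val_loss = {results[best]:.4f}")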
Plot Training Loss
We plotted the training and validation loss to understand if the model was learning effectively.
import matplotlib.pyplot as plt

plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.title('Training and Validation Loss for Fine-Tuned Autoencoder')
plt.show()
- A decreasing trend in both losses indicated that the model's reconstructions were improving; a large gap between the two would have indicated overfitting.
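One way to act on that signal automatically is Keras's built-in EarlyStopping callback, which halts training once the validation loss stops improving. A minimal sketch; the patience value is our own choice:
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss',          # watch the validation loss
                           patience=10,                 # stop after 10 epochs with no improvement
                           restore_best_weights=True)   # roll back to the best epoch

history = model.fit(X_train_normal, X_train_normal,
                    epochs=100,
                    batch_size=128,
                    validation_split=0.2,
                    callbacks=[early_stop],
                    verbose=1)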
Step 2: Set Threshold for Anomaly Detection
After training the autoencoder, we calculated the reconstruction error for all transactions in the test set and used this to classify transactions as either normal or anomalous.
import numpy as np

# Predict the reconstructed test data
X_test_pred = model.predict(X_test)
# Calculate reconstruction error
reconstruction_errors = np.mean(np.power(X_test - X_test_pred, 2), axis=1)
- Reconstruction Error: The Mean Squared Error (MSE) between X_test and X_test_pred, computed for each sample by averaging the squared differences across its features.
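Before fixing a threshold, it helps to see how well the two error distributions separate. A quick sketch of overlaid histograms; the log scale and bin count are our own choices, since frauds are vastly outnumbered:
plt.hist(reconstruction_errors[y_test == 0], bins=100, alpha=0.6, label='Normal')
plt.hist(reconstruction_errors[y_test == 1], bins=100, alpha=0.6, label='Fraud')
plt.yscale('log')  # fraud counts are tiny compared to normal transactions
plt.xlabel('Reconstruction error (MSE)')
plt.ylabel('Count (log scale)')
plt.legend()
plt.title('Reconstruction Error by Class')
plt.show()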
Determine the Threshold
We set a threshold at the 95th percentile of the reconstruction errors of the normal transactions (y_test == 0) in the test set.
threshold = np.percentile(reconstruction_errors[y_test == 0], 95)
print(f"Threshold for anomaly detection: {threshold}")
- A high reconstruction error indicates an anomaly: because the autoencoder was trained only on normal transactions, it has difficulty reconstructing fraudulent ones.
- The 95th percentile was chosen as a starting point to balance false positives against false negatives (see the sweep below).
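To see how sensitive the results are to that choice of 95, we can sweep a few percentiles and watch precision and recall trade off. A minimal sketch; it reuses the test-set normals for the threshold, just as the code above does:
from sklearn.metrics import precision_score, recall_score

for p in (90, 95, 99, 99.5):
    t = np.percentile(reconstruction_errors[y_test == 0], p)
    preds = (reconstruction_errors > t).astype(int)
    print(f"p={p}: threshold={t:.3f}, "
          f"precision={precision_score(y_test, preds):.3f}, "
          f"recall={recall_score(y_test, preds):.3f}")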
Step 3: Classify and Evaluate
Using the reconstruction error threshold, we classified each test transaction as normal or anomalous.
y_pred = [1 if error > threshold else 0 for error in reconstruction_errors]
# Compare the number of actual anomalies with the number detected
print(f"Actual Anomalies: {sum(y_test)}, Detected: {sum(y_pred)}")
- Threshold-based Classification: Transactions with reconstruction errors above the threshold are labeled as 1 (fraud), while those below are labeled as 0 (normal).
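As an aside, the same labeling can be written as a single vectorized NumPy expression, which is noticeably faster than a Python list comprehension over ~57,000 rows and produces identical labels:
y_pred = (reconstruction_errors > threshold).astype(int)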
Evaluate Model Performance
We used classification metrics to evaluate the model’s performance:
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# AUC Score
auc_score = roc_auc_score(y_test, reconstruction_errors)
print(f"AUC Score: {auc_score:.2f}")
- Confusion Matrix: Shows True Negatives, False Positives, False Negatives, and True Positives.
- Classification Report: Includes Precision, Recall, and F1-Score.
- Precision tells us how many flagged transactions were actually fraud.
- Recall tells us how many of the fraudulent transactions were detected.
- F1-Score is a harmonic mean of precision and recall.
- AUC Score: Measures how well the raw reconstruction errors separate the two classes across all possible thresholds. A higher AUC is better; the sketch below plots the full ROC curve.
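Because a single AUC number hides where the trade-offs lie, we can also plot the curve it summarizes. A minimal sketch, reusing reconstruction_errors, auc_score, and plt from the earlier steps:
from sklearn.metrics import roc_curve

fpr, tpr, _ = roc_curve(y_test, reconstruction_errors)
plt.plot(fpr, tpr, label=f'Autoencoder (AUC = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', label='Chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.title('ROC Curve for Reconstruction-Error Scores')
plt.show()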
Observations from Evaluation
Threshold for anomaly detection: 1.0146805608893366
Actual Anomalies: 98, Detected: 2932
Confusion Matrix:
[[54020  2844]
 [   10    88]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.95      0.97     56864
           1       0.03      0.90      0.06        98

    accuracy                           0.95     56962
   macro avg       0.51      0.92      0.52     56962
weighted avg       1.00      0.95      0.97     56962
The metrics showed the following:
- The model had high recall (0.90), meaning it detected most of the fraudulent transactions, but very low precision (0.03), meaning the vast majority of flagged transactions were actually normal.
- High False Positive Rate: Many normal transactions were incorrectly flagged as fraud, which indicates that the threshold may need further fine-tuning to reduce these false positives.
Improving the Model
To improve the model’s performance, we could try:
- Adjusting the Threshold: Trying different percentile values, or deriving the threshold from a held-out validation set so the test labels are never consulted (see the sketch after this list).
- Model Complexity: Adding more neurons or layers, or even using a Variational Autoencoder (VAE) for better results.
- Resampling the Data: To handle the imbalance in the dataset, we could use oversampling for the fraud class to help the model learn these patterns better.
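For the first idea, a cleaner variant is to hold out a slice of the normal training data and derive the threshold from it alone. A minimal sketch; the 10% split, the 99th percentile, and the random_state are our own choices, and for full rigor the autoencoder would be retrained on X_fit only:
from sklearn.model_selection import train_test_split

# Hold out 10% of the normal training data purely for threshold selection
X_fit, X_val = train_test_split(X_train_normal, test_size=0.1, random_state=42)

# (Ideally, retrain the autoencoder on X_fit only before this step.)
val_errors = np.mean(np.power(X_val - model.predict(X_val), 2), axis=1)

# Pick the threshold from validation errors alone -- no test labels involved
threshold = np.percentile(val_errors, 99)
y_pred = (reconstruction_errors > threshold).astype(int)
print(f"Validation-derived threshold: {threshold}")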