Day 10 - 30 Days 30 Machine Learning Projects: Recommender System Using Collaborative Filtering

Hey, it’s Day 10 of the 30 Days 30 Machine Learning Projects Challenge. Today’s task was to build a Recommender System using Collaborative Filtering on a user-item ratings matrix. It was an exciting challenge that helped me understand how recommendation engines like the ones used by Netflix and Amazon work!

If you want to see the code, you can find it here: GIT REPO.

The Problem

The goal today was to predict how users would rate movies that they haven’t watched yet, based on the ratings they’ve given to other movies. This was done using Collaborative Filtering, a popular technique in recommendation systems.

What is Collaborative Filtering?

Collaborative Filtering is a method used by recommender systems to suggest items to users by looking at the preferences of similar users or similar items. There are two main types of collaborative filtering:

  • User-Based Collaborative Filtering: Recommends items to a user based on items liked by similar users.
  • Item-Based Collaborative Filtering: Recommends items similar to the ones the user has already liked.

For this project, I implemented Item-Based Collaborative Filtering, which focuses on finding similarities between movies based on user ratings and making predictions accordingly.
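To make the difference between the two types concrete, here is a minimal sketch on a made-up ratings matrix. It assumes cosine similarity (introduced in the next section) as the similarity measure; the helper `cosine_matrix` and the toy numbers are only for illustration and are not part of the project code. User-based filtering compares rows, item-based filtering compares columns.

```python
import numpy as np

def cosine_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # avoid dividing by zero for empty rows
    Xn = X / norms
    return Xn @ Xn.T

# Hypothetical ratings: 3 users (rows) x 4 movies (columns), 0 = not rated
R = np.array([[5, 4, 0, 1],
              [4, 5, 0, 2],
              [0, 1, 5, 4]], dtype=float)

user_sim = cosine_matrix(R)      # user-based: compare rows, shape (3, 3)
item_sim = cosine_matrix(R.T)    # item-based: compare columns, shape (4, 4)
print(user_sim.shape, item_sim.shape)
```

The only change between the two variants is whether the matrix is transposed before the similarities are computed, which is also what the transpose in Step 3 does.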

Cosine Similarity

To determine how similar two movies are, I used Cosine Similarity. This metric measures how similar two vectors (in this case, the rating vectors of two movies) are by calculating the cosine of the angle between them.

  • If two movies are rated in a similar pattern by the same users, their cosine similarity will be close to 1 (very similar).
  • If two movies are rated by mostly different users, or in very different patterns, their cosine similarity will be closer to 0 (not similar).

The formula for cosine similarity is:

$$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \times \lVert B \rVert}$$

Where:

  • A and B are the rating vectors for two movies.
  • The dot product A⋅B is the sum of the products of corresponding elements from the two vectors.
  • The denominator normalizes the values to account for the magnitudes of the vectors.
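As a small worked example of the formula, the sketch below computes cosine similarity by hand with NumPy. The function name `cosine_sim` and the three rating vectors are hypothetical, chosen only to show one similar pair and one dissimilar pair.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Return 0.0 when either vector is all zeros, to avoid dividing by zero
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical ratings from four users for three movies (0 = not rated)
movie_a = [5, 4, 0, 1]
movie_b = [4, 5, 0, 2]
movie_c = [1, 0, 5, 4]

print(round(cosine_sim(movie_a, movie_b), 3))  # same users, similar pattern
print(round(cosine_sim(movie_a, movie_c), 3))  # mostly different pattern
```

For the first pair the dot product is 42 and the norms are √42 and √45, giving roughly 0.97. For the second pair the dot product is 9 and both norms are √42, giving roughly 0.21.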

Approach and Code Workflow

Step 1: Load the Data

I used the MovieLens dataset from Kaggle, which contains user ratings for movies. This dataset has information on users, movies, and the ratings given by users to different movies. Download it, unzip it, and put it in the dataset directory of your project so that the files sit under dataset/ml-latest-small/.

import pandas as pd

# Load the ratings dataset
ratings = pd.read_csv('dataset/ml-latest-small/ratings.csv')

# Load the movies dataset (optional for movie names)
movies = pd.read_csv('dataset/ml-latest-small/movies.csv')

Step 2: Create the User-Item Matrix

I created a matrix where rows represent users, columns represent movies, and the values represent the ratings given by users to movies.

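# Pivot the long ratings table into a users x movies matrix
user_item_matrix = ratings.pivot(index='userId', columns='movieId', values='rating')
# Movies a user has not rated become 0
user_item_matrix = user_item_matrix.fillna(0)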
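To see what the pivot produces, here is a minimal sketch on a made-up ratings table with the same column names as ratings.csv. The variable names `toy_ratings` and `toy_matrix` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical ratings in the same long format as ratings.csv
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3],
    'movieId': [10, 20, 10, 30, 20],
    'rating':  [4.0, 5.0, 3.0, 2.5, 4.5],
})

# Rows are users, columns are movies, missing ratings are filled with 0
toy_matrix = toy_ratings.pivot(index='userId', columns='movieId', values='rating').fillna(0)
print(toy_matrix)

# Share of cells that hold no rating
sparsity = (toy_matrix == 0).to_numpy().mean()
print(f"Share of empty cells: {sparsity:.2f}")
```

Two things are worth keeping in mind: `pivot` raises an error if the same user rated the same movie more than once, and after `fillna(0)` an unrated movie looks exactly like a rating of 0 to any computation that follows.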

Step 3: Calculate Cosine Similarity

To recommend movies based on similar ones, I used Cosine Similarity to calculate how similar the movies are based on their ratings.

from sklearn.metrics.pairwise import cosine_similarity

# Calculate the cosine similarity between items (movies)
item_similarity = cosine_similarity(user_item_matrix.T)  # Transpose to get movie-to-movie similarity
item_similarity_df = pd.DataFrame(item_similarity, index=user_item_matrix.columns, columns=user_item_matrix.columns)
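Once the similarity DataFrame exists, looking up the nearest neighbours of a movie is a one-liner. The sketch below rebuilds a tiny similarity table from a made-up matrix so it runs on its own; the helper `most_similar` and the movie IDs are hypothetical.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical user-item matrix: rows are users, columns are movie IDs
toy_matrix = pd.DataFrame(
    [[5, 4, 0], [4, 5, 0], [0, 0, 5], [1, 2, 4]],
    index=[1, 2, 3, 4], columns=[10, 20, 30], dtype=float)

# Movie-to-movie similarity, labelled with movie IDs on both axes
sim = cosine_similarity(toy_matrix.T)
sim_df = pd.DataFrame(sim, index=toy_matrix.columns, columns=toy_matrix.columns)

def most_similar(movie_id, sim_df, n=2):
    """Return the n movies most similar to movie_id, excluding itself."""
    return sim_df[movie_id].drop(movie_id).sort_values(ascending=False).head(n)

print(most_similar(10, sim_df))
```

On the real data the same call would work with `item_similarity_df`, and the returned movie IDs could be joined against the movies DataFrame to show titles.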

Step 4: Make Predictions Based on Similarity

To predict how a user would rate a movie they haven’t rated yet, I used the similarity between movies and the ratings the user has given to similar movies.

import numpy as np

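# Predict ratings as a similarity-weighted sum of each user's ratings,
# divided by each movie's total similarity to all movies.
# Unrated movies count as 0 in the sum, which pulls predictions toward 0.
def predict_ratings(user_item_matrix, similarity_matrix):
    return np.dot(user_item_matrix, similarity_matrix) / np.abs(similarity_matrix).sum(axis=1)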

# Make predictions using item similarity
predicted_ratings = predict_ratings(user_item_matrix.values, item_similarity)

# Convert the predictions back into a DataFrame for readability
predicted_ratings_df = pd.DataFrame(predicted_ratings, index=user_item_matrix.index, columns=user_item_matrix.columns)
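One common variation, which is not what the code above does, is to divide only by the similarities of the movies a user actually rated, so that missing ratings do not drag every prediction toward 0. The sketch below is a minimal version of that idea under the assumption that 0 means "not rated"; the function name `predict_rated_only` and the toy matrices are hypothetical.

```python
import numpy as np

def predict_rated_only(R, S):
    """Similarity-weighted average over the items each user actually rated.

    R: (n_users, n_items) ratings with 0 meaning "not rated".
    S: (n_items, n_items) symmetric item-item similarity matrix.
    """
    rated = (R > 0).astype(float)   # 1 where a rating exists
    numer = R @ S                   # sum of similarity * rating
    denom = rated @ np.abs(S)       # sum of |similarity| over rated items only
    # Leave the prediction at 0 where the user has no rated neighbour
    return np.divide(numer, denom, out=np.zeros_like(numer), where=denom > 0)

# Hypothetical data: 2 users x 3 movies and a made-up similarity matrix
R = np.array([[5.0, 0.0, 0.0],
              [4.0, 0.0, 2.0]])
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
pred = predict_rated_only(R, S)
print(pred.round(2))
```

For the first toy user, who rated a single movie 5.0, this variant predicts 5.0 for every movie, whereas dividing by the full similarity sum would give about 1.9 for the second movie. Because each prediction is a weighted average of real ratings, it always stays within the range of the ratings the user gave.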

Step 5: Evaluate the Model

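I evaluated the model using Root Mean Squared Error (RMSE). RMSE tells us how far off our predicted ratings are from the actual ratings, in the same units as the ratings themselves. The lower the RMSE, the better the model. Note that the error here is measured on the same ratings that were used to build the similarity matrix, not on held-out data.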

from sklearn.metrics import mean_squared_error

# Flatten the matrices so each entry is one (user, movie) pair
true_ratings = user_item_matrix.values.flatten()
predicted_flat = predicted_ratings_df.values.flatten()

# Score only the entries where the user actually gave a rating
rated_mask = true_ratings > 0
rmse = np.sqrt(mean_squared_error(true_ratings[rated_mask], predicted_flat[rated_mask]))
print(f"Root Mean Squared Error: {rmse}")

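Unfortunately, the RMSE came out to be 9.89, which is very high, given that the ratings in the dataset range from 0.5 to 5. An RMSE larger than the width of the whole rating scale means that at least some predictions land far outside the valid range, so the model’s predictions were not accurate.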
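A more telling check is to hide some known ratings, build the similarities without them, and measure the error only on the hidden ones. The sketch below shows that idea on randomly generated toy data, using the same prediction rule as `predict_ratings`; the 20% split, the matrix size, and all variable names are assumptions for illustration, and the number it prints says nothing about MovieLens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ratings matrix: 30 users x 12 movies, 0 = not rated
R = rng.integers(1, 6, size=(30, 12)).astype(float)
R[rng.random(R.shape) < 0.5] = 0.0

# Hide 20% of the known ratings for testing
known = np.argwhere(R > 0)
test_idx = known[rng.choice(len(known), size=len(known) // 5, replace=False)]
train = R.copy()
train[test_idx[:, 0], test_idx[:, 1]] = 0.0

# Item-item cosine similarity from the training ratings only
norms = np.linalg.norm(train, axis=0)
norms[norms == 0] = 1.0
S = (train.T @ train) / np.outer(norms, norms)

# Same prediction rule as predict_ratings, guarded against a zero denominator
denom = np.abs(S).sum(axis=1)
denom[denom == 0] = 1.0
pred = (train @ S) / denom

# Compare predictions with the hidden ratings only
actual = R[test_idx[:, 0], test_idx[:, 1]]
held_out_pred = pred[test_idx[:, 0], test_idx[:, 1]]
rmse = np.sqrt(np.mean((actual - held_out_pred) ** 2))
print(f"Held-out RMSE on toy data: {rmse:.3f}")
```

Because the hidden ratings never influence the similarities, this score reflects how well the model generalizes to ratings it has not seen.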

Model Performance

The RMSE value of 9.89 means the predicted ratings are far off from the actual ratings, indicating that this simple collaborative filtering model isn’t performing well. There are a couple of potential improvements we could make:

  • Using Advanced Algorithms: Matrix factorization models, fitted for example with SVD or ALS (Alternating Least Squares), handle sparse data better and could reduce the error.
  • Feature Engineering: We could add additional features, such as user preferences, genres, or movie popularity, to improve the accuracy of the predictions.
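To give a feel for the matrix factorization idea, here is a very simple sketch that uses NumPy's plain SVD on a mean-centered toy matrix and keeps only the top k factors. This is a simplification I am assuming for illustration: dedicated recommender libraries fit the factors on observed ratings only, which this sketch does not do. The function name `svd_predict`, the toy matrix, and the clipping range of 0.5 to 5.0 are hypothetical choices.

```python
import numpy as np

def svd_predict(R, k=2):
    """Rank-k reconstruction of a ratings matrix (0 = not rated)."""
    rated = R > 0
    # Per-user mean over rated items only; users with no ratings get 0
    counts = rated.sum(axis=1)
    user_mean = np.divide(R.sum(axis=1), counts,
                          out=np.zeros(len(R)), where=counts > 0)
    # Center known ratings; missing ones sit at the user's mean (centered value 0)
    centered = np.where(rated, R - user_mean[:, None], 0.0)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    # Keep only the k strongest latent factors
    approx = (U[:, :k] * s[:k]) @ Vt[:k, :]
    # Add the means back and clip to the assumed rating scale
    return np.clip(approx + user_mean[:, None], 0.5, 5.0)

# Hypothetical ratings: 4 users x 4 movies
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [0, 1, 5, 4],
              [1, 0, 4, 5]], dtype=float)
pred = svd_predict(R, k=2)
print(pred.round(2))
```

The low-rank reconstruction fills in the empty cells from patterns shared across users and movies, which is the property that makes factorization attractive for sparse rating data.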

Gratitude

This project was a great learning experience, even though the model didn’t perform as expected. I’m looking forward to diving deeper into more advanced recommendation algorithms in future projects.

Stay tuned!