Day 10 - 30 Days 30 Machine Learning Projects: Recommender System Using Collaborative Filtering
Hey, it’s Day 10 of the 30 Days 30 Machine Learning Projects Challenge. Today’s task was to build a Recommender System using Collaborative Filtering on a user-item ratings matrix. It was an exciting challenge that helped me understand how recommendation engines like the ones used by Netflix and Amazon work!
If you want to see the code, you can find it here: GIT REPO.
The Problem
The goal today was to predict how users would rate movies that they haven’t watched yet, based on the ratings they’ve given to other movies. This was done using Collaborative Filtering, a popular technique in recommendation systems.
What is Collaborative Filtering?
Collaborative Filtering is a method used by recommender systems to suggest items to users by looking at the preferences of similar users or similar items. There are two main types of collaborative filtering:
- User-Based Collaborative Filtering: Recommends items to a user based on items liked by similar users.
- Item-Based Collaborative Filtering: Recommends items similar to the ones the user has already liked.
For this project, I implemented Item-Based Collaborative Filtering, which focuses on finding similarities between movies based on user ratings and making predictions accordingly.
Cosine Similarity
To determine how similar two movies are, I used Cosine Similarity. This metric measures how similar two vectors (in this case, the rating vectors of two movies) are by taking the cosine of the angle between them. Because ratings are never negative, the value always falls between 0 and 1.
- If two movies are rated similarly by the same users, their cosine similarity will be close to 1 (very similar).
- If two movies were rated by mostly different users, so their rating vectors barely overlap, their cosine similarity will be closer to 0 (not similar).
The formula for cosine similarity is:
$$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
Where:
- A and B are the rating vectors for two movies.
- The dot product A⋅B is the sum of the products of corresponding elements from the two vectors.
- The denominator is the product of the two vectors’ lengths (Euclidean norms), which normalizes the result so that the magnitudes of the vectors don’t affect it.
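To make the formula concrete, here is a minimal sketch that computes it by hand with NumPy. The rating vectors and the helper name `cosine_sim` are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors; returns 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical ratings from four users (0 = not rated)
movie_a = [5, 4, 0, 1]
movie_b = [4, 5, 0, 2]   # rated by the same users as movie_a, with similar scores
movie_c = [0, 0, 5, 0]   # rated only by a user who skipped movie_a

sim_ab = cosine_sim(movie_a, movie_b)
sim_ac = cosine_sim(movie_a, movie_c)
print(f"A vs B: {sim_ab:.3f}")
print(f"A vs C: {sim_ac:.3f}")
```

Movies A and B come out close to 1 because the same users rated them similarly, while A and C score 0 because no user rated both.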
Approach and Code Workflow
Step 1: Load the Data
I used the MovieLens dataset from Kaggle, which contains user ratings for movies. This dataset has information on users, movies, and the ratings given by users to different movies. Download it, unzip it, and put it in the dataset directory of your project.
import pandas as pd
# Load the ratings dataset
ratings = pd.read_csv('dataset/ml-latest-small/ratings.csv')
# Load the movies dataset (optional for movie names)
movies = pd.read_csv('dataset/ml-latest-small/movies.csv')
Step 2: Create the User-Item Matrix
I created a matrix where rows represent users, columns represent movies, and the values represent the ratings given by users to movies. Movies a user hasn’t rated are filled with 0.
user_item_matrix = ratings.pivot(index='userId', columns='movieId', values='rating')
user_item_matrix.fillna(0, inplace=True)
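To see what this pivot produces, here is a small sketch on a hand-made table in the same long format as ratings.csv. The rows and the names `toy_ratings`, `toy_matrix`, and `density` are illustrative assumptions. Note that `pivot` raises an error if the same (userId, movieId) pair appears twice, which the sketch assumes does not happen.

```python
import pandas as pd

# Hypothetical ratings in long format: one row per (user, movie) pair
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.5],
})

# Rows become users, columns become movies, missing ratings become 0
toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating").fillna(0)
print(toy_matrix)

# Fraction of cells that hold a real rating
density = (toy_matrix > 0).to_numpy().mean()
print(f"density: {density:.2f}")
```

On real rating data most cells are empty, so the density is usually far lower than in this toy table.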
Step 3: Calculate Cosine Similarity
To recommend movies based on similar ones, I used Cosine Similarity to calculate how similar the movies are based on their ratings.
from sklearn.metrics.pairwise import cosine_similarity
# Calculate the cosine similarity between items (movies)
item_similarity = cosine_similarity(user_item_matrix.T) # Transpose to get movie-to-movie similarity
item_similarity_df = pd.DataFrame(item_similarity, index=user_item_matrix.columns, columns=user_item_matrix.columns)
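The similarity matrix can also be used directly to look up the movies closest to a given one. The sketch below does this on a tiny made-up matrix, computing the cosine similarities with plain NumPy instead of scikit-learn so it stays self-contained. The matrix `R` and the helper `most_similar` are hypothetical.

```python
import numpy as np

# Hypothetical user-item matrix: 4 users (rows) x 3 movies (columns), 0 = not rated
R = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 5.0],
    [1.0, 2.0, 4.0],
])

# Scale each movie column to unit length, then take all pairwise dot products
norms = np.linalg.norm(R, axis=0)
unit_cols = R / norms
item_sim = unit_cols.T @ unit_cols  # shape (n_movies, n_movies)

def most_similar(item_sim, movie_idx, k=2):
    """Return the indices of the k movies most similar to movie_idx, excluding itself."""
    order = np.argsort(-item_sim[movie_idx])
    return [int(j) for j in order if j != movie_idx][:k]

print(np.round(item_sim, 3))
print(most_similar(item_sim, 0))
```

With the real data, the same lookup could be done on a row of `item_similarity_df` and joined with `movies` to show titles instead of IDs.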
Step 4: Make Predictions Based on Similarity
To predict how a user would rate a movie they haven’t rated yet, I used the similarity between movies and the ratings the user has given to similar movies.
import numpy as np
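# Predict ratings as a similarity-weighted average of each user's ratings
def predict_ratings(user_item_matrix, similarity_matrix):
    # Divide by the total similarity per movie so the weights sum to 1
    # (row sums equal column sums because the similarity matrix is symmetric)
    return np.dot(user_item_matrix, similarity_matrix) / np.abs(similarity_matrix).sum(axis=1)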
# Make predictions using item similarity
predicted_ratings = predict_ratings(user_item_matrix.values, item_similarity)
# Convert the predictions back into a DataFrame for readability
predicted_ratings_df = pd.DataFrame(predicted_ratings, index=user_item_matrix.index, columns=user_item_matrix.columns)
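One detail of this formula is worth seeing on a small example: unrated movies enter the weighted average as ratings of 0, which pulls predictions toward 0. The sketch below compares the formula from `predict_ratings` with a variant that normalizes only over the movies a user actually rated. The toy matrix and the names `pred_all` and `pred_rated_only` are assumptions for illustration, not part of the project code.

```python
import numpy as np

# Hypothetical user-item matrix: 4 users x 3 movies, 0 = not rated
R = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 5.0],
    [1.0, 2.0, 4.0],
])
norms = np.linalg.norm(R, axis=0)
S = (R / norms).T @ (R / norms)  # item-item cosine similarity

# Same formula as predict_ratings: unrated movies count as a rating of 0
pred_all = R @ S / np.abs(S).sum(axis=1)

# Variant: divide only by the similarities of movies the user has rated
rated = (R > 0).astype(float)
weights = rated @ np.abs(S)
pred_rated_only = np.divide(R @ S, weights, out=np.zeros_like(R), where=weights > 0)

print(np.round(pred_all, 2))
print(np.round(pred_rated_only, 2))
```

The third user rated a single movie with 5, so the variant predicts 5 for every movie, while the original formula predicts well below 1 for the first movie. Neither is obviously right, but the comparison shows how much the zeros influence the output.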
Step 5: Evaluate the Model
I evaluated the model using Root Mean Squared Error (RMSE). RMSE tells us how far off our predicted ratings are from the actual ratings. The lower the RMSE, the better the model.
from sklearn.metrics import mean_squared_error
# Flatten the matrices and calculate RMSE
true_ratings = user_item_matrix.values.flatten()
predicted_ratings = predicted_ratings_df.values.flatten()
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(true_ratings[true_ratings > 0], predicted_ratings[true_ratings > 0]))
print(f"Root Mean Squared Error: {rmse}")
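Unfortunately, the RMSE came out to be 9.89, which is very high given that the ratings in the dataset only go up to 5. This suggests that the model’s predictions were not very accurate. The number itself is also worth re-checking: with non-negative ratings and similarities, the weighted average in Step 4 always stays between 0 and 5, so the code as shown should not be able to produce an error larger than 5.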
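Note that the evaluation above scores the model on the same ratings it was built from. A more common setup hides some known ratings, builds the similarities from the rest, and measures the error only on the hidden ones. Here is a minimal sketch of that idea on randomly generated ratings. The data, the 20% hold-out fraction, and all variable names are assumptions, and the prediction uses a weighted average over rated movies only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ratings: 30 users x 8 movies, values 1-5, about half missing (0)
R_full = rng.integers(1, 6, size=(30, 8)).astype(float)
R_full[rng.random(R_full.shape) < 0.5] = 0.0

# Hold out roughly 20% of the known ratings as a test set
known = np.argwhere(R_full > 0)
test_idx = known[rng.random(len(known)) < 0.2]
R_train = R_full.copy()
R_train[test_idx[:, 0], test_idx[:, 1]] = 0.0

# Item-item cosine similarity from the training ratings only
norms = np.linalg.norm(R_train, axis=0)
norms[norms == 0] = 1.0  # avoid dividing by zero for movies with no training ratings
unit = R_train / norms
S = unit.T @ unit

# Weighted average over the movies each user rated in the training set
weights = (R_train > 0).astype(float) @ np.abs(S)
pred = np.divide(R_train @ S, weights, out=np.zeros_like(R_train), where=weights > 0)

# Score only the held-out ratings
y_true = R_full[test_idx[:, 0], test_idx[:, 1]]
y_pred = pred[test_idx[:, 0], test_idx[:, 1]]
rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
print(f"held-out RMSE: {rmse:.3f}")
```

Because the ratings here are random, the resulting number says nothing about MovieLens. The point is only the split-then-score structure.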
Model Performance
An RMSE of 9.89 means the predicted ratings are far from the actual ratings, so this simple collaborative filtering model isn’t performing well. Several improvements are possible, such as:
- Using Advanced Algorithms: Models like Matrix Factorization (SVD) or ALS (Alternating Least Squares) handle sparse data better and could reduce the error.
- Feature Engineering: We could add additional features, such as user preferences, genres, or movie popularity, to improve the accuracy of the predictions.
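As a rough illustration of the matrix factorization idea, the sketch below applies a plain truncated SVD to a small, mean-centered toy matrix using NumPy. This is a simplification: practical recommenders usually fit the factors only on observed ratings, for example with ALS. The matrix, the choice of `k`, and the variable names are all hypothetical.

```python
import numpy as np

# Hypothetical user-item matrix: 4 users x 4 movies, 0 = not rated
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 2.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])
mask = R > 0

# Center each user's ratings on their mean; missing entries become 0,
# which is the same as filling them with the user's mean before centering
user_means = R.sum(axis=1) / mask.sum(axis=1)
centered = np.where(mask, R - user_means[:, None], 0.0)

# Keep only the k largest singular values (k is a tunable choice)
k = 2
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
approx = (U[:, :k] * s[:k]) @ Vt[:k]

# Add the user means back and clip to the valid rating range
pred_svd = np.clip(approx + user_means[:, None], 0.5, 5.0)
print(np.round(pred_svd, 2))
```

The low-rank reconstruction fills the previously empty cells with values inferred from the overall rating pattern, which is the basic mechanism behind factorization-based recommenders.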
Gratitude
This project was a great learning experience, even though the model didn’t perform as expected. I’m looking forward to diving deeper into more advanced recommendation algorithms in future projects.
Stay tuned!