Day 22 - Recommender System With Matrix Factorization
Today, I built a Recommender System using Matrix Factorization to predict how users will rate items they haven’t interacted with. I used Singular Value Decomposition (SVD), a popular matrix factorization technique, to decompose the user-item interaction matrix and predict missing ratings. The goal was to recommend movies to users by predicting the ratings they might give to movies they haven’t watched yet.
If you want to see the code, you can find it here: GIT REPO.
Understanding Dataset:
I used the MovieLens small dataset, which contains user ratings for various movies. This dataset is widely used in recommendation system projects because it contains user-item interactions, which are perfect for collaborative filtering.
Step-by-Step Approach:
Step 1: Loading the Dataset
First, I loaded the ratings.csv file, which contains user ratings for different movies. Besides a timestamp column, which I did not use, it has the following columns:
- userId: Unique identifier for each user.
- movieId: Unique identifier for each movie.
- rating: The rating given by the user to the movie (scale from 0.5 to 5.0).
# Step 1: Load the MovieLens dataset.
import pandas as pd

ratings = pd.read_csv('dataset/ratings.csv')
print(ratings.head())
Step 2: Preparing Data for the Surprise Library
To use the Surprise library, which is specifically designed for recommendation systems, I needed to format the data properly:
- Reader: Defines the scale of ratings (0.5 to 5.0 in this case).
- Dataset.load_from_df(): Converts the pandas DataFrame into the format required by Surprise for further processing.
# Step 2: Prepare the data for Surprise library.
from surprise import Dataset, Reader

reader = Reader(rating_scale=(ratings['rating'].min(), ratings['rating'].max()))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
Step 3: Splitting the Data for Training and Validation
I split the dataset into training and validation sets to evaluate the model’s performance. The training set is used to build the model, and the validation set is used to test how well the model generalizes to unseen data.
# Step 3: Split data into train-validation datasets
from surprise.model_selection import train_test_split

trainset, valset = train_test_split(data, test_size=0.2)
Step 4: Applying SVD for Matrix Factorization
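SVD (Singular Value Decomposition) is a matrix factorization technique that helps break down a large user-item matrix into lower-dimensional matrices. This technique uncovers latent factors that represent hidden relationships between users and items.
- Classical SVD decomposes a matrix into three matrices: one for users, one for items, and one diagonal matrix of singular values. It needs a fully filled matrix, which a sparse ratings matrix is not.
- The SVD class in Surprise implements the variant popularized by Simon Funk during the Netflix Prize: it learns a factor vector and a bias for every user and every item by stochastic gradient descent on the known ratings only.
- The model then predicts a missing rating as the global mean rating, plus the user and item biases, plus the dot product of the user and item factor vectors.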
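To make the idea of latent factors concrete, here is a minimal sketch of matrix factorization trained with per-rating gradient updates, in plain Python. It is not the Surprise implementation: the function names, the toy ratings, and the hyperparameters are assumptions chosen for illustration, and the bias terms are left out to keep it short.

```python
import random

def train_mf(ratings, n_factors=2, n_epochs=300, lr=0.05, reg=0.02, seed=0):
    """Learn one factor vector per user and per item from (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    # Small random starting vectors (illustrative initialization).
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + sum(a * b for a, b in zip(p[u], q[i])))
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                # Gradient step on the squared error with L2 regularization.
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return mu, p, q

def predict(model, u, i):
    """Estimate a rating; fall back to the global mean for unknown users or items."""
    mu, p, q = model
    if u not in p or i not in q:
        return mu
    return mu + sum(a * b for a, b in zip(p[u], q[i]))

def rmse(model, ratings):
    return (sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5

# Hypothetical toy ratings: three users, three movies, some pairs missing.
toy = [(1, "A", 5.0), (1, "B", 3.0), (2, "A", 4.0),
       (2, "C", 1.0), (3, "B", 2.5), (3, "C", 1.5)]

before = rmse(train_mf(toy, n_epochs=0), toy)  # untrained: close to predicting the mean
model = train_mf(toy)
after = rmse(model, toy)
print(f"Training RMSE before: {before:.3f}, after: {after:.3f}")
print(f"Estimated rating for user 3 on movie A: {predict(model, 3, 'A'):.2f}")
```

The pair (3, "A") never appears in the toy data, so the last line is a prediction for a missing entry of the user-item matrix, which is the same kind of estimate the recommender produces.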
To evaluate the model, I used cross-validation, which splits the dataset into different parts, trains the model on some parts, and tests it on others. I measured the model’s accuracy using RMSE (Root Mean Squared Error) and MSE (Mean Squared Error).
# Step 4: Matrix Factorization using (Singular Value Decomposition) SVD
from surprise import SVD
from surprise.model_selection import cross_validate

svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MSE'], cv=5, verbose=True)
Cross-validation evaluates the model across multiple splits of the data, which gives a more reliable estimate of how well it generalizes than a single split would.
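The splitting behind 5-fold cross-validation can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Surprise implements it; the function name and the sample count of 10 are assumptions.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs so that every row is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]  # k roughly equal slices
    for f in range(k):
        test_idx = folds[f]
        train_idx = [j for g in range(k) if g != f for j in folds[g]]
        yield train_idx, test_idx

splits = list(k_fold_indices(10, k=5))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train on {len(train_idx)} rows, test on {len(test_idx)} rows")
```

Each fold trains on four fifths of the rows and tests on the remaining fifth, so no row is ever scored by a model that saw it during training.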
Cross-validation results:
- RMSE is the square root of MSE, so it is expressed in rating points and penalizes large errors more heavily than small ones. Lower values indicate better predictions.
- MSE measures the average squared difference between the predicted and actual ratings.
Fold 1 RMSE: 0.8724 MSE: 0.7611
Fold 2 RMSE: 0.8690 MSE: 0.7552
Fold 3 RMSE: 0.8734 MSE: 0.7628
Fold 4 RMSE: 0.8805 MSE: 0.7752
Fold 5 RMSE: 0.8733 MSE: 0.7627
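The two metrics are directly related, which a short plain-Python sketch can show. The ratings below are hypothetical values made up for illustration; only the Fold 1 figures come from the results above.

```python
import math

def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted ratings."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of MSE, expressed in the same units as the ratings."""
    return math.sqrt(mse(actual, predicted))

actual = [4.0, 3.5, 5.0, 2.0]      # hypothetical true ratings
predicted = [3.5, 3.0, 4.0, 3.0]   # hypothetical model estimates

print(f"MSE:  {mse(actual, predicted):.4f}")   # 0.6250
print(f"RMSE: {rmse(actual, predicted):.4f}")  # 0.7906

# Fold 1 above reports MSE 0.7611; its square root matches the reported RMSE.
print(f"sqrt(0.7611) = {math.sqrt(0.7611):.4f}")  # 0.8724
```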
Step 5: Training the Model on the Full Dataset
Once I evaluated the model using cross-validation, I retrained the SVD model on the entire dataset to maximize the amount of data the model sees.
# Step 5: Train the model.
trainset_full = data.build_full_trainset()
svd.fit(trainset_full)
Step 6: Making Predictions
After training, I used the model to predict how user 1 would rate movie 6, a movie they had previously rated. The prediction was about half a point above the actual rating:
# Step 6: Make Predictions
user_id = 1
movie_id = 6
prediction = svd.predict(user_id, movie_id)
print(f"Prediction for user {user_id} and movie {movie_id} is: {prediction}")
Output:
Prediction for user 1 and movie 6 is: user: 1 item: 6 r_ui = None est = 4.49 {'was_impossible': False}
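The model predicted a rating of 4.49, which is close to the actual rating of 4 given by user 1 for movie 6. The output shows r_ui = None because the actual rating was not passed to predict(). Since the model was trained on the full dataset, it had already seen this rating, so this is a sanity check rather than a prediction for an unseen movie.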
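Turning single predictions into recommendations means scoring every movie a user has not rated and keeping the highest estimates. The sketch below shows that step in plain Python. The helper name top_n_unseen, the toy scores, and the toy predictor are assumptions for illustration; with the fitted Surprise model, the predict_rating argument could be something like lambda u, i: svd.predict(u, i).est.

```python
def top_n_unseen(user_id, all_items, seen_items, predict_rating, n=3):
    """Return the n unseen items with the highest predicted rating for one user."""
    candidates = [i for i in all_items if i not in seen_items]
    scored = [(i, predict_rating(user_id, i)) for i in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Hypothetical predicted ratings standing in for a trained model.
toy_scores = {(1, 10): 4.2, (1, 20): 3.1, (1, 30): 4.8, (1, 40): 2.5}

def toy_predict(user_id, item_id):
    return toy_scores.get((user_id, item_id), 3.0)

recs = top_n_unseen(1, all_items=[6, 10, 20, 30, 40], seen_items={6},
                    predict_rating=toy_predict, n=2)
print(recs)  # [(30, 4.8), (10, 4.2)]
```

Movie 6 is excluded because the user has already rated it, which mirrors the goal of recommending only movies the user has not watched yet.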
Step 7: Evaluating the Model
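Finally, I evaluated the model on the validation set by calculating the RMSE. The RMSE for the validation set was 0.641. Because the model had been retrained on the full dataset in Step 5, which includes the validation rows, this figure is optimistic. The cross-validation RMSE of about 0.87 from Step 4 is the more realistic estimate of performance on unseen ratings.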
# Step 7: Evaluate the model on validation data
from surprise import accuracy

val_predictions = svd.test(valset)
rmse = accuracy.rmse(val_predictions)
print(f"Root mean square error for validation data: {rmse}")
Output:
Root mean square error for validation data: 0.6410624356100669
Results
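The RMSE of 0.641 on the validation set corresponds to a typical error of roughly 0.64 rating points on a scale from 0.5 to 5.0. That number is flattered by the fact that the validation rows were part of the data used for the final training run. The cross-validation RMSE of about 0.87 is the fairer measure of accuracy on ratings the model has not seen.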
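A validation score only reflects generalization if the validation rows were kept out of training. The sketch below makes that point with a deliberately extreme toy "model" that memorizes every rating it has seen. The model, the function names, and the six ratings are all assumptions for illustration, not part of the project code.

```python
import math

def fit_memorizer(rows):
    """Toy model: recall seen (user, item) ratings, else predict the global mean."""
    mean = sum(r for _, _, r in rows) / len(rows)
    table = {(u, i): r for u, i, r in rows}
    return lambda u, i: table.get((u, i), mean)

def rmse(predict, rows):
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in rows) / len(rows))

# Hypothetical ratings as (user, item, rating) triples.
rows = [(1, "A", 5.0), (1, "B", 3.0), (2, "A", 4.0),
        (2, "B", 2.0), (3, "A", 4.5), (3, "B", 1.0)]
train, val = rows[:4], rows[4:]

honest = rmse(fit_memorizer(train), val)  # validation rows never seen in training
leaky = rmse(fit_memorizer(rows), val)    # validation rows were part of training

print(f"Trained on train only: RMSE {honest:.3f}")
print(f"Trained on everything: RMSE {leaky:.3f}")
```

A real model will not memorize this perfectly, but the direction of the effect is the same. With Surprise, calling svd.fit(trainset) on the training split from Step 3 before svd.test(valset) would keep the validation rows unseen.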
Gratitude
This was my first day exploring more advanced models, and I have not yet fully grasped how SVD works or why it is effective. I plan to cover this topic in depth after the challenge.
Stay tuned for the next problem!