Day 22 - Recommender System With Matrix Factorization

Today, I built a recommender system that uses matrix factorization to predict how users will rate items they haven't interacted with. I used Singular Value Decomposition (SVD), a popular matrix factorization technique, to decompose the user-item interaction matrix and fill in the missing ratings, so that movies a user hasn't watched yet can be recommended based on the ratings they are predicted to give.

If you want to see the code, you can find it here: GIT REPO.

Understanding the Dataset:

I used the MovieLens small dataset, which contains user ratings for various movies. This dataset is widely used in recommendation system projects because it contains user-item interactions, which are perfect for collaborative filtering.

Step-by-Step Approach:

Step 1: Loading the Dataset

First, I loaded the ratings.csv file, which contains user ratings for different movies. It has the following columns:

  • userId: Unique identifier for each user.
  • movieId: Unique identifier for each movie.
  • rating: The rating given by the user to the movie (scale from 0.5 to 5.0).

# Step 1: Load the MovieLens dataset.
import pandas as pd

ratings = pd.read_csv('dataset/ratings.csv')
print(ratings.head())

Step 2: Preparing Data for the Surprise Library

To use the Surprise library, which is specifically designed for recommendation systems, I needed to format the data properly:

  • Reader: Defines the scale of ratings (0.5 to 5.0 in this case).
  • Dataset.load_from_df(): Converts the pandas DataFrame into the format required by Surprise for further processing.

# Step 2: Prepare the data for the Surprise library.
from surprise import Dataset, Reader

reader = Reader(rating_scale=(ratings['rating'].min(), ratings['rating'].max()))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

Step 3: Splitting the Data for Training and Validation

I split the dataset into training and validation sets to evaluate the model’s performance. The training set is used to build the model, and the validation set is used to test how well the model generalizes to unseen data.

# Step 3: Split the data into training and validation sets.
from surprise.model_selection import train_test_split

trainset, valset = train_test_split(data, test_size=0.2)

Step 4: Applying SVD for Matrix Factorization

SVD (Singular Value Decomposition) is a matrix factorization technique that helps break down a large user-item matrix into lower-dimensional matrices. This technique uncovers latent factors that represent hidden relationships between users and items.

  • SVD decomposes the matrix into three smaller matrices: one for users, one for items, and one diagonal matrix of singular values.
  • The model can then make predictions by reconstructing the matrix and predicting the missing values (i.e., ratings).
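
To make this concrete, here's a quick toy sketch (not from the project code) that factorizes a tiny, fully-filled ratings matrix with NumPy's exact SVD and reconstructs it from the top two latent factors. Note that Surprise's SVD algorithm doesn't compute this exact decomposition; it learns the user and item factors directly with stochastic gradient descent, but the low-rank intuition is the same.

# Toy illustration of low-rank factorization (assumed example, not the project data).
import numpy as np

# Tiny user-item matrix: 4 users x 3 movies.
R = np.array([[5.0, 3.0, 1.0],
              [4.0, 3.0, 1.0],
              [1.0, 1.0, 5.0],
              [1.0, 2.0, 4.0]])

# Decompose R into U (user factors), s (singular values), and Vt (item factors).
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep only the top-2 latent factors and reconstruct an approximation of R.
k = 2
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(R_approx, 2))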

To evaluate the model, I used cross-validation, which splits the dataset into different parts, trains the model on some parts, and tests it on others. I measured the model’s accuracy using RMSE (Root Mean Squared Error) and MSE (Mean Squared Error).

# Step 4: Matrix factorization using SVD (Singular Value Decomposition).
from surprise import SVD
from surprise.model_selection import cross_validate

svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MSE'], cv=5, verbose=True)

Cross-validation helps evaluate the performance of the model across multiple splits of the data, ensuring that the model generalizes well.

Cross-validation results:

  • RMSE measures the typical magnitude of the prediction error, in the same units as the ratings. Lower values indicate better predictions.
  • MSE measures the average squared difference between the predicted and actual ratings; RMSE is simply its square root.

Fold 1  RMSE: 0.8724  MSE: 0.7611
Fold 2  RMSE: 0.8690  MSE: 0.7552
Fold 3  RMSE: 0.8734  MSE: 0.7628
Fold 4  RMSE: 0.8805  MSE: 0.7752
Fold 5  RMSE: 0.8733  MSE: 0.7627
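
As a quick sanity check on these definitions, RMSE is the square root of MSE, which the fold results confirm (for example, sqrt(0.7611) ≈ 0.8724):

# Check the RMSE/MSE relation using hypothetical errors and the fold 1 values above.
import numpy as np

errors = np.array([0.5, -0.3, 1.0, -0.7])   # hypothetical prediction errors
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)
print(mse, rmse)

print(np.sqrt(0.7611))   # ~0.8724, matching the reported fold 1 RMSE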

Step 5: Training the Model on the Full Dataset

Once I evaluated the model using cross-validation, I retrained the SVD model on the entire dataset to maximize the amount of data the model sees.

# Step 5: Retrain the model on the full dataset.
trainset_full = data.build_full_trainset()
svd.fit(trainset_full)

Step 6: Making Predictions

After training, I used the model to predict how user 1 would rate movie 6, a movie they had previously rated. The prediction was very close to the actual rating:

# Step 6: Make Predictions
user_id = 1
movie_id = 6
prediction = svd.predict(user_id, movie_id)
print(f"Prediction for user {user_id} and movie {movie_id} is: {prediction}")

Output:

Prediction for user 1 and movie 6 is: user: 1  item: 6  r_ui = None  est = 4.49  {'was_impossible': False}

The model predicted a rating of 4.49, which is close to the actual rating of 4 given by user 1 for movie 6.
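
If you only need the number, the Prediction object exposes the estimate as .est, and the actual rating can be looked up in the original DataFrame for comparison. The r_ui = None in the output simply means the true rating was not passed to predict(). Here's a small sketch, assuming the ratings DataFrame from Step 1 is still in scope:

# Compare the estimated rating with the rating actually stored in the dataset.
actual = ratings[(ratings['userId'] == user_id) & (ratings['movieId'] == movie_id)]['rating'].iloc[0]
print(f"Estimated: {prediction.est:.2f}, actual: {actual}")   # ~4.49 vs 4.0 per the run above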

Step 7: Evaluating the Model

Finally, I evaluated the model on the validation set by calculating the RMSE. The RMSE for the validation set was 0.641, indicating that the predictions are quite close to the actual ratings. Note, however, that the model was retrained on the full dataset in Step 5, so the validation ratings were also seen during training, which makes this score more optimistic than the cross-validation RMSE of about 0.87.

# Step 7: Evaluate the model on the validation data.
from surprise import accuracy

val_predictions = svd.test(valset)
rmse = accuracy.rmse(val_predictions)
print(f"Root mean square error for validation data: {rmse}")

Output:

Root mean square error for validation data: 0.6410624356100669

Results

The RMSE of 0.641 on the validation set means the model's predictions are off by roughly 0.64 rating points, which is a good level of accuracy given that the ratings range from 0.5 to 5.0. However, since the validation ratings were part of the full training set used in Step 5, the cross-validation RMSE of about 0.87 is the more realistic estimate of performance on truly unseen ratings.
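
For a stricter estimate, one could fit a fresh model on the training split only and evaluate it on the held-out validation set, so no validation rating is seen during training. A minimal sketch, reusing the trainset/valset from Step 3:

# Fit on the training split only, then evaluate on ratings the model has never seen.
svd_holdout = SVD()
svd_holdout.fit(trainset)
holdout_predictions = svd_holdout.test(valset)
accuracy.rmse(holdout_predictions)   # expected to land closer to the ~0.87 cross-validation RMSE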

Gratitude

This was my first day exploring more advanced models, and I haven't yet fully grasped the details of how SVD works or why it is so effective. I plan to cover this topic in depth after the challenge.

Stay tuned for the next problem!
