Day 3 - 30 Days 30 Machine Learning Projects
Good Morning! It is Day 3 of the 30 Days 30 Machine Learning Projects challenge, and it is going great. I woke up at 5:17. All credit goes to my cat, Green, for scratching my head with his paws for his morning hunt. :)
If you want to go straight to the code, I’ve uploaded it to this repository: GIT REPO
The flow is going to be the same as I had briefly explained in the Day 1 and Day 2 progress posts. I will be using ChatGPT and moving forward with follow-up questions.
Talk about the Problem Please!
The problem of the day was “Recognizing handwritten digits with k-Nearest Neighbors on MNIST”. It is another classic machine learning problem. Here, we have to predict handwritten digits using the k-Nearest Neighbors (k-NN) algorithm, and it requires the MNIST dataset.
Wikipedia: The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems.
Understanding the Data
I used scikit-learn’s `fetch_openml` from the datasets module with the following arguments (a sketch of the call is below the list):
- `mnist_784`: This indicates that we want the MNIST data in which 28x28 images are flattened into 784-feature vectors.
- `version=1`: I specified that I wanted version 1 of the MNIST data.
- `as_frame=True`: I specified that I wanted it in pandas DataFrame format, as it is easier to debug, visualize, and manipulate pandas DataFrames.
- `parser='auto'`: On my local machine, I was getting a warning about the parser version, so I set it to auto to pick the one that works best for the environment.
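For reference, the full call looks roughly like this (a minimal sketch; `mnist_data` is simply the variable name I refer to later, and the `parser` argument needs a reasonably recent scikit-learn):

```python
from sklearn.datasets import fetch_openml

# Download MNIST from OpenML: 70,000 handwritten digits,
# each 28x28 image flattened into a 784-value row
mnist_data = fetch_openml('mnist_784', version=1, as_frame=True, parser='auto')
```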
Code Workflow
The workflow is divided into seven steps:
- Load the MNIST data
- Preprocess the Data
- Normalize the Data
- Split data into training and validation sets
- Create and Train Model
- Make Predictions and Evaluate
- Visualization
Let’s understand each step:
Step 1: Load the MNIST data
I have already briefly mentioned that I am using `fetch_openml`. See the Understanding the Data section.
Step 2: Preprocess the Data
As mentioned in Step 1, I load the data as a pandas DataFrame. I used `mnist_data.keys()` to see the list of keys the data contains. It has the following fields:
dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])
I used `data` and `target` to build my features (`X`) and target (`y`) sets.
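A minimal sketch of that step, assuming the `mnist_data` object from the loading call above:

```python
# mnist_data behaves like a dictionary:
# 'data' holds the 784 pixel columns, 'target' holds the digit labels
X = mnist_data['data']
y = mnist_data['target']
```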
Step 3: Normalize the Data
When dealing with image data, pixel values can range from 0 to 255 for 8-bit grayscale images. Normalizing these pixel values to the range between 0 and 1 is a common preprocessing step in machine learning tasks, particularly for algorithms that are sensitive to the scale of the input data, like k-Nearest Neighbors (k-NN).
I did `X /= 255.0`.
Step 4: Split the Data
I divided the data into an 80-20 ratio, that is, training (80%) and validation (20%) sets, using `train_test_split(X, y, test_size=0.2, random_state=42)`.
Here, `random_state=42` sets the seed for the randomness, ensuring that the same split occurs on every run. The number 42 is a commonly used arbitrary choice; there is no special logic behind it.
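The split itself is a one-liner (a sketch; `X_val` is referenced later in the post, while `X_train`, `y_train`, and `y_val` are the names I assume for the remaining pieces):

```python
from sklearn.model_selection import train_test_split

# 80% training / 20% validation; random_state=42 makes the split reproducible
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```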
Step 5: Create and Train Model
I am using `KNeighborsClassifier` from scikit-learn's neighbors module.
k-NN is a simple, instance-based learning algorithm that classifies new cases based on the majority vote of the k nearest samples from the training dataset. The ’nearest neighbors’ are determined by a distance metric, typically Euclidean distance. Here, K is user-defined.
Initially, I chose K=3.
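Creating and fitting the model looks roughly like this (a sketch assuming the split variables from Step 4; `knn` is just my variable name here):

```python
from sklearn.neighbors import KNeighborsClassifier

# k-NN with K=3 neighbors; fit() mostly just stores the training samples,
# the real work happens at prediction time
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
```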
Step 6: Make Predictions and Evaluate
I used a variable named `predictions` to store the predicted values for the 20% validation data (`X_val`).
Since this is a classification model, relying on accuracy alone is not sufficient. I used a Confusion Matrix to learn more about the model's performance.
A Confusion Matrix is helpful because it shows True Positives, False Positives, True Negatives, and False Negatives. I know it can be a little difficult to understand; please use the resources mentioned below to grasp it better.
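Roughly what the prediction and evaluation code looks like (a sketch; `predictions` and `X_val` are from the post, while `knn` and the metric imports are my assumptions):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Predict digits for the held-out 20% and compare against the true labels
predictions = knn.predict(X_val)
print("Accuracy:", accuracy_score(y_val, predictions))

# Rows are true digits, columns are predicted digits;
# off-diagonal cells show which digits get confused with each other
cm = confusion_matrix(y_val, predictions)
print(cm)
```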
Step 7: Visualization
I used `matplotlib.pyplot` and `seaborn` to create a heatmap of the confusion matrix. See how it looks below.
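The heatmap can be produced with something like this (a sketch assuming the `cm` confusion matrix from Step 6):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Annotated heatmap of the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted digit')
plt.ylabel('True digit')
plt.title('k-NN (K=3) confusion matrix on MNIST')
plt.show()
```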
Outcome of Experimenting with Different K
- At K=3, Accuracy: 0.9712857142857143
- At K=2, Accuracy: 0.9642142857142857
- At K=1, Accuracy: 0.972
- At K=5, Accuracy: 0.9700714285714286
- At K=10, Accuracy: 0.9657857142857142
I decided to stick with K=3.
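The experiment itself was essentially a small loop (a sketch; the K values are the ones listed above, and `score()` reports plain accuracy):

```python
# Compare validation accuracy for a few values of K
for k in [1, 2, 3, 5, 10]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    print(f"K={k}, Accuracy: {model.score(X_val, y_val)}")
```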
Gratitude
Today, I felt confident writing the code and using the libraries. It is the second classification problem; maybe that helped. I solved it in under 40 minutes, and then I started experimenting with different K values. It was fun. I am now enjoying the process and looking forward to solving more problems.
Stay Tuned!!