Day 5 - 30 Days 30 Machine Learning Projects
Good morning! It’s Day 5 of the 30 Days 30 Machine Learning Projects Challenge.
If you want to go straight to the code, I’ve uploaded it to this repository: GIT REPO
The process will be the same as I briefly explained in the previous progress posts. I’ll use ChatGPT and ask follow-up questions.
Talk about the Problem Please!
Today’s problem was “Filter spam from a collection of emails using Naive Bayes”. This is the fourth classification problem in a row, and today we will learn about the Naive Bayes classifier.
Naive Bayes is a simple yet powerful algorithm often used for text classification. It uses probabilities to predict which class (spam or not spam) an email belongs to, based on the words in that email.
Here’s a simple analogy:
You have a box of fruits, and you know how often apples and oranges appear in that box. If someone hands you a fruit, you can guess whether it’s an apple or orange based on the features (like color or size). Similarly, Naive Bayes guesses whether an email is spam or not based on the frequency of certain words like “win,” “free,” or “meeting.”
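To make that concrete, here’s a tiny, hand-rolled sketch of the Bayes rule the classifier relies on. The probabilities below are made up for illustration; they are not drawn from today’s dataset:

# Bayes' rule on a single word, with made-up numbers (not from the dataset below).
p_spam = 0.5                  # prior: assume half of all emails are spam
p_ham = 0.5
p_win_given_spam = 0.40       # "win" shows up in 40% of spam emails
p_win_given_ham = 0.05        # ...but in only 5% of ham emails

# P(spam | "win") = P("win" | spam) * P(spam) / P("win")
p_win = p_win_given_spam * p_spam + p_win_given_ham * p_ham
p_spam_given_win = (p_win_given_spam * p_spam) / p_win
print(f"P(spam | 'win') = {p_spam_given_win:.2f}")  # ~0.89

The “naive” part is that the classifier multiplies such per-word probabilities together, assuming the words are independent of each other given the class.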
Understanding the Data
Since I didn’t have a large dataset initially, I created a small dataset of 20 emails (10 spam and 10 non-spam) to get started. Each email was labeled as either spam or ham (non-spam). Here’s a sample of that dataset:
- Spam: “Win a $1000 Walmart gift card! Click here to claim now!”
- Non-Spam: “Hey, are you coming to the party tonight?”
Code Workflow
The workflow is divided into seven steps:
- Create the Data
- Preprocessing
- Convert the text data into numerical features
- Split data into training and validation sets
- Create and Train Model
- Make Predictions and Evaluate
- Visualization
Let’s understand each step:
Step 1: Create the Data
I created a small dataset manually, using a dictionary to represent emails and their labels (spam or non-spam). For this small dataset, I used pandas to load the data into a DataFrame:
import pandas as pd

data = {
'label': ['spam', 'ham', 'spam', 'ham', 'ham', 'spam', 'spam', 'ham', 'ham', 'spam',
'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'spam', 'ham', 'spam', 'ham'],
'email': [
'Win a $1000 Walmart gift card! Click here to claim now!',
'Hey, are you coming to the party tonight?',
'Congratulations! You have won a free vacation to the Bahamas!',
'Can we reschedule our meeting to 3 PM?',
'Your Amazon order has been shipped.',
'You have been selected for a cash prize! Call now to claim.',
'Urgent! Your account has been compromised, please reset your password.',
'Don’t forget about the doctor’s appointment tomorrow.',
'Your package is out for delivery.',
'Get rich quick by investing in this opportunity. Don’t miss out!',
'Can you send me the latest project report?',
'Exclusive offer! Buy one, get one free on all items.',
'Are you free for lunch tomorrow?',
'Claim your free iPhone now by clicking this link!',
'I’ll call you back in 5 minutes.',
'Get a $500 loan approved instantly. No credit check required!',
'Hurry! Limited-time offer, act now to win a $1000 gift card.',
'Let’s catch up over coffee this weekend.',
'You’ve been pre-approved for a personal loan. Apply today!',
'Meeting reminder for Monday at 10 AM.'
]
}
data_df = pd.DataFrame(data)
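A quick, optional sanity check that the DataFrame loaded as expected:

print(data_df.shape)   # (20, 2): 20 emails, 2 columns
print(data_df.head())  # peek at the first five rows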
Step 2: Preprocess the Data
For binary classification, we mapped spam emails to 1 and non-spam (ham) emails to 0. This step allows the model to understand which class each email belongs to:
data_df['label'] = data_df['label'].map({'ham': 0, 'spam': 1})
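Optionally, confirm the classes are balanced before training:

print(data_df['label'].value_counts())  # expect 10 of each: 1 (spam) and 0 (ham)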
Step 3: Convert the Text Data Into Numerical Features
Emails are in text form, so we need to convert them into numerical data. For this, I used the CountVectorizer from sklearn, which creates a matrix of word counts for each email (a “bag of words” model).
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')  # drop common English stop words
X = vectorizer.fit_transform(data_df['email'])  # Convert the text into a bag-of-words matrix.
print("Vocabulary:\n", vectorizer.get_feature_names_out())
print("Count Matrix:\n", X.toarray())
This transforms each email into a row of numbers, where each number represents how many times a word from the vocabulary appeared in that email.
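If that feels abstract, here’s a minimal sketch on a hypothetical two-email corpus (not today’s dataset) showing exactly what comes out:

from sklearn.feature_extraction.text import CountVectorizer

toy = ["win a free prize", "meeting tomorrow at the office"]
toy_vec = CountVectorizer(stop_words='english')  # "a", "at", "the" get dropped
toy_counts = toy_vec.fit_transform(toy)
print(toy_vec.get_feature_names_out())  # ['free' 'meeting' 'office' 'prize' 'tomorrow' 'win']
print(toy_counts.toarray())             # [[1 0 0 1 0 1]
                                        #  [0 1 1 0 1 0]]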
Step 4: Split data
I divided the data into an 80-20 ratio: training (80%) and validation (20%) sets using:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, data_df['label'], test_size=0.2, random_state=42)
Here, random_state=42 sets the seed for randomness. This ensures the same split occurs on every run. The number 42 is commonly used but has no special meaning.
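With 20 emails and test_size=0.2, that works out to 16 training and 4 validation emails. A quick, optional shape check:

print("Train:", X_train.shape[0], "emails")  # 16
print("Val:  ", X_val.shape[0], "emails")    # 4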
Step 5: Create and Train Model
For this task, I used the Multinomial Naive Bayes model from sklearn. It’s particularly well-suited for text data:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train)
The model learned from the training data, looking at how often certain words (like “win,” “free,” or “meeting”) appeared in spam and non-spam emails.
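If you’re curious what “learned” means here, MultinomialNB exposes per-class word log-probabilities via its feature_log_prob_ attribute. This optional sketch (assuming the vectorizer and model fitted above) lists the words that most strongly indicate spam in our tiny dataset:

import numpy as np

# feature_log_prob_[c][i] = log P(word_i | class c); classes_ is [0, 1] here (ham, spam).
words = vectorizer.get_feature_names_out()
spam_vs_ham = model.feature_log_prob_[1] - model.feature_log_prob_[0]
top_spam = np.argsort(spam_vs_ham)[-5:]  # indices of the 5 most spam-leaning words
print("Most spam-indicative words:", words[top_spam])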
Step 6: Make Predictions and Evaluate
After training, I used the model to predict whether the emails in the validation set were spam or not. To evaluate the model’s performance, I calculated its accuracy and created a confusion matrix:
from sklearn.metrics import accuracy_score, confusion_matrix

predictions = model.predict(X_val)
accuracy = accuracy_score(y_val, predictions)  # don't name this accuracy_score, or it shadows the function
print("Accuracy Score:\n", accuracy)

cm = confusion_matrix(y_val, predictions)  # likewise, avoid reusing the name confusion_matrix
print("\nConfusion Matrix:\n", cm)
Accuracy alone doesn’t tell the full story. That’s why I also used a confusion matrix to get a better picture of how the model handled spam and non-spam emails.
Here’s a breakdown of the confusion matrix (a small snippet to unpack these counts follows the list):
- True Positives (TP): Correctly predicted spam emails.
- True Negatives (TN): Correctly predicted non-spam emails.
- False Positives (FP): Non-spam emails incorrectly predicted as spam.
- False Negatives (FN): Spam emails incorrectly predicted as non-spam.
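As promised, here’s a small snippet to read those four numbers directly. For binary labels 0/1, sklearn lays the matrix out as [[TN, FP], [FN, TP]], and from the counts we can also derive precision and recall for the spam class:

# Unpack the 2x2 confusion matrix; sklearn's layout is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# Precision and recall for the spam class (guarding against division by zero).
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"Precision={precision:.2f}, Recall={recall:.2f}")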
Our model’s performance:
Accuracy Score:
1.0
Confusion Matrix:
[[2 0]
[0 2]]
Step 7: Visualization
I used matplotlib.pyplot and seaborn to create a heatmap of the confusion matrix.
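The plotting code itself is short. Here’s a minimal sketch, assuming the cm variable from the evaluation step:

import matplotlib.pyplot as plt
import seaborn as sns

# Annotated heatmap of the 2x2 confusion matrix, with readable class labels.
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['ham', 'spam'], yticklabels=['ham', 'spam'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Spam Filter Confusion Matrix')
plt.show()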
We can see the model predicts:
- True Positives (TP): 2 correctly predicted spam emails
- True Negatives (TN): 2 correctly predicted non-spam emails
Gratitude
The Naive Bayes algorithm is simple yet effective for text classification problems like spam filtering. While the model worked well on this small dataset, I will try it on a larger dataset in future posts.
Stay Tuned!!
Posts in this series
- Day 26- Time Series Forecasting of Electricity Consumption Using LSTM (Intro to Deep Learning)
- Day 25 - Sentiment Analysis of Customer Reviews Using Traditional NLP Techniques
- Day 24 - K-Means Clustering to Segment Customers Based on Behavior
- Day 23 - Fraud Detection in Financial Transactions Using Logistic Regression and Random Forest
- Day 22 - Recommender System With Matrix Factorization
- Day 21 - Deploy a Machine Learning Model Using FastAPI and Heroku for Real-Time Predictions
- Day 20 - 30 Days 30 ML Projects: Create a Topic Model Using Latent Dirichlet Allocation (LDA)
- Day 19 - 30 Days 30 ML Projects: Customer Churn Prediction With XGBoost
- Day 18 - 30 Days 30 ML Projects: Time Series Forecasting of Stock Prices With ARIMA Model
- Day 17 - 30 Days 30 ML Projects: Predict Diabetes Onset Using Decision Trees and Random Forests
- Day 16 - 30 Days 30 ML Projects: Real-Time Face Detection in a Webcam Feed Using OpenCV
- Day 15 - 30 Days 30 ML Projects: Predict House Prices With XGBoost
- Day 14 - 30 Days 30 ML Projects: Cluster Grocery Store Customers With K-Means
- Day 13 - 30 Days 30 ML Projects: Build a Music Genre Classifier Using Audio Features Extraction
- Day 12 - 30 Days 30 Machine Learning Projects Challenge
- Day 11 - 30 Days 30 Machine Learning Projects: Anomaly Detection With Isolation Forest
- Day 10 - 30 Days 30 Machine Learning Projects: Recommender System Using Collaborative Filtering
- Day 9 - 30 Days 30 Machine Learning Projects
- Day 8 - 30 Days 30 Machine Learning Projects
- Day 7 - 30 Days 30 Machine Learning Projects
- Day 6 - 30 Days 30 Machine Learning Projects
- Day 5 - 30 Days 30 Machine Learning Projects
- Day 4 - 30 Days 30 Machine Learning Projects
- Day 3 - 30 Days 30 Machine Learning Projects
- Day 2 - 30 Days 30 Machine Learning Projects
- Day 1 - 30 Days 30 Machine Learning Projects
- 30 Days 30 Machine Learning Projects Challenge