Day 6 - 30 Days 30 Machine Learning Projects
Hey, it’s Day 6 of the 30 Days 30 Machine Learning Projects challenge. Today’s problem was “Predict wine quality from physicochemical properties using SVM”. This is the 5th classification problem in a row. We will learn what SVM is and how it works, along with other important machine learning techniques.
If you want to go straight to the code, I’ve uploaded it to this repository: GIT REPO
The process will be the same as I briefly explained in the previous progress posts. I’ll use ChatGPT and ask follow-up questions.
Talk about the Problem Please!
Today, we used Support Vector Machines (SVM) to predict the quality of wine based on its physicochemical properties (like acidity, sugar, and alcohol content). The goal was to build a model that could classify wine into different quality categories (from 0 to 10) using SVM.
What is SVM?
Support Vector Machines (SVM) are powerful classifiers that find the best boundary (hyperplane) between different classes. Imagine you have data points scattered in space, and you need to separate them into different groups. SVM draws a line (in 2D) or a plane (in 3D) that best divides these points.
In our case, we used SVC (Support Vector Classifier), a type of SVM designed for classification tasks. To handle the non-linearity of our data, we used the RBF (Radial Basis Function) kernel, which creates curved decision boundaries to separate complex data.
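To make the kernel idea concrete, here is a minimal, self-contained sketch (separate from today’s wine project) that compares a linear and an RBF SVC on scikit-learn’s make_circles toy data, which no straight line can separate:
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable
X_toy, y_toy = make_circles(n_samples=500, noise=0.1, factor=0.3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

for kernel in ('linear', 'rbf'):
    clf = SVC(kernel=kernel).fit(X_tr, y_tr)
    print(kernel, clf.score(X_te, y_te))  # RBF should score near 1.0; linear near chance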
Understanding the Data
We used the Wine Quality Dataset from Kaggle, which contains the physicochemical properties of wine and the corresponding quality ratings. The features include attributes like acidity, sugar levels, and alcohol content, and the target is the wine quality score. Download it locally and put it in the dataset directory at the root level of this repository.
Code Workflow
The process was divided into several steps:
- Load the data
- Separate the features and the target
- Scale the features
- Split the data into training and validation sets
- Create and train the SVM model
- Make predictions and evaluate the model
- Visualize the results
Step 1: Load the Data
I loaded the wine quality dataset using pandas:
import pandas as pd

data_df = pd.read_csv('dataset/WineQT.csv', sep=',')
Here’s how it looks:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality Id
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 0
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5 1
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5 2
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6 3
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 4
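Since wine quality scores are heavily imbalanced (most wines sit in the middle of the scale), a quick look at the label distribution is useful before modeling; this becomes relevant when we read the classification report later:
print(data_df['quality'].value_counts().sort_index())  # counts per quality score; 5 and 6 dominate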
Step 2: Preprocess the Data
We separated the features (physicochemical properties) from the target (quality):
X = data_df.drop('quality', axis=1)  # note: the Id column is kept as a feature here; it is just a row identifier, so dropping it as well may be preferable
y = data_df['quality']
Step 3: Scaling the Data
To make sure all features contribute equally, we applied StandardScaler to standardize the data. Scaling is the process of transforming your data so that all features (variables) end up on a similar scale or range. It’s commonly done in machine learning to ensure that no feature dominates the others simply because of its larger numerical range. More technically, StandardScaler transforms each feature to have a mean of 0 and a standard deviation of 1.
Let’s understand scaling with an example:
Suppose we have a small dataset with two features: height and weight. The two features live on different scales, and height takes larger numerical values than weight.
- Height (cm): 160, 170, 150, 180, 175
- Weight (kg): 65, 70, 55, 85, 75
Before Scaling
Let’s calculate the mean and standard deviation for each feature:
Height:
- Mean: 167
- Standard Deviation: ≈ 10.77 (StandardScaler uses the population standard deviation, dividing by n rather than n - 1)
Weight:
- Mean: 70
- Standard Deviation: 10
After Applying StandardScaler:
For each value, we use the formula:
Scaled Value = (Original Value - Mean) / Standard Deviation
For example,
- For height 160, the scaled value is (160 - 167) / 10.77 ≈ -0.65
- For weight 65, the scaled value is (65 - 70) / 10 = -0.5
Here’s how the scaled values would look:
- Heights: -0.65, 0.28, -1.58, 1.21, 0.74
- Weights: -0.5, 0, -1.5, 1.5, 0.5
Visualizing the Output
- Original data:
  - Height ranges from 150 to 180 cm.
  - Weight ranges from 55 to 85 kg.
- After scaling:
  - The transformed height and weight values are centered around 0, and their standard deviations are 1.
  - This ensures that the data has zero mean and unit variance, meaning all features are on the same scale.
Now, let’s code it up:
from sklearn.preprocessing import StandardScaler

standard_scale = StandardScaler()
X_scaled = standard_scale.fit_transform(X)
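As a sanity check on the height/weight example above, you can run StandardScaler on those ten values directly (a standalone snippet, not part of the project code):
import numpy as np
from sklearn.preprocessing import StandardScaler

demo = np.array([[160, 65], [170, 70], [150, 55], [180, 85], [175, 75]], dtype=float)
print(StandardScaler().fit_transform(demo).round(2))
# Each column now has mean 0 and unit (population) standard deviation,
# matching the hand-computed values above.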
Step 4: Split the Data
We divided the data into an 80-20 ratio: training (80%) and validation (20%) sets using:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
Here, random_state=42 sets the seed for randomness. This ensures the same split occurs on every run. The number 42 is commonly used but has no special meaning.
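Because the quality classes are imbalanced, one optional refinement (not used in today’s run, so results may differ) is a stratified split, which preserves the class proportions in both sets; this assumes every class has at least two samples:
X_train, X_val, y_train, y_val = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y)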
Step 5: Create and Train the SVM Model
We used the SVM classifier with the RBF kernel (kernel='rbf'). This kernel helps the model deal with non-linear data by creating curved decision boundaries.
from sklearn.svm import SVC

model = SVC(kernel='rbf')
model.fit(X_train, y_train)
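Under the hood, the RBF kernel scores the similarity of two points as exp(-gamma * ||x - x'||^2): identical points score 1, and the score decays toward 0 with distance. A tiny standalone illustration (the gamma value here is arbitrary, chosen just for the demo, not the value SVC uses by default):
import numpy as np

def rbf_kernel(x1, x2, gamma=0.1):  # gamma is illustrative, not the model's value
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

print(rbf_kernel(np.array([1.0, 2.0]), np.array([1.0, 2.0])))  # 1.0 (identical points)
print(rbf_kernel(np.array([1.0, 2.0]), np.array([4.0, 6.0])))  # ~0.08 (far apart)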
Step 6: Make Predictions and Evaluate
After training, we used the model to predict the wine quality for the validation set. We calculated accuracy and generated a confusion matrix to understand the model’s performance better.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

predictions = model.predict(X_val)

# Use distinct variable names so we don't shadow the imported functions
accuracy = accuracy_score(y_val, predictions)
print("Accuracy Score:\n", accuracy)
cm = confusion_matrix(y_val, predictions)
print("Confusion Matrix:\n", cm)
report = classification_report(y_val, predictions, zero_division=0)
print("Classification Report:\n", report)
We also used the zero_division=0 parameter to avoid warnings when a certain quality label might not be predicted.
Step 7: Visualization
Finally, we visualized the confusion matrix using seaborn to see how the model performed across different wine quality levels:
import matplotlib.pyplot as plt
import seaborn as sns

# Tick labels must match the classes actually present in the matrix
# (quality 3 never shows up in this validation split)
labels = sorted(set(y_val) | set(predictions))

plt.figure(figsize=(8, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Reds', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Quality')
plt.ylabel('Actual Quality')
plt.title('Confusion Matrix')
plt.show()
Model Performance
Accuracy Score:
0.6593886462882096
Confusion Matrix:
[[ 0 3 3 0 0]
[ 0 72 24 0 0]
[ 0 27 69 3 0]
[ 0 1 15 10 0]
[ 0 0 1 1 0]]
Classification Report:
precision recall f1-score support
4 0.00 0.00 0.00 6
5 0.70 0.75 0.72 96
6 0.62 0.70 0.65 99
7 0.71 0.38 0.50 26
8 0.00 0.00 0.00 2
accuracy 0.66 229
macro avg 0.41 0.37 0.38 229
weighted avg 0.64 0.66 0.64 229
Key Takeaways
- SVM is a powerful algorithm for classification, especially with the RBF kernel, which handles non-linear data effectively.
- Scaling the features is important to ensure all variables contribute equally.
- More advanced models or hyperparameter tuning might improve predictions further (see the sketch below).
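For that last point, here is a rough sketch of hyperparameter tuning with GridSearchCV; the grid values are an illustrative guess, not a tested recipe:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10, 100],           # illustrative values
    'gamma': ['scale', 0.01, 0.1, 1],
}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)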
Gratitude
It was a great learning experience working with SVM today. Looking forward to the next problem.
Stay Tuned!
Posts in this series
- Day 26- Time Series Forecasting of Electricity Consumption Using LSTM (Intro to Deep Learning)
- Day 25 - Sentiment Analysis of Customer Reviews Using Traditional NLP Techniques
- Day 24 - K-Means Clustering to Segment Customers Based on Behavior
- Day 23 - Fraud Detection in Financial Transactions Using Logistic Regression and Random Forest
- Day 22 - Recommender System With Matrix Factorization
- Day 21 - Deploy a Machine Learning Model Using FastAPI and Heroku for Real-Time Predictions
- Day 20 - 30 Days 30 ML Projects: Create a Topic Model Using Latent Dirichlet Allocation (LDA)
- Day 19 - 30 Days 30 ML Projects: Customer Churn Prediction With XGBoost
- Day 18 - 30 Days 30 ML Projects: Time Series Forecasting of Stock Prices With ARIMA Model
- Day 17 - 30 Days 30 ML Projects: Predict Diabetes Onset Using Decision Trees and Random Forests
- Day 16 - 30 Days 30 ML Projects: Real-Time Face Detection in a Webcam Feed Using OpenCV
- Day 15 - 30 Days 30 ML Projects: Predict House Prices With XGBoost
- Day 14 - 30 Days 30 ML Projects: Cluster Grocery Store Customers With K-Means
- Day 13 - 30 Days 30 ML Projects: Build a Music Genre Classifier Using Audio Features Extraction
- Day 12 - 30 Days 30 Machine Learning Projects Challenge
- Day 11 - 30 Days 30 Machine Learning Projects: Anomaly Detection With Isolation Forest
- Day 10 - 30 Days 30 Machine Learning Projects: Recommender System Using Collaborative Filtering
- Day 9 - 30 Days 30 Machine Learning Projects
- Day 8 - 30 Days 30 Machine Learning Projects
- Day 7 - 30 Days 30 Machine Learning Projects
- Day 6 - 30 Days 30 Machine Learning Projects
- Day 5 - 30 Days 30 Machine Learning Projects
- Day 4 - 30 Days 30 Machine Learning Projects
- Day 3 - 30 Days 30 Machine Learning Projects
- Day 2 - 30 Days 30 Machine Learning Projects
- Day 1 - 30 Days 30 Machine Learning Projects
- 30 Days 30 Machine Learning Projects Challenge