Day 6 - 30 Days 30 Machine Learning Projects
Hey, it’s Day 6 of the 30 Days 30 Machine Learning Projects Challenge. Today’s problem was “Predict wine quality from physicochemical properties using SVM”. This is the 5th classification problem in a row. We will learn what SVM is and how it works, along with other important machine learning techniques such as feature scaling.
If you want to go straight to the code, I’ve uploaded it to this repository GIT REPO
The process will be the same as I briefly explained in the previous progress posts. I’ll use ChatGPT and ask follow-up questions.
Talk about the Problem Please!
Today, we used Support Vector Machines (SVM) to predict the quality of wine based on its physicochemical properties (like acidity, sugar, and alcohol content). The goal was to build a model that could classify wine into different quality categories (from 0 to 10) using SVM.
What is SVM?
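Support Vector Machines (SVM) are powerful classifiers that find the best boundary (hyperplane) between different classes. Imagine you have data points scattered in space, and you need to separate them into different groups. SVM draws a line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions) that divides these points while keeping the largest possible margin to the closest points of each class. Those closest points are the support vectors.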
In our case, we used SVC (Support Vector Classifier), a type of SVM designed for classification tasks. To handle the nonlinearity of our data, we used the RBF (Radial Basis Function) kernel, which creates curved decision boundaries to separate complex data.
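To see why the kernel matters, here is a small sketch on toy data rather than the wine dataset. It assumes scikit-learn's make_circles generator, where one class forms a ring around the other, and compares a linear kernel with the RBF kernel. The sample size, noise level, and seeds are arbitrary choices for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data: one class forms a ring around the other, so no straight line separates them
X, y = make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = clf.score(X_val, y_val)  # mean accuracy on held-out points
    print(f"{kernel:>6} kernel accuracy: {scores[kernel]:.2f}")
```

On data like this the linear kernel should do little better than guessing, while the RBF kernel can wrap a curved boundary around the inner class.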
Understanding the Data
We used the Wine Quality Dataset from Kaggle, which contains the physicochemical properties of wine and the corresponding quality ratings. The features include attributes like acidity, sugar levels, and alcohol content, and the target is the wine quality score. Download it locally and put it in the dataset directory at the root level of this repository.
Code Workflow
The process was divided into several steps:
 Load the data
 Preprocess the data
 Data Preprocessing: Feature scaling
 Split the data into training and validation sets
 Create and train the SVM model
 Make predictions and evaluate the model
 Visualization
Step 1: Load the Data
I loaded the wine quality dataset using pandas:
import pandas as pd
data_df = pd.read_csv('dataset/WineQT.csv', sep=',')
Here’s how it looks:
  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | Id
0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0
1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 | 1
2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 | 2
3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 | 3
4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 4
Step 2: Preprocess the Data
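We separated the features (physicochemical properties) from the target (quality). The Id column is only a row identifier; the line below keeps it among the features, although dropping it too would arguably be cleaner: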
X = data_df.drop('quality', axis=1)
y = data_df['quality']
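As a variation on the split above, here is a minimal sketch that also drops the Id column, since a row identifier carries no information about the wine itself. It uses a tiny hand-built DataFrame with a few columns from the first three rows shown earlier, not the full CSV. It is not what produced the results later in this post.

```python
import pandas as pd

# First three rows from the table above, with only a few of the columns
data_df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8],
    "alcohol": [9.4, 9.8, 9.8],
    "quality": [5, 5, 5],
    "Id": [0, 1, 2],
})

# Drop the target and the row identifier so neither is used as a feature
X = data_df.drop(columns=["quality", "Id"])
y = data_df["quality"]

print(list(X.columns))  # ['fixed acidity', 'alcohol']
```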
Step 3: Data Preprocessing: Scaling the Data
To make sure all features contribute equally, we applied StandardScaler to standardize the data. Scaling transforms your data so that all features (variables) are on a similar scale or range. It’s commonly done in machine learning to ensure that no feature dominates the others simply because of its larger numerical range. This matters for SVM in particular, because the RBF kernel is computed from distances between data points.
More technically, StandardScaler transforms each feature to have a mean of 0 and a standard deviation of 1.
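To make the "one feature dominates" point concrete, here is a small sketch using only the standard library. It takes the alcohol and total sulfur dioxide values from rows 0 and 1 of the table shown earlier and measures how much each feature contributes to the unscaled Euclidean distance between the two wines. The variable names are just for illustration.

```python
import math

# (alcohol, total sulfur dioxide) for rows 0 and 1 of the table above
wine_a = (9.4, 34.0)
wine_b = (9.8, 67.0)

squared_diffs = [(a - b) ** 2 for a, b in zip(wine_a, wine_b)]
distance = math.sqrt(sum(squared_diffs))
sulfur_share = squared_diffs[1] / sum(squared_diffs)

print(f"distance: {distance:.2f}")
print(f"share of squared distance from total sulfur dioxide: {sulfur_share:.4f}")
```

Almost all of the distance comes from total sulfur dioxide, simply because its values are larger. Without scaling, a distance-based model would barely notice the difference in alcohol.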
Let’s understand scaling with an example:
Suppose we have a small dataset with two features: height and weight. The two features are measured in different units, and the height values are numerically much larger than the weight values.
 Height (cm): 160, 170, 150, 180, 175
 Weight (kg): 65, 70, 55, 85, 75
Before Scaling
Let’s calculate the mean and standard deviation for each feature (the population standard deviation, which is what StandardScaler uses):
Height:
 Mean: 167
 Standard Deviation: 10.77
Weight:
 Mean: 70
 Standard Deviation: 10
After Applying StandardScaler:
For each value, we use the formula:
Scaled Value = (Original Value - Mean) / Standard Deviation
For example,
 For height 160, scaled value will be (160 - 167) / 10.77 ~ -0.65
 For weight 65, scaled value will be (65 - 70) / 10 = -0.5
Here’s how the scaled values would look:
 Heights: -0.65, 0.28, -1.58, 1.21, 0.74
 Weights: -0.5, 0, -1.5, 1.5, 0.5
Visualizing the Output

Original Data:
 Height ranges from 150 to 180 cm.
 Weight ranges from 55 to 85 kg.

After Scaling:
 The transformed height and weight values are now centered around 0, and their standard deviations are 1.
 This ensures that the data has zero mean and unit variance, meaning all features are on the same scale.
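The height and weight example can be checked with a few lines of plain Python. This sketch uses the standard library's pstdev (the population standard deviation, dividing by n), which is how StandardScaler computes its scale. The standardize helper is a name made up for this illustration.

```python
from statistics import mean, pstdev

heights = [160, 170, 150, 180, 175]
weights = [65, 70, 55, 85, 75]

def standardize(values):
    """Return (value - mean) / population standard deviation for each value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (divides by n)
    return [round((v - mu) / sigma, 2) for v in values]

scaled_heights = standardize(heights)
scaled_weights = standardize(weights)
print(scaled_heights)  # [-0.65, 0.28, -1.58, 1.21, 0.74]
print(scaled_weights)  # [-0.5, 0.0, -1.5, 1.5, 0.5]
```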
Now, let’s code it up:
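from sklearn.preprocessing import StandardScaler
# Learn each column's mean and standard deviation, then standardize the features
standard_scale = StandardScaler()
X_scaled = standard_scale.fit_transform(X)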
Step 4: Split the Data
We split the data 80/20 into training (80%) and validation (20%) sets using:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
Here, random_state=42 sets the seed for randomness. This ensures the same split occurs on every run. The number 42 is commonly used but has no special meaning.
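Here is a quick sketch of what a fixed seed does, using a small made-up array instead of the wine data. Calling train_test_split twice with the same random_state should return exactly the same rows, while a different seed (7 is an arbitrary pick) will usually shuffle them differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 samples, 2 features, 2 balanced classes
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Same seed -> identical split on every call
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)
same_seed_matches = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# A different seed will usually shuffle the rows differently
split_c = train_test_split(X, y, test_size=0.2, random_state=7)
print("same seed, same split:", same_seed_matches)
print("validation rows with seed 42:", split_a[1][:, 0].tolist())
print("validation rows with seed 7: ", split_c[1][:, 0].tolist())
```

train_test_split also accepts a stratify argument (for example stratify=y) that keeps class proportions similar in both sets. That may be worth trying when some quality levels are rare, though it was not used in this post.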
Step 5: Create and Train the SVM Model
We used the SVM classifier with the RBF kernel (kernel='rbf'). This kernel helps the model deal with nonlinear data by creating curved decision boundaries.
from sklearn.svm import SVC
# RBF kernel with scikit-learn's default C and gamma
model = SVC(kernel='rbf')
model.fit(X_train, y_train)
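One thing worth knowing about the workflow above: the scaler was fitted on the full dataset before splitting, so the validation rows influenced the mean and standard deviation used for scaling. A common alternative is to split first and let a Pipeline fit the scaler on the training rows only. The sketch below shows that variant on synthetic data from make_classification, assumed here as a stand-in for the wine features. It is not the code behind the results in this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine features (not the real dataset)
X, y = make_classification(n_samples=300, n_features=11, n_informative=5,
                           n_classes=3, random_state=42)

# Split first, so the validation rows stay unseen during fitting
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# The pipeline fits the scaler on X_train only, then trains the SVC on the scaled result
model = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
model.fit(X_train, y_train)

scaler = model.named_steps['standardscaler']
fitted_on_train_only = np.allclose(scaler.mean_, X_train.mean(axis=0))
val_accuracy = model.score(X_val, y_val)
print("scaler fitted on training rows only:", fitted_on_train_only)
print("validation accuracy:", val_accuracy)
```

When model.predict or model.score is called, the pipeline applies the training-set scaling to the new rows automatically.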
Step 6: Make Predictions and Evaluate
After training, we used the model to predict the wine quality for the validation set. We calculated accuracy and generated a confusion matrix to understand the model’s performance better.
from sklearn import metrics
predictions = model.predict(X_val)
# Call the functions through the metrics module so the result variables do not shadow them
accuracy_score = metrics.accuracy_score(y_val, predictions)
print("Accuracy Score:\n", accuracy_score)
confusion_matrix = metrics.confusion_matrix(y_val, predictions)
print("Confusion Matrix:\n", confusion_matrix)
classification_report = metrics.classification_report(y_val, predictions, zero_division=0)
print("Classification Report:\n", classification_report)
We also passed zero_division=0 so that precision is reported as 0, without a warning, for any quality label the model never predicts.
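To see what zero_division actually changes, here is a small sketch with made-up labels where qualities 4 and 7 are never predicted. Precision for those classes is 0 divided by 0, so the parameter decides which value gets reported.

```python
from sklearn.metrics import precision_score

# Hypothetical labels: qualities 4 and 7 occur in y_true but are never predicted
y_true = [4, 5, 5, 6, 6, 7]
y_pred = [5, 5, 5, 6, 6, 6]

# Per-class precision for labels 4, 5, 6, 7 (sorted order).
# Precision for 4 and 7 would be 0/0, so zero_division decides what is reported.
as_zero = precision_score(y_true, y_pred, average=None, zero_division=0)
as_one = precision_score(y_true, y_pred, average=None, zero_division=1)

print(as_zero.round(2).tolist())  # [0.0, 0.67, 0.67, 0.0]
print(as_one.round(2).tolist())   # [1.0, 0.67, 0.67, 1.0]
```

Reporting 0 is the more conservative choice here, because it does not flatter classes the model never gets right.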
Step 7: Visualization
Finally, we visualized the confusion matrix using seaborn to see how the model performed across different wine quality levels:
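import matplotlib.pyplot as plt
import seaborn as sns
# Label the axes with the quality levels that occur in y_val or predictions,
# since the validation split may not contain every level present in y
labels = sorted(set(y_val) | set(predictions))
plt.figure(figsize=(8,7))
sns.heatmap(confusion_matrix, annot=True, fmt='d', cmap='Reds', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Quality')
plt.ylabel('Actual Quality')
plt.title('Confusion Matrix')
plt.show()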
Model Performance
Accuracy Score:
0.6593886462882096
Confusion Matrix:
[[ 0 3 3 0 0]
[ 0 72 24 0 0]
[ 0 27 69 3 0]
[ 0 1 15 10 0]
[ 0 0 1 1 0]]
Classification Report:
              precision    recall  f1-score   support
           4       0.00      0.00      0.00         6
           5       0.70      0.75      0.72        96
           6       0.62      0.70      0.65        99
           7       0.71      0.38      0.50        26
           8       0.00      0.00      0.00         2
    accuracy                           0.66       229
   macro avg       0.41      0.37      0.38       229
weighted avg       0.64      0.66      0.64       229
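The numbers in the report follow directly from the confusion matrix. As a sanity check, this plain-Python sketch recomputes accuracy, precision, and recall from the matrix printed above, assuming rows are actual qualities 4 to 8 and columns are predicted qualities in the same order.

```python
# Confusion matrix from the post: rows are actual qualities, columns are predicted ones
labels = [4, 5, 6, 7, 8]
cm = [
    [0, 3, 3, 0, 0],
    [0, 72, 24, 0, 0],
    [0, 27, 69, 3, 0],
    [0, 1, 15, 10, 0],
    [0, 0, 1, 1, 0],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))  # diagonal = correct predictions
accuracy = correct / total

precision, recall = {}, {}
for i, label in enumerate(labels):
    predicted_as_label = sum(row[i] for row in cm)  # column sum
    actually_label = sum(cm[i])                     # row sum (support)
    # Report 0 when a class is never predicted, mirroring zero_division=0
    precision[label] = cm[i][i] / predicted_as_label if predicted_as_label else 0.0
    recall[label] = cm[i][i] / actually_label if actually_label else 0.0

print(f"accuracy: {accuracy:.4f}")  # accuracy: 0.6594
for label in labels:
    print(label, f"precision={precision[label]:.2f}", f"recall={recall[label]:.2f}")
```

The first and last columns of the matrix are all zeros, which means the model never predicted quality 4 or 8. That is why their precision and recall are 0.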
Key Takeaways
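 SVM is a powerful algorithm for classification, especially with the RBF kernel, which handles nonlinear data effectively. Here it reached about 66% accuracy on the validation set.
 Scaling the features is important so that no variable dominates simply because of its larger numerical range.
 The model never predicted the rare quality levels 4 and 8 (8 validation samples in total), so overall accuracy hides how poorly the minority classes are handled.
 More advanced models or tuning the hyperparameters (for example C and gamma) might improve predictions further.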
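As a rough idea of what hyperparameter tuning could look like, here is a sketch using GridSearchCV over the SVC's C and gamma parameters. It runs on synthetic data from make_classification rather than the wine dataset, and the grid values are arbitrary starting points, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the wine dataset)
X, y = make_classification(n_samples=200, n_features=11, n_informative=5,
                           n_classes=3, random_state=42)

pipeline = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

# Small illustrative grid; the values are arbitrary starting points
param_grid = {
    'svc__C': [0.1, 1, 10],
    'svc__gamma': ['scale', 0.01, 0.1],
}

# 3-fold cross-validation over every C/gamma combination
search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

SVC also accepts class_weight='balanced', which may help with rare classes, though I have not tested it on this dataset.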
Gratitude
It was a great learning experience working with SVM today. Looking forward to the next problem.
Stay Tuned!