Day 6 - 30 Days 30 Machine Learning Projects

Mon, Sep 16, 2024
6-minute read

Hey, it’s Day 6 of the 30 Days 30 Machine Learning Projects Challenge. Today’s problem was “Predict wine quality from physicochemical properties using SVM”. This is the 5th classification problem in a row. We will learn what SVM is and how it works, along with feature scaling and a few ways to evaluate a classifier.

If you want to go straight to the code, I’ve uploaded it to this repository GIT REPO

The process will be the same as I briefly explained in the previous progress posts. I’ll use ChatGPT and ask follow-up questions.

Talk about the Problem Please!

Today, we used Support Vector Machines (SVM) to predict the quality of wine based on its physicochemical properties (like acidity, sugar, and alcohol content). The goal was to build a model that could classify wine into different quality categories (from 0 to 10) using SVM.

What is SVM?

Support Vector Machines (SVM) are powerful classifiers that find the boundary (hyperplane) separating the classes with the widest possible margin. Imagine you have data points scattered in space, and you need to separate them into different groups. SVM draws the line (in 2D) or plane (in 3D) that divides these points while staying as far as possible from the nearest points of each group. Those nearest points are called the support vectors.
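To make the idea of a hyperplane concrete, here is a minimal sketch in plain Python. The weights `w` and offset `b` are hand-picked for illustration (they are not learned from the wine data), and the function names are my own, not part of any library.

```python
# Illustrative only: a hand-picked 2D boundary w.x + b = 0, not a trained SVM.

def decision_value(w, b, x):
    """Signed score: positive on one side of the boundary, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Assign class +1 or -1 depending on the side of the boundary."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def distance_to_boundary(w, b, x):
    """Perpendicular distance from x to the boundary."""
    norm = sum(wi * wi for wi in w) ** 0.5
    return abs(decision_value(w, b, x)) / norm

w, b = [1.0, 1.0], -3.0  # hypothetical boundary: x1 + x2 = 3
for point in ([0.0, 0.0], [1.0, 1.0], [3.0, 3.0]):
    print(point, classify(w, b, point), round(distance_to_boundary(w, b, point), 3))
```

Training an SVM amounts to choosing `w` and `b` so that the smallest of these distances over the training points is as large as possible.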

In our case, we used SVC (Support Vector Classification), scikit-learn’s SVM implementation for classification tasks. To handle the non-linearity of our data, we used the RBF (Radial Basis Function) kernel, which creates curved decision boundaries to separate complex data.
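For intuition, the RBF kernel scores how similar two points are based on their squared distance. Below is a minimal sketch of that formula in plain Python. The `gamma` value of 0.5 is an arbitrary choice for illustration, not the value scikit-learn picks by default.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """RBF similarity: exp(-gamma * squared Euclidean distance between x and z)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

same = rbf_kernel([1.0, 2.0], [1.0, 2.0])  # identical points
near = rbf_kernel([0.0, 0.0], [1.0, 0.0])  # one unit apart
far = rbf_kernel([0.0, 0.0], [5.0, 5.0])   # far apart
print(same, round(near, 4), far)
```

Identical points score exactly 1, and the score decays towards 0 as points move apart. That local notion of similarity is what lets the decision boundary curve around clusters.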

Understanding the Data

We used the Wine Quality Dataset from Kaggle, which contains the physicochemical properties of wine and the corresponding quality ratings. The features include attributes like acidity, sugar levels, and alcohol content, and the target is the wine quality score. Download it locally and put it in the dataset directory at the root level of the repository.

Code Workflow

The process was divided into several steps:

Load the data
Preprocess the data
Data Preprocessing: Feature scaling
Split the data into training and validation sets
Create and train the SVM model
Make predictions and evaluate the model
Visualization

Step 1: Load the Data

I loaded the wine quality dataset using pandas:

import pandas as pd
data_df = pd.read_csv('dataset/WineQT.csv', sep=',')

Here’s how it looks:

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality  Id
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5   0
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5   1
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5   2
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6   3
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5   4

Step 2: Preprocess the Data

We separated the features (physicochemical properties) from the target (quality):

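# Features: every column except the target (this keeps the 'Id' column too)
X = data_df.drop('quality', axis=1)
# Target: the quality score
y = data_df['quality']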

Step 3: Data Preprocessing: Scaling the Data

To make sure all features contribute equally, we applied StandardScaler to standardize the data. Feature scaling transforms the data so that all features (variables) are on a similar scale, which keeps any one feature from dominating the others simply because its values are numerically larger. This matters for an RBF-kernel SVM in particular, because the kernel is computed from distances between points.

More technically, StandardScaler transforms each feature to have a mean of 0 and a standard deviation of 1.

Let’s understand scaling with an example:

Suppose we have a small dataset with two features: height and weight. The two features are measured in different units, and the height values (in cm) are numerically much larger than the weight values (in kg).

Height (cm): 160, 170, 150, 180, 175
Weight (kg): 65, 70, 55, 85, 75

Before Scaling

Let’s calculate the mean and standard deviation for each feature. StandardScaler uses the population standard deviation (dividing by n, not n - 1):

Height:

Mean: 167
Standard Deviation: 10.77

Weight:

Mean: 70
Standard Deviation: 10

After Applying `StandardScaler`:

For each value, we use the formula:

Scaled Value = (Original Value - Mean) / Standard Deviation

For example,

For height 160, scaled value will be (160 - 167) / 10.77 ≈ -0.65
For weight 65, scaled value will be (65 - 70) / 10 = -0.5

Here’s how the scaled values would look:

Heights: -0.65, 0.28, -1.58, 1.21, 0.74
Weights: -0.5, 0, -1.5, 1.5, 0.5

Visualizing the Output

Original Data:
- Height ranges from 150 to 180 cm.
- Weight ranges from 55 to 85 kg.
After Scaling:
- The transformed height and weight values are now centered around 0, and their standard deviations are 1.
- This ensures that the data has zero mean and unit variance, meaning all features are on the same scale.
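If you want to check the height and weight example by hand, here is a small sketch using only Python’s standard library. It uses the population standard deviation (`pstdev`), which is the convention StandardScaler follows. The `standardize` helper is my own name for illustration.

```python
from statistics import mean, pstdev

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # divides by n, not n - 1
    return [(v - mu) / sigma for v in values]

heights = [160, 170, 150, 180, 175]
weights = [65, 70, 55, 85, 75]

scaled_heights = standardize(heights)
scaled_weights = standardize(weights)

print([round(v, 2) for v in scaled_heights])
print([round(v, 2) for v in scaled_weights])
```

After the transformation, both lists have mean 0 and standard deviation 1, so height and weight carry equal weight in any distance calculation.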

Now, let’s code it up:

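from sklearn.preprocessing import StandardScaler
# Fit on X, then rescale each column to mean 0 and standard deviation 1
standard_scale = StandardScaler()
X_scaled = standard_scale.fit_transform(X)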

Step 4: Split the Data

We divided the data into an 80-20 ratio: training (80%) and validation (20%) sets using:

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

Here, random_state=42 sets the seed for randomness. This ensures the same split occurs on every run. The number 42 is commonly used but has no special meaning.
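To see why a fixed seed gives the same split every time, here is a toy sketch in plain Python. It only mirrors the idea behind a seeded shuffle. It is not scikit-learn’s implementation and will not pick the same rows as train_test_split. The `simple_split` helper is hypothetical.

```python
import math
import random

def simple_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a seeded generator, then cut off a validation slice."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # same seed, same shuffle order
    n_val = math.ceil(len(rows) * test_size)  # validation size, rounded up
    val = [rows[i] for i in indices[:n_val]]
    train = [rows[i] for i in indices[n_val:]]
    return train, val

rows = list(range(10))
first = simple_split(rows, seed=42)
second = simple_split(rows, seed=42)

print(first == second)               # True: the split is reproducible
print(len(first[0]), len(first[1]))  # 8 2
```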

Step 5: Create and Train the SVM Model

We used the SVM classifier with the RBF kernel (kernel='rbf'), which is also SVC’s default kernel. This kernel helps the model deal with non-linear data by creating curved decision boundaries.

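from sklearn.svm import SVC
# RBF-kernel support vector classifier; other hyperparameters stay at their defaults
model = SVC(kernel='rbf')
model.fit(X_train, y_train)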

Step 6: Make Predictions and Evaluate

After training, we used the model to predict the wine quality for the validation set. We calculated accuracy and generated a confusion matrix to understand the model’s performance better.

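from sklearn import metrics

# Predict quality for the validation set
predictions = model.predict(X_val)

# Calling through the metrics module keeps the result names from shadowing the functions
accuracy_score = metrics.accuracy_score(y_val, predictions)
print("Accuracy Score:\n", accuracy_score)

confusion_matrix = metrics.confusion_matrix(y_val, predictions)
print("Confusion Matrix:\n", confusion_matrix)

classification_report = metrics.classification_report(y_val, predictions, zero_division=0)
print("Classification Report:\n", classification_report)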

We also passed zero_division=0 so that when a quality label is never predicted, its precision (otherwise an undefined 0 / 0) is reported as 0 without raising a warning.
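To show what happens under the hood, here is a rough sketch in plain Python of how a confusion matrix is counted and why precision needs a fallback for classes that are never predicted. The labels below are made up for illustration, and the helper names are my own rather than scikit-learn’s.

```python
def build_confusion_matrix(y_true, y_pred):
    """Rows are actual classes, columns are predicted classes."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return labels, matrix

def precision(matrix, col, zero_division=0.0):
    """Correct predictions for a class divided by all predictions of that class."""
    predicted_total = sum(row[col] for row in matrix)
    if predicted_total == 0:  # class never predicted: 0 / 0 is undefined
        return zero_division
    return matrix[col][col] / predicted_total

# Made-up labels for illustration (not taken from the wine dataset)
y_true = [5, 5, 6, 6, 7, 4]
y_pred = [5, 6, 6, 6, 6, 5]

labels, matrix = build_confusion_matrix(y_true, y_pred)
for i, label in enumerate(labels):
    print(label, precision(matrix, i))
```

In this toy data, classes 4 and 7 are never predicted, so their precision falls back to the zero_division value instead of dividing by zero.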

Step 7: Visualization

Finally, we visualized the confusion matrix using seaborn to see how the model performed across different wine quality levels:

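import matplotlib.pyplot as plt
import seaborn as sns

# Tick labels must match the classes in the matrix: those seen in y_val or predictions
labels = sorted(set(y_val) | set(predictions))
plt.figure(figsize=(8,7))
sns.heatmap(confusion_matrix, annot=True, fmt='d', cmap='Reds', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Quality')
plt.ylabel('Actual Quality')
plt.title('Confusion Matrix')
plt.show()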

Model Performance

Accuracy Score:
 0.6593886462882096
Confusion Matrix:
 [[ 0  3  3  0  0]
 [ 0 72 24  0  0]
 [ 0 27 69  3  0]
 [ 0  1 15 10  0]
 [ 0  0  1  1  0]]
Classification Report:
               precision    recall  f1-score   support

           4       0.00      0.00      0.00         6
           5       0.70      0.75      0.72        96
           6       0.62      0.70      0.65        99
           7       0.71      0.38      0.50        26
           8       0.00      0.00      0.00         2

    accuracy                           0.66       229
   macro avg       0.41      0.37      0.38       229
weighted avg       0.64      0.66      0.64       229
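As a sanity check, the headline numbers in the report can be recomputed from the confusion matrix printed above. This is a small sketch in plain Python. The matrix values and the labels 4 to 8 are copied from the output, and everything else is simple arithmetic.

```python
# Confusion matrix copied from the output above (rows: actual, columns: predicted)
labels = [4, 5, 6, 7, 8]
matrix = [
    [0, 3, 3, 0, 0],
    [0, 72, 24, 0, 0],
    [0, 27, 69, 3, 0],
    [0, 1, 15, 10, 0],
    [0, 0, 1, 1, 0],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

# Recall per class: correct predictions divided by the number of actual samples
recall = {label: matrix[i][i] / sum(matrix[i]) for i, label in enumerate(labels)}
macro_recall = sum(recall.values()) / len(recall)

print(total, correct, round(accuracy, 4))
print({label: round(r, 2) for label, r in recall.items()})
print(round(macro_recall, 2))
```

The diagonal holds 151 correct predictions out of 229, which gives the 0.66 accuracy. The macro average is low because classes 4 and 8 contribute a recall of 0 each.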

Key Takeaways

SVM with the RBF kernel can model non-linear decision boundaries; here it reached about 66% accuracy on the validation set.
Scaling the features is important so that no variable dominates just because of its numerical range.
The model never predicted the rare qualities 4 and 8 (8 of the 229 validation samples), so tuning hyperparameters such as C and gamma, or trying more advanced models, might improve predictions further.

Gratitude

It was a great learning experience working with SVM today. Looking forward to the next problem.

Stay Tuned!
