Day 1 - 30 Days 30 Machine Learning Projects
Today is the first day of the 30 days 30 ML projects challenge. I got up at 5:30 am, which was 30 minutes later than I planned.
I recorded my screen to keep track of what I did, which helped me write this post. Now I’m thinking about posting it on YouTube like a series of development logs. I’ll share the video link at the end so you can see how I start from scratch. I think that’s pretty cool.
If you want to go straight to the code, I've uploaded it to this repository: GIT REPO
Flow
I planned to read blogs and tutorials for reference. Then, I realized that I could use ChatGPT.
I asked ChatGPT to help solve the problem, telling it to assume I have a basic understanding of Machine Learning and to start with simple models, getting more complex as we go.
I’m going to use the same context window for future problems. This way, I can make the most of ChatGPT without having to re-explain my background and preferences each time.
I typed out each line of the code myself, actually copying it, and made changes where needed. If I didn’t understand something, I asked ChatGPT to clarify. This way, I’m learning and will be able to write code on my own for future problems.
Talk about the Problem Please!!
The challenge for day one was to “Predict house prices using Simple Linear Regression”. It is a classic problem in machine learning.
Packages Required
I installed the necessary packages. Here’s what you need to set up:
pip install pandas scikit-learn matplotlib numpy
Why is it a Linear Regression Problem?
It is clearly a regression problem because a house price is a continuous value, not a label from a fixed set of categories.
I chose the Linear Regression model for its simplicity and ease of implementation. Unlike more complex models, it requires very little data preprocessing. This makes it an excellent choice for a straightforward Day 1 project.
Understanding the Data
I am using fetch_california_housing from sklearn.datasets. The California Housing dataset is a well-known dataset that contains data about houses in California. It includes various features, but for the simplicity of this example, we’ll focus on two key variables:
MedInc: Median income in the block group
MedHouseVal: Median house value for California districts (target variable)
Boston Housing from Kaggle is another excellent option for acquiring a suitable dataset for this problem.
The Code Workflow
The workflow involves six major steps:
- Loading the dataset
- Selecting features and target
- Splitting the dataset
- Creating and training the model
- Evaluating the model’s performance
- Visualizing the results
Let’s dive into each step:
Step 1: Load the Dataset
I used fetch_california_housing from sklearn.datasets. I set the parameter as_frame to True to get the data as a Pandas DataFrame, which makes the data easy to explore; for example, the head() function shows the table structure with the top 5 rows.
from sklearn.datasets import fetch_california_housing

california_housing = fetch_california_housing(as_frame=True)
california_housing_df = california_housing.frame
Step 2: Select Features and Target
In Simple Linear Regression, we predict the outcome based on a single feature. Here, I’m using median income (MedInc) as our feature, stored in X, and predicting the median house value (MedHouseVal) as our target y.
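As a sketch, the selection looks like this (loading the data again so the snippet runs on its own). Note that X is kept two-dimensional, since scikit-learn expects a 2D feature matrix:

```python
from sklearn.datasets import fetch_california_housing

# Load the dataset as a pandas DataFrame
california_housing = fetch_california_housing(as_frame=True)
df = california_housing.frame

# Simple Linear Regression uses exactly one feature.
X = df[["MedInc"]]     # median income in the block group (2D: DataFrame)
y = df["MedHouseVal"]  # median house value, the target (1D: Series)
```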
Step 3: Split the Dataset
I split the data into a training set (80%) and a validation set (20%).
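A minimal sketch of the split with train_test_split. The random_state value here is my own choice for reproducibility; the post doesn't state which seed was used:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

df = fetch_california_housing(as_frame=True).frame
X = df[["MedInc"]]
y = df["MedHouseVal"]

# 80% training, 20% validation; fixed seed so results are repeatable
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```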
Step 4: Create and Train the Model
Create an instance of the LinearRegression model and train it using the fit method on the training data.
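A sketch of this step, repeating the earlier loading and splitting so it runs on its own (random_state=42 is my assumption):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = fetch_california_housing(as_frame=True).frame
X = df[["MedInc"]]
y = df["MedHouseVal"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create the model and fit it to the training data
model = LinearRegression()
model.fit(X_train, y_train)
```

After fitting, the learned line is available via model.coef_ (slope) and model.intercept_.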
Step 5: Evaluation
After training, I used Root Mean Squared Error (RMSE) to evaluate the accuracy of the model. Here is the result:
The Root Mean Squared error is: 0.8420901241414454
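A sketch of the evaluation, computing RMSE as the square root of the mean squared error. Since the split seed here (random_state=42) is my assumption, the exact RMSE may differ slightly from the value above:

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

df = fetch_california_housing(as_frame=True).frame
X = df[["MedInc"]]
y = df["MedHouseVal"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_val)

# RMSE: square root of the mean squared error on the validation set
rmse = np.sqrt(mean_squared_error(y_val, predictions))
print(f"The Root Mean Squared error is: {rmse}")
```

RMSE is in the same units as the target, so here it roughly means the predictions are off by about 0.84 (in units of $100,000) on average.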
Step 6: Visualization
I used the matplotlib.pyplot package to plot the true median house values against the predicted values to see how well the model performed.
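One way to sketch this plot: scatter the true validation values and overlay the fitted regression line (the Agg backend and random_state=42 are my choices, so the figure renders headlessly and reproducibly):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = fetch_california_housing(as_frame=True).frame
X = df[["MedInc"]]
y = df["MedHouseVal"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_val)

# True values as a scatter, predictions as the fitted line
plt.scatter(X_val, y_val, s=5, alpha=0.3, label="True values")
plt.plot(X_val, predictions, color="red", label="Predicted")
plt.xlabel("Median Income (MedInc)")
plt.ylabel("Median House Value (MedHouseVal)")
plt.legend()
plt.savefig("day1_regression_plot.png")
```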
Gratitude
I finished in 1 hour, which was faster than I planned. I am really happy with this progress and excited to continue the challenge without missing a day.
Stay Tuned!!
Video