Day 9 - 30 Days 30 Machine Learning Projects

Problem: Forecasting weather with Simple Linear Regression on time series data

Hey, it’s Day 9 of the 30 Days 30 Machine Learning Projects Challenge. Today’s challenge was about forecasting weather using a Simple Linear Regression model. The goal was to predict future temperatures from historical temperature data. Let’s break it down step by step and see how the model performed.

If you want to see the code, you can find it here: GIT REPO.

Understanding the Data

We used the Daily Temperature of Major Cities dataset from Kaggle. It contains daily temperature readings from cities around the world. For this project, we filtered the data to India and used Date and AvgTemperature (average temperature) as the primary columns for forecasting. Download and unzip the dataset, then put city_temperature.csv in the dataset directory at the root level of your project.

The data spans multiple years, and the challenge was to predict the temperature for future dates based on past data using a time series approach.

Step-by-Step Code Workflow

The code was broken down into the following steps:

Step 1: Load the Data

We started by loading the dataset using pandas. Since the data is large, we set low_memory=False to avoid mixed-type warnings during loading.

import pandas as pd
data = pd.read_csv('dataset/city_temperature.csv', low_memory=False)

Step 2: Filter Data for India

We filtered the dataset for India and dropped invalid temperature readings by keeping only rows where AvgTemperature > -99. The filter treats -99 as a placeholder for a missing reading.

india_data = data[(data['Country'] == 'India') & (data['AvgTemperature'] > -99)].copy()
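To see what the filter does, here is a minimal sketch on a made-up frame. The rows are invented for illustration, and it assumes, as the filter above implies, that -99 marks a missing reading.

```python
import pandas as pd

# Toy rows for illustration only; the values are invented.
toy = pd.DataFrame({
    "Country": ["India", "India", "Nepal", "India"],
    "AvgTemperature": [82.4, -99.0, 60.1, 75.3],
})

# Keep Indian rows whose temperature is above the -99 placeholder.
mask = (toy["Country"] == "India") & (toy["AvgTemperature"] > -99)
toy_india = toy[mask].copy()

print(len(toy_india))  # 2 rows survive: 82.4 and 75.3
```

The .copy() call gives an independent frame, so adding columns later does not touch the original data.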

Step 3: Combine Date Columns

Next, we combined the Year, Month, and Day columns into a single Date column to create a proper time series.

india_data.loc[:, 'Date'] = pd.to_datetime(india_data[['Year', 'Month', 'Day']])
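As a small illustration of how pd.to_datetime assembles dates from separate columns, here is a sketch on two invented rows. The frame name toy is hypothetical.

```python
import pandas as pd

# Two invented rows with the same column names as the dataset.
toy = pd.DataFrame({"Year": [2019, 2019], "Month": [12, 12], "Day": [30, 31]})

# to_datetime builds one datetime per row from the year/month/day columns.
toy["Date"] = pd.to_datetime(toy[["Year", "Month", "Day"]])

print(toy["Date"].dt.strftime("%Y-%m-%d").tolist())  # ['2019-12-30', '2019-12-31']
```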

Step 4: Select Relevant Columns

We only kept the relevant columns — Date and AvgTemperature — for our analysis.

rel_india_data = india_data[['Date', 'AvgTemperature']]

Step 5: Preprocess the Data

We removed any missing values to ensure clean data, then converted the Date into a numeric format using ordinal numbers. This allowed our Linear Regression model to work with time as a feature.

rel_india_data = rel_india_data.dropna()
rel_india_data['Date_ordinal'] = rel_india_data['Date'].map(pd.Timestamp.toordinal)

X = rel_india_data[['Date_ordinal']] # Feature
y = rel_india_data['AvgTemperature'] # Target
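The ordinal conversion turns each date into an integer day number, so the gap between two values is the number of days between the dates. A quick sketch with invented dates:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2020-01-01", "2020-01-02", "2020-02-01"]))

# toordinal() returns the day number in the proleptic Gregorian calendar,
# so consecutive days differ by exactly 1.
ordinals = dates.map(pd.Timestamp.toordinal)

print(ordinals.diff().tolist())  # [nan, 1.0, 30.0]
```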

Step 6: Train-Test Split

We split the data into training (80%) and validation (20%) datasets using train_test_split.

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
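Note that train_test_split shuffles rows by default, so validation dates end up interleaved with training dates. For forecasting, a chronological split is a common alternative because it validates on dates the model has not seen. The sketch below uses synthetic data and is one possible variation, not what the project code does.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real feature and target (values are invented).
frame = pd.DataFrame({
    "Date_ordinal": np.arange(737000, 737100),
    "AvgTemperature": np.linspace(70.0, 90.0, 100),
}).sort_values("Date_ordinal")

# Hold out the most recent 20% instead of a random 20%.
cutoff = int(len(frame) * 0.8)
train, val = frame.iloc[:cutoff], frame.iloc[cutoff:]

print(len(train), len(val))  # 80 20
```

When the rows are already sorted by date, train_test_split(X, y, test_size=0.2, shuffle=False) produces the same kind of split.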

Step 7: Build and Train the Model

We trained a Simple Linear Regression model using the training data. The model tries to fit a straight line to the relationship between time and temperature.

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
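Once fitted, the model is just a slope and an intercept, and forecasting a future date means converting that date to an ordinal and calling predict. Here is a minimal sketch on synthetic data where the temperature rises by exactly 0.01 degrees per day. The names X_demo, y_demo and demo_model are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: temperature rises exactly 0.01 degrees per day (invented).
dates = pd.date_range("2019-01-01", periods=365, freq="D")
X_demo = pd.DataFrame({"Date_ordinal": [d.toordinal() for d in dates]})
y_demo = 70.0 + 0.01 * (X_demo["Date_ordinal"] - X_demo["Date_ordinal"].iloc[0])

demo_model = LinearRegression().fit(X_demo, y_demo)
print(round(demo_model.coef_[0], 4))  # slope in degrees per day, about 0.01

# Forecast a date beyond the training range by converting it the same way.
future = pd.DataFrame({"Date_ordinal": [pd.Timestamp("2020-01-01").toordinal()]})
forecast = demo_model.predict(future)[0]
print(round(forecast, 2))  # about 73.65 (70 + 0.01 * 365)
```

On real weather data the slope only reflects the long-term drift, so forecasts from this kind of model sit close to the overall average.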

Step 8: Make Predictions and Evaluate

The model made predictions on the validation data. We evaluated its performance using Mean Squared Error (MSE), the average of the squared differences between predicted and actual values. Lower is better.

from sklearn.metrics import mean_squared_error

predictions = model.predict(X_val)
mse = mean_squared_error(y_val, predictions)  # separate name so the function is not shadowed
print(mse)
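MSE is expressed in squared temperature units, which can be hard to read. Two companion metrics are often reported alongside it: RMSE (the square root of MSE, in the same units as the target) and R² (the share of variance explained). The sketch below uses invented numbers to show how they relate.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented values for illustration.
actual = np.array([80.0, 85.0, 90.0, 75.0])
predicted = np.array([82.0, 83.0, 88.0, 77.0])

mse = mean_squared_error(actual, predicted)  # mean of squared errors
rmse = np.sqrt(mse)                          # back in temperature units
r2 = r2_score(actual, predicted)             # 1.0 would be a perfect fit

print(mse, rmse)  # 4.0 2.0
```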

Step 9: Visualization

We visualized the model’s predictions against the actual temperatures. The blue dots represent actual temperatures, while the red line represents the predicted temperatures.

import matplotlib.pyplot as plt
plt.figure(figsize=(7, 5))
plt.scatter(X_val, y_val, color='blue', label='Actual Temperature')
plt.plot(X_val, predictions, color='red', linewidth=2, label='Predicted Temperature')
plt.xlabel('Date (Ordinal)')
plt.ylabel('Temperature')
plt.title('Weather Forecast for India: Actual vs Predicted')
plt.legend()
plt.show()

Model Performance

The model achieved a score of 74%, which is moderate. The evaluation step above reports MSE rather than a percentage, so treat this figure as a rough indicator of fit. Performance is not ideal because weather patterns are complex, and a linear model often fails to capture them.

  • Prediction Line: The predicted temperatures were nearly constant, as seen in the plot. This is a limitation of the linear model, which struggles to capture non-linear, seasonal trends in weather data.

[Figure: Day 9 scatter plot]

What Can We Do?

To improve the forecast, here are a few options to explore:

  • Introduce Complexity: We might need more sophisticated models like Polynomial Regression (to capture nonlinear trends) or Time Series Models like ARIMA or Prophet, which can account for seasonal patterns.
  • Add More Features: Simple Linear Regression here is based only on the date. Adding features such as the previous day’s temperature, humidity, or atmospheric pressure might help the model capture more intricate weather patterns.
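One inexpensive way to combine both ideas is to keep the linear model but give it features that encode the yearly cycle, such as the sine and cosine of the day of year. The sketch below is an assumption-laden illustration on synthetic, noise-free data, not a tuned forecast; the variable names and the sine and cosine features are hypothetical choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic daily temperatures with a yearly cycle (invented, no noise).
dates = pd.date_range("2015-01-01", "2018-12-31", freq="D")
day_of_year = dates.dayofyear.to_numpy()
angle = 2 * np.pi * day_of_year / 365.25
temps = 80.0 + 10.0 * np.sin(angle)

# Baseline: date ordinal only, as in the steps above.
ordinal = np.array([d.toordinal() for d in dates]).reshape(-1, 1)
baseline_r2 = LinearRegression().fit(ordinal, temps).score(ordinal, temps)

# Hypothetical seasonal features: sine and cosine of the yearly angle.
seasonal = np.column_stack([ordinal.ravel(), np.sin(angle), np.cos(angle)])
seasonal_r2 = LinearRegression().fit(seasonal, temps).score(seasonal, temps)

# R² on the training data: low for the baseline, close to 1 with seasonal features.
print(round(baseline_r2, 3), round(seasonal_r2, 3))
```

Real temperature data is noisier than this, so the gain will be smaller, but the comparison shows why a date-only line stays nearly flat while seasonal features can follow the cycle.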

Key Takeaways

Simple Linear Regression can capture basic trends, but it struggles with complex, seasonal data like weather.

Gratitude

Working on weather forecasting with time series data was a great experience. It highlighted the limitations of linear models for complex patterns like weather and gave me insight into how we can tackle such problems using more advanced techniques.

Stay tuned!