Project Objective: Predict future bike demand.
First I tried looking at the bigger picture by
A) Understanding the business objective and how the created solution will help the business.
B) Framing the problem: we will be using supervised regression, as we need to predict the bike count (‘cnt’), which is a continuous variable, and we already have the input features and the expected labels in the dataset, which makes it a supervised problem.
I figured out that I needed to build different models and choose the best one based on the chosen performance metric.
C) I chose Root Mean Square Error (RMSE) as the performance metric. RMSE is a typical performance measure for regression problems: it is the square root of the Mean Squared Error (MSE), which is the average of the squared errors of the predictions.
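As a quick worked example (with made-up numbers, not the project's data), RMSE can be computed directly with NumPy:

```python
import numpy as np

# RMSE: square root of the average of the squared prediction errors.
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

mse = np.mean((y_true - y_pred) ** 2)   # (4 + 4 + 9) / 3
rmse = np.sqrt(mse)
print(round(rmse, 4))  # 2.3805
```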
In this project, I used BootML to train models with three different algorithms, namely Linear Regression, Decision Tree and Random Forest, with the goal of selecting the most accurate one.
Import the data:
- Load the dataset from the specified path ‘/cxldata/datasets/bootml/Bikes_Data_1’.
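A minimal loading sketch with pandas, assuming the dataset is a CSV with the standard bike-sharing columns; a small in-memory CSV stands in for the file at that path so the snippet runs anywhere:

```python
import io
import pandas as pd

# In the project the data lives at '/cxldata/datasets/bootml/Bikes_Data_1';
# here two sample rows stand in for it.
csv_text = """instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.80,0.0,8,32,40
"""
bikes = pd.read_csv(io.StringIO(csv_text))
print(bikes.shape)  # (2, 17)
```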
Analyze the data:
- Explore the dataset to understand its structure and features.
Drop the irrelevant fields:
- Discard unwanted fields such as ‘instant’, ‘dteday’, ‘atemp’, ‘casual’, ‘registered’ to reduce complexity.
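A sketch of the drop step with pandas, using a toy frame in place of the loaded dataset. Note that ‘casual’ and ‘registered’ sum to the label ‘cnt’, so keeping them would leak the answer into the features:

```python
import pandas as pd

# Toy frame with the raw columns; in the project this comes from the loaded dataset.
bikes = pd.DataFrame({
    'instant': [1, 2], 'dteday': ['2011-01-01', '2011-01-01'],
    'atemp': [0.2879, 0.2727], 'casual': [3, 8], 'registered': [13, 32],
    'hr': [0, 1], 'temp': [0.24, 0.22], 'cnt': [16, 40],
})

# Identifiers, the redundant 'atemp', and the leakage-prone
# 'casual'/'registered' columns are discarded.
bikes = bikes.drop(columns=['instant', 'dteday', 'atemp', 'casual', 'registered'])
print(list(bikes.columns))  # ['hr', 'temp', 'cnt']
```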
Understand the data:
- Identify the features and the target variable(label).
- Here, our features were ‘season’, ‘yr’, ‘mnth’, ‘hr’, ‘holiday’, ‘weekday’, ‘workingday’, ‘weathersit’,
‘temp’, ‘hum’, ‘windspeed’ while our label was ‘cnt’.
- Understand the distribution and relationship between variables.
Split the data into train and test:
- Divide the dataset into training and testing sets, typically using an 80:20 ratio.
- I stratified the data based on weathersit.
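One way to sketch this split, assuming scikit-learn's `train_test_split` with its `stratify` parameter (synthetic data stands in for the real frame):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 100
bikes = pd.DataFrame({
    'weathersit': rng.integers(1, 4, n),
    'temp': rng.random(n),
    'cnt': rng.integers(0, 500, n),
})

# 80:20 split, stratified on 'weathersit' so both sets see the
# same mix of weather conditions.
train_set, test_set = train_test_split(
    bikes, test_size=0.2, random_state=42, stratify=bikes['weathersit'])
print(len(train_set), len(test_set))  # 80 20
```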
Analyze the data through visualizations:
- Plot graphs to visualize the relationships between different variables and gain a deeper understanding of the contributing factors.
- Identify any patterns or trends in the data.
- After visualizing the data, it was found that the factors negatively affecting bike demand were ‘holiday’, ‘weathersit’ and ‘hum’.
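The sign of each feature's correlation with ‘cnt’ gives a quick numerical check of what the plots show. A sketch on synthetic data with a built-in positive temperature effect and negative humidity effect:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
temp = rng.random(n)
hum = rng.random(n)
# Synthetic demand: rises with temperature, falls with humidity.
cnt = 200 * temp - 100 * hum + rng.normal(0, 10, n)
bikes = pd.DataFrame({'temp': temp, 'hum': hum, 'cnt': cnt})

# Correlation of each feature with the label mirrors the scatter-plot
# reading: positive for temp, negative for hum.
corr = bikes.corr()['cnt'].drop('cnt')
print(corr['temp'] > 0, corr['hum'] < 0)  # True True
```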
Data Cleaning & Preprocessing:
- Clean the data by handling missing values using techniques like imputation.
- Scikit-Learn’s Imputer class (SimpleImputer in current versions) is used to fill missing values: each missing entry is replaced with a statistic such as the mean or median of its column. If a field has too many missing values, the affected rows or the entire column can be dropped instead.
- The text (categorical) features are set aside first, because the imputer works on numerical attributes only.
- I selected “median” as the imputation strategy.
- Scale the numerical features using standard feature scaling (standardization) to bring them to a similar scale. All the numerical values, such as ‘yr’, ‘hr’, ‘temp’, ‘hum’ and ‘windspeed’, are scaled.
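These two preprocessing steps can be chained in a scikit-learn Pipeline; a minimal sketch using SimpleImputer (the current name of the Imputer class) on a toy numeric array:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric columns (hr, temp, hum) with one missing humidity value.
X = np.array([
    [0, 0.24, 0.81],
    [1, 0.22, np.nan],
    [2, 0.30, 0.60],
])

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # fill NaN with column median
    ('scaler', StandardScaler()),                   # zero mean, unit variance
])
X_prepared = num_pipeline.fit_transform(X)
print(np.isnan(X_prepared).any())  # False
```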
Our data is now ready to be fed to the ML algorithms. We will feed it to the different models and figure out which one scores best in terms of the chosen performance metric.
- Train the model
- Train machine learning models using different algorithms like Linear Regression, Decision Tree, and Random Forest.
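A sketch of this comparison using cross-validated RMSE on synthetic data (the hyperparameters here are illustrative, not the project's exact configuration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
X = rng.random((120, 3))
y = 100 * X[:, 0] + 50 * X[:, 1] + rng.normal(0, 5, 120)

models = {
    'linear': LinearRegression(),
    'tree': DecisionTreeRegressor(random_state=42),
    'forest': RandomForestRegressor(n_estimators=30, random_state=42),
}
for name, model in models.items():
    # Negated MSE is scikit-learn's convention; flip the sign before sqrt.
    scores = cross_val_score(model, X, y,
                             scoring='neg_mean_squared_error', cv=5)
    rmse_scores = np.sqrt(-scores)
    print(name, rmse_scores.mean().round(2), rmse_scores.std().round(2))
```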
After training the different ML models, we obtain the following results (the training-set RMSE, plus the mean and standard deviation of the cross-validation RMSE scores):
Linear Regression model:
RMSE: 142.72311067462795
Mean: 142.77813742724223
Standard deviation: 3.7030212904153004
Decision Tree model:
RMSE: 0.5989453436724405
Mean: 59.78945853931843
Standard deviation: 3.251904916375031
Random Forest model:
RMSE: 15.984077447144026
Mean: 43.49698996838114
Standard deviation: 2.4391843492444463
- Fine-tune the model
- Tune hyperparameters of the models using techniques like GridSearchCV or RandomizedSearchCV to improve performance.
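A grid search along these lines can be sketched as follows (the parameter grid is illustrative, and synthetic data stands in for the prepared training set):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(7)
X = rng.random((100, 3))
y = 100 * X[:, 0] + 50 * X[:, 1] + rng.normal(0, 5, 100)

# Try every combination of these hyperparameters with 3-fold CV.
param_grid = {'n_estimators': [10, 30], 'max_features': [1, 2, 3]}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42), param_grid,
    scoring='neg_mean_squared_error', cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)
```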
The best hyperparameter combination:
grid_search.best_params_ : {‘max_features’: 8, ‘n_estimators’: 30}
Importance score of each attribute:
[(0.5844748088424341, ‘hr’),
(0.12946274868458668, ‘temp’),
(0.07775038801366478, ‘yr’),
(0.05910890006087216, ‘workingday’),
(0.039755394080211844, ‘hum’),
(0.028163876532795794, ‘season’),
(0.027393454181294748, ‘weekday’),
(0.01926425477094257, ‘mnth’),
(0.01890966484897769, ‘weathersit’),
(0.013190030433310636, ‘windspeed’),
(0.0025264795509088818, ‘holiday’)]
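The ranking above comes from pairing the fitted forest's `feature_importances_` with the column names; a sketch on synthetic data where the first feature (standing in for ‘hr’) dominates the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.random((150, 3))
# The first column drives the target far more strongly than the others.
y = 10 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 150)

forest = RandomForestRegressor(n_estimators=30, random_state=42).fit(X, y)
features = ['hr', 'temp', 'windspeed']
# Sort (importance, name) pairs from most to least important.
ranking = sorted(zip(forest.feature_importances_, features), reverse=True)
print(ranking[0][1])  # hr
```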
This indicates that ‘hr’ and ‘temp’ are the strongest contributors to bike demand.
Upon analysis, it’s clear that while the Decision Tree model initially appears favorable due to its low training RMSE, further examination reveals significant overfitting, indicated by the large gap between its training RMSE and its cross-validation mean. Consequently, we discard this model.
Turning to the Linear Regression and Random Forest models: the former shows close agreement between its training RMSE and cross-validation mean, but it suffers from a high error overall. The Random Forest model, although it overfits to some degree, achieves a much lower cross-validation error and can be improved further with more training data and tuning.
In summary, despite the Linear Regression model’s closer agreement between training and cross-validation error, the Random Forest model holds the greater potential for delivering accurate predictions.
Best Estimator : Random Forest
- Evaluate the model on the test set
- Finally, I validated the model on the test data to confirm its ability to generalize.
Final RMSE : 41.96929332427234
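The final evaluation step can be sketched as follows, again on synthetic data rather than the project's test set (so the RMSE printed here is not comparable to the value above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.random((200, 3))
y = 100 * X[:, 0] + 50 * X[:, 1] + rng.normal(0, 5, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Refit the best estimator on the full training set, then score it once
# on held-out data to estimate generalization error.
final_model = RandomForestRegressor(
    n_estimators=30, max_features=2, random_state=42).fit(X_train, y_train)
final_rmse = np.sqrt(mean_squared_error(y_test, final_model.predict(X_test)))
print(final_rmse > 0)  # True
```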