Predict the bike demand in future


#1

Build a model to predict future bike demand. This project is part of the “AI and Machine Learning for Managers” course. Find more details about the project here.

After you have trained the model, document the RMSE and your learnings here. Be as detailed as possible.

Your project might be graded in the course and we will publish a few solutions in our blog mentioning your name.

All the best and happy learning!


#2

In response to Predict the bike demand in future task:

Following the checklist for approaching ML projects, I used BootML to build, select and fine-tune the model as follows:

  1. I looked at the big picture and considered which factors could affect bike demand. I concluded that weather conditions such as temperature and rain could, as well as price, traffic and road safety.
  2. I gathered the data and saw that it has features that seemed relevant, such as season and temperature, and fields that seemed irrelevant, such as the record index.
  3. I explored the data and confirmed my assumption that demand correlates positively with temperature and season. I also found a negative correlation with humidity.
  4. I prepared the data for ML by replacing missing values with the median, stratifying the data by season and performing feature scaling.
  5. I built 3 ML models, using the Linear Regression, Decision Tree and Random Forest algorithms respectively. I obtained the following results for RMSE on the training set and mean cross-validation RMSE:

Linear Regression: RMSE 142; Cross-validation RMSE (mean): 142
Decision Tree: RMSE 0.45; Cross-validation RMSE (mean): 61
Random Forest: RMSE 19; Cross-validation RMSE (mean): 46

From these results, I concluded that the Linear Regression model didn’t perform well, as it had a high RMSE on both the training set and the cross-validation folds. The Decision Tree showed a very large gap between the two (0.45 versus 61), which indicates overfitting: it failed to generalize during cross-validation. I selected the Random Forest as the best of the 3 models in this case, as it had the lowest mean cross-validation RMSE (46).
6) BootML also selected this model for fine-tuning and performed hyperparameter tuning using Grid Search, which I had selected earlier.
7) The model reported the following feature importance scores:

[(0.587627766224623, 'hr'),
(0.12311873416414294, 'temp'),
(0.08328849280662341, 'yr'),
(0.06137417658730989, 'workingday'),
(0.03907266576921952, 'hum'),
(0.026561816940300353, 'season'),
(0.023726472869670133, 'weekday'),
(0.02209473943702688, 'mnth'),
(0.01769931667812533, 'weathersit'),
(0.013187712009321615, 'windspeed'),
(0.002248106513636926, 'holiday')]

which indicates that hr and temp contribute the most to the predicted demand.
The final RMSE after tuning was 40.
8) If this RMSE is satisfactory, the model can be deployed. Otherwise, more data can be added and the models re-run, or other algorithms can be tried on the same data in an effort to build a better-performing model.
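
For anyone who wants to try the same comparison outside BootML, here is a minimal sketch using scikit-learn. It is not the code BootML generates. The data is synthetic, only three of the feature names (hr, temp, hum) are borrowed from the list above, and the hyperparameter grid is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the bike data (assumption: NOT the course dataset).
rng = np.random.default_rng(42)
n = 600
feature_names = ["hr", "temp", "hum"]
X = np.column_stack([
    rng.integers(0, 24, n),    # hr: hour of day
    rng.uniform(0.0, 1.0, n),  # temp: normalised temperature
    rng.uniform(0.0, 1.0, n),  # hum: normalised humidity
])
# Made-up demand: peaks around 17:00, rises with temp, falls with hum, plus noise.
y = (200 * np.exp(-((X[:, 0] - 17) ** 2) / 18)
     + 100 * X[:, 1] - 50 * X[:, 2] + rng.normal(0, 20, n))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X, y)
    train_rmse = rmse(y, model.predict(X))
    # scikit-learn maximises scores, so the MSE comes back negated.
    neg_mse = cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_squared_error")
    cv_rmse = float(np.sqrt(-neg_mse).mean())
    results[name] = (train_rmse, cv_rmse)
    print(f"{name}: train RMSE {train_rmse:.1f}, CV mean RMSE {cv_rmse:.1f}")

# Grid search over a small, illustrative hyperparameter grid.
grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [20, 50], "max_features": [1, 2, 3]},
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)

# Pair each importance score with its feature name, highest first.
importances = sorted(
    zip(grid.best_estimator_.feature_importances_, feature_names),
    reverse=True,
)
print(grid.best_params_)
print(importances)
```

A training RMSE far below the cross-validation RMSE, as the fully grown decision tree tends to show, is the overfitting pattern described in step 5.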


#3

A brief summary of the project for predicting future bike demand.
I used BootML for this exercise, and these are the results it produced:

Linear Regression: RMSE 142.554; Cross Validation (Mean): 142.6; Std Dev: 3.951
Decision Tree: RMSE 0.598; Cross Validation (Mean): 60.442; Std Dev: 1.744
Random Forest: RMSE 19.501; Cross Validation (Mean): 46.173; Std Dev: 1.801

Random Forest Final RMSE: 41.307

My assessment is as follows:

  1. The Decision Tree has a very low RMSE on the training data, but cross-validation gave a much higher RMSE, which is a sign of overfitting.
  2. Linear Regression has a very high RMSE on both the training data and cross-validation. Its standard deviation is also the highest at 3.9, so I feel this model is not well suited.
  3. The Random Forest also shows a gap between the training and cross-validation RMSE, which is a sign of overfitting, but of the three models it seems the best suited. The overfitting could be reduced by using more training data.
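
One way to check whether more training data would actually help is a learning curve: if the cross-validation RMSE keeps falling as the training set grows, more data is likely to reduce the gap. Below is a minimal sketch, assuming scikit-learn and synthetic data rather than the course dataset or anything BootML produces.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the bike data (assumption: NOT the course dataset).
rng = np.random.default_rng(0)
n = 800
X = np.column_stack([
    rng.integers(0, 24, n),    # hour of day
    rng.uniform(0.0, 1.0, n),  # normalised temperature
])
y = (200 * np.exp(-((X[:, 0] - 17) ** 2) / 18)
     + 100 * X[:, 1] + rng.normal(0, 20, n))

# Refit the model on growing fractions of the training folds.
sizes, train_scores, cv_scores = learning_curve(
    RandomForestRegressor(n_estimators=30, random_state=0),
    X, y,
    train_sizes=[0.1, 0.3, 0.6, 1.0],
    cv=5,
    scoring="neg_root_mean_squared_error",
    shuffle=True,
    random_state=0,
)

# Scores are negated RMSE values, so flip the sign and average over folds.
train_rmse = -train_scores.mean(axis=1)
cv_rmse = -cv_scores.mean(axis=1)
for m, tr, cv in zip(sizes, train_rmse, cv_rmse):
    print(f"{int(m):4d} rows: train RMSE {tr:.1f}, CV RMSE {cv:.1f}")
```

If the cross-validation RMSE flattens out while the training RMSE stays low, adding rows is unlikely to help much, and constraining the model would be the next thing to try.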

Thank you!


#4

Summary of experiment in AzureML

  1. There are three labels in the data: cnt, registered and casual. The total bike demand (cnt) is the sum of the registered and casual demand.
  2. The data shows a positive correlation with temperature and a positive correlation with windspeed, so linear regression may be suited to this problem.
  3. In Select Columns, I removed the index and date columns. I also removed registered and casual and retained only cnt as the label.
  4. I split the data 80-20 into training and test sets respectively.
  5. I am not sure whether normalization is required, because the data is not spread over a wide range. I ran both scenarios, with and without normalization, and did not see much difference in the results.
  6. I tried two models, linear regression and boosted decision tree, and loaded the scored results for evaluation:

Linear Regression: RMSE 973
Decision Tree: RMSE 732

The Decision Tree is the better of the two models.

  7. I also ran two separate models to predict casual and registered demand. Predicting the two separately and then adding them to get the total bike demand appears preferable, as it leads to a lower RMSE.
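
The comparison in the last point can be sketched in code. This is a rough illustration in scikit-learn, not the AzureML experiment itself: the data is synthetic, GradientBoostingRegressor stands in for the boosted decision tree, and the helper name fit_predict is hypothetical. On made-up data like this, either approach may come out ahead.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bike data (assumption: NOT the course dataset).
rng = np.random.default_rng(1)
n = 1000
workingday = rng.integers(0, 2, n)
temp = rng.uniform(0.0, 1.0, n)
X = np.column_stack([workingday, temp])

# Made-up behaviour: casual riders favour non-working days,
# registered riders favour working days.
casual = 300 * (1 - workingday) * temp + rng.normal(0, 15, n)
registered = 400 * workingday + 200 * temp + rng.normal(0, 25, n)
cnt = casual + registered  # total demand is the sum of the two

# 80-20 split, applied consistently to the features and all three labels.
(X_train, X_test,
 casual_train, casual_test,
 reg_train, reg_test,
 cnt_train, cnt_test) = train_test_split(
    X, casual, registered, cnt, test_size=0.2, random_state=1)


def fit_predict(target_train):
    """Fit a boosted-tree model on one label and predict the test rows."""
    model = GradientBoostingRegressor(random_state=1)
    model.fit(X_train, target_train)
    return model.predict(X_test)


# Approach A: predict cnt directly.
pred_direct = fit_predict(cnt_train)
# Approach B: predict casual and registered separately, then add them.
pred_summed = fit_predict(casual_train) + fit_predict(reg_train)

rmse_direct = float(np.sqrt(mean_squared_error(cnt_test, pred_direct)))
rmse_summed = float(np.sqrt(mean_squared_error(cnt_test, pred_summed)))
print(f"Direct cnt model RMSE: {rmse_direct:.1f}")
print(f"casual + registered RMSE: {rmse_summed:.1f}")
```

Both approaches are scored against the same cnt test values, so the two RMSE figures are directly comparable.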