In response to the "Predict the bike demand in future" task:
Following the checklist for approaching ML projects, I used BootML to build, select and fine-tune the model as follows:
1) I looked at the big picture and considered which factors could affect bike demand. I concluded that weather conditions such as temperature and rain are likely factors, as well as price, traffic and road safety.
2) I gathered the data and saw that it has features that seemed relevant, such as season and temperature, and also fields that seemed irrelevant, such as the record index.
3) I explored the data and confirmed my assumption that demand is positively correlated with temperature and season. I also found a negative correlation with humidity.
4) I prepared the data for ML by replacing missing values with the median, stratifying the data by season and performing feature scaling.
5) I built 3 ML models, using the Linear Regression, Decision Tree and Random Forest algorithms respectively. I obtained the following results for RMSE on the training set and mean RMSE across the cross-validation folds:
Linear Regression: RMSE 142; cross-validation mean RMSE 142
Decision Tree: RMSE 0.45; cross-validation mean RMSE 61
Random Forest: RMSE 19; cross-validation mean RMSE 46
From these results, I concluded that the Linear Regression model didn't perform well, as it had a high RMSE on both the initial set and the cross-validation folds. The Decision Tree showed a very large gap between its initial RMSE (0.45) and its cross-validation RMSE (61), which indicates overfitting: the model failed to generalize during cross-validation. The Random Forest had the lowest cross-validation RMSE (46), so I selected it as the best of the 3 models in this case.
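To make the comparison above concrete, here is a small sketch that takes the RMSE figures I reported and computes the gap between the initial RMSE and the cross-validation RMSE for each model. The variable names (`results`, `gaps`, `best_model`) are just for illustration and are not something BootML produces.

```python
# Reported results: model name -> (initial RMSE, mean cross-validation RMSE).
results = {
    "Linear Regression": (142.0, 142.0),
    "Decision Tree": (0.45, 61.0),
    "Random Forest": (19.0, 46.0),
}

# A large positive gap means the model does much worse on unseen folds
# than on the data it was fitted on, which is a sign of overfitting.
gaps = {name: cv_rmse - train_rmse for name, (train_rmse, cv_rmse) in results.items()}

# Pick the model with the lowest cross-validation RMSE.
best_model = min(results, key=lambda name: results[name][1])

for name, (train_rmse, cv_rmse) in results.items():
    print(f"{name}: initial RMSE={train_rmse}, CV RMSE={cv_rmse}, gap={gaps[name]:.2f}")
print("Lowest cross-validation RMSE:", best_model)
```

The Decision Tree has by far the largest gap, while the Linear Regression has none but is poor everywhere. The Random Forest also shows a gap, so it overfits to some degree as well, but it still generalizes best of the three.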
6) BootML also selected this model for fine-tuning and tuned its hyperparameters using Grid Search, the method I had chosen earlier.
7) The model's feature importance scores indicate that hr (hour of day) and temp (temperature) are the highest contributors to predicted demand.
The final RMSE after tuning was 40.
8) If this RMSE is satisfactory, the model can be deployed. Otherwise, more data can be added and the models re-run, or other algorithms can be tried on the same data in an effort to build a better-performing model.
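For anyone who wants to reproduce steps 4, 6 and 7 by hand, below is a rough scikit-learn sketch of the same sequence: median imputation, a split stratified by season, feature scaling, a Random Forest tuned with Grid Search, and the feature importances. This is not the code BootML generated. The data is synthetic, the column names `hum` and `cnt` are my assumptions (`hr`, `temp` and `season` are the ones mentioned above), and the parameter grid is an arbitrary small example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bike data (NOT the real dataset).
rng = np.random.default_rng(42)
n = 400
season = rng.integers(1, 5, size=n)
hr = rng.integers(0, 24, size=n)
temp = rng.uniform(0.0, 1.0, size=n)
hum = rng.uniform(0.0, 1.0, size=n)
cnt = 100 + 8 * hr + 150 * temp - 60 * hum + rng.normal(0, 10, size=n)

feature_names = ["season", "hr", "temp", "hum"]
X = np.column_stack([season, hr, temp, hum]).astype(float)
X[rng.choice(n, size=20, replace=False), 2] = np.nan  # simulate missing temp values

# Step 4: split so that each season is represented proportionally.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, season))
X_train, y_train = X[train_idx], cnt[train_idx]
X_test, y_test = X[test_idx], cnt[test_idx]

# Step 4 (continued): median imputation and feature scaling, then the model.
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("forest", RandomForestRegressor(random_state=42)),
])

# Step 6: Grid Search over a small, arbitrary hyperparameter grid.
param_grid = {
    "forest__n_estimators": [10, 30],
    "forest__max_features": [2, 4],
}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

# Square root of the mean cross-validated MSE of the best parameter combination.
cv_rmse = float(np.sqrt(-search.best_score_))
# RMSE of the refitted best model on the held-out test split.
test_rmse = float(np.sqrt(np.mean((search.predict(X_test) - y_test) ** 2)))

# Step 7: feature importances of the tuned Random Forest.
forest = search.best_estimator_.named_steps["forest"]
importances = dict(zip(feature_names, forest.feature_importances_))

print("Best parameters:", search.best_params_)
print(f"CV RMSE: {cv_rmse:.1f}, test RMSE: {test_rmse:.1f}")
for name, score in sorted(importances.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Putting the imputer and scaler inside the pipeline means they are re-fitted on each cross-validation fold, so the validation part of each fold does not leak into the preprocessing.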