Predicting Future Bike Demand
Using BootML:
URL: https://jupyter.e.cloudxlab.com/user/anapatel294777/notebooks/BootML/Projects/Bike_Rental_Assmnt_1/Bike_Rental_Assmnt_1_anapatel294777.ipynb
The objective is to build a model that predicts future bike demand using the existing dataset Bike_Data.
This is a supervised learning project, and we are using regression algorithms to build the model.
The performance measure we selected for this project is Root Mean Square Error (RMSE).
The data file used is bikes.csv.
Following are the 17 attributes in the bikes.csv file:
Data columns (total 17 columns):
instant 17379 non-null int64
dteday 17379 non-null object
season 17379 non-null int64
yr 17379 non-null int64
mnth 17379 non-null int64
hr 17379 non-null int64
holiday 17379 non-null int64
weekday 17379 non-null int64
workingday 17379 non-null int64
weathersit 17379 non-null int64
temp 17379 non-null float64
atemp 17379 non-null float64
hum 17379 non-null float64
windspeed 17379 non-null float64
casual 17379 non-null int64
registered 17379 non-null int64
cnt 17379 non-null int64
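The listing above can be reproduced with pandas. This is a minimal sketch, not the exact BootML-generated code; the file name bikes.csv comes from the project, and the path is assumed to be the working directory.

```python
import pandas as pd

# Load the bike-sharing dataset (path assumed; adjust to your environment)
bikes = pd.read_csv("bikes.csv")

# Print column names, non-null counts and dtypes, matching the listing above
bikes.info()
```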
Discarded irrelevant attributes from bikes.csv are:
instant, dteday, atemp, casual and registered
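A sketch of dropping those columns, continuing from the DataFrame loaded above:

```python
# Drop identifier/redundant columns; casual and registered sum to cnt,
# so keeping them would leak the label
bikes = bikes.drop(columns=["instant", "dteday", "atemp", "casual", "registered"])
```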
The relevant attributes in bikes.csv are:
season, yr, mnth, hr, holiday, weekday, workingday, weathersit, temp, hum, windspeed, cnt
Features are:
season, yr, mnth, hr, holiday, weekday, workingday, weathersit, temp, hum, windspeed
Label is:
cnt
Split the data into training and test sets in an 80:20 ratio.
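A sketch of the feature/label separation and the 80:20 split using scikit-learn (the random_state value is an assumption):

```python
from sklearn.model_selection import train_test_split

X = bikes.drop(columns=["cnt"])   # the 11 feature columns
y = bikes["cnt"]                  # the label

# 80% training data, 20% test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```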
Data Visualization:
Correlation of each attribute with cnt:
season 0.178468
yr 0.256133
mnth 0.120683
hr 0.395196
holiday -0.027901
weekday 0.024038
workingday 0.026984
weathersit -0.135369
temp 0.400834
hum -0.323585
windspeed 0.095034
cnt 1.000000
Explored the data to see how cnt is correlated with the other features and found that cnt has a positive correlation with temp and a negative correlation with humidity (hum).
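The correlations listed above can be reproduced with pandas; this sketch assumes the DataFrame after dropping the irrelevant columns.

```python
# Correlation of every remaining attribute with cnt
corr_matrix = bikes.corr()
print(corr_matrix["cnt"])
```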
Prepared the data and filled missing values with the median.
As all attributes are numerical, there is no need to separate numerical and categorical features.
Used standardization for feature scaling.
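A sketch of the preparation steps, median imputation followed by standardization, written as a scikit-learn Pipeline; the pipeline structure is an assumption, not the exact BootML code.

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# All attributes are numerical, so a single pipeline covers every column
prep_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values with the median
    ("scaler", StandardScaler()),                   # standardize to zero mean, unit variance
])

X_train_prepared = prep_pipeline.fit_transform(X_train)
X_test_prepared = prep_pipeline.transform(X_test)
```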
Used Linear Regression, Random Forest, and Decision Tree algorithms and computed the RMSE, the cross-validation (CV) mean, and the standard deviation (SD) for each algorithm (a sketch of this evaluation follows the results). They are as below:
Linear Regression:
RMSE: 142.75546873823012
CV Mean: 142.8232428746583
SD: 3.547820274468247
Random Forest:
RMSE: 16.20881230645367
CV Mean: 44.14364187895045
SD: 1.9809271591973698
Decision Tree:
RMSE: 0.45367266614794205
CV Mean: 61.14492822650442
SD: 3.4314395049588122
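A sketch of how the RMSE, CV mean, and SD could be computed for one of these models (Random Forest shown; the same pattern applies to the other two). The seed, hyperparameters, and the choice of 10 folds are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

forest = RandomForestRegressor(random_state=42)  # hyperparameters/seed assumed
forest.fit(X_train_prepared, y_train)

# RMSE on the training set
train_pred = forest.predict(X_train_prepared)
rmse = np.sqrt(mean_squared_error(y_train, train_pred))

# 10-fold cross-validation; sklearn returns negated MSE for this scorer
scores = cross_val_score(forest, X_train_prepared, y_train,
                         scoring="neg_mean_squared_error", cv=10)
cv_rmse = np.sqrt(-scores)

print("RMSE:", rmse, "CV Mean:", cv_rmse.mean(), "SD:", cv_rmse.std())
```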
My Observations:
- The Linear Regression (LR) RMSE and CV mean are much higher than those of both the Decision Tree and Random Forest algorithms, hence LR is not a good model.
- The Decision Tree algorithm might appear to be the best model based on its very low training RMSE, but its cross-validation mean is much higher than that of the Random Forest algorithm, which indicates overfitting.
- So, the Random Forest algorithm seems to be the best-fit model among the three algorithms. The gap between its training RMSE and its cross-validation mean still shows some overfitting, but this could be reduced with more training data.
- Another measure of a good model is a low standard deviation in the error. The SD is also lowest for Random Forest. Hence, the Random Forest model was selected as the best model.
BootML also selected the Random Forest model for fine-tuning. After fine-tuning, the final RMSE is 40.15593877206158.
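A sketch of fine-tuning the Random Forest with GridSearchCV and evaluating the tuned model on the held-out test set. The parameter grid and the 5-fold setting are purely illustrative assumptions; BootML's actual grid may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

# Hypothetical parameter grid, only for illustration
param_grid = {"n_estimators": [10, 30, 100], "max_features": [4, 6, 8]}

grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                           scoring="neg_mean_squared_error", cv=5)
grid_search.fit(X_train_prepared, y_train)

# Evaluate the tuned model on the test set
final_model = grid_search.best_estimator_
final_pred = final_model.predict(X_test_prepared)
final_rmse = np.sqrt(mean_squared_error(y_test, final_pred))
print("Final RMSE:", final_rmse)
```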
Thanks!