Predicting Future Bike Demand

Using BootML:

URL: https://jupyter.e.cloudxlab.com/user/anapatel294777/notebooks/BootML/Projects/Bike_Rental_Assmnt_1/Bike_Rental_Assmnt_1_anapatel294777.ipynb

The objective is to build a model that predicts future bike demand using the existing dataset Bike_Data.

This is a supervised learning project, and we use regression algorithms to build the model.

The performance measure selected for this project is Root Mean Square Error (RMSE).

The data file used is bikes.csv.

The bikes.csv file contains the following 17 attributes:

Data columns (total 17 columns):

instant       17379 non-null  int64
dteday        17379 non-null  object
season        17379 non-null  int64
yr            17379 non-null  int64
mnth          17379 non-null  int64
hr            17379 non-null  int64
holiday       17379 non-null  int64
weekday       17379 non-null  int64
workingday    17379 non-null  int64
weathersit    17379 non-null  int64
temp          17379 non-null  float64
atemp         17379 non-null  float64
hum           17379 non-null  float64
windspeed     17379 non-null  float64
casual        17379 non-null  int64
registered    17379 non-null  int64
cnt           17379 non-null  int64

The irrelevant attributes discarded from bikes.csv are:

instant, dteday, atemp, casual and registered

The relevant attributes in bikes.csv are:

season, yr, mnth, hr, holiday, weekday, workingday, weathersit, temp, hum, windspeed, cnt
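The column-dropping step above can be sketched with pandas. The DataFrame here is a tiny synthetic stand-in with the same column names as bikes.csv (the illustrative values are assumptions, since the real file lives on CloudxLab):

```python
import pandas as pd

# Small synthetic frame with the same columns as bikes.csv
# (the values are illustrative only, not real data).
bikes = pd.DataFrame({
    "instant": [1, 2], "dteday": ["2011-01-01", "2011-01-01"],
    "season": [1, 1], "yr": [0, 0], "mnth": [1, 1], "hr": [0, 1],
    "holiday": [0, 0], "weekday": [6, 6], "workingday": [0, 0],
    "weathersit": [1, 1], "temp": [0.24, 0.22], "atemp": [0.2879, 0.2727],
    "hum": [0.81, 0.80], "windspeed": [0.0, 0.0],
    "casual": [3, 8], "registered": [13, 32], "cnt": [16, 40],
})

# Drop the attributes judged irrelevant for the model.
irrelevant = ["instant", "dteday", "atemp", "casual", "registered"]
bikes = bikes.drop(columns=irrelevant)
print(list(bikes.columns))
```

After the drop, only the 12 relevant attributes remain, with cnt still present as the label.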

Features are:

season, yr, mnth, hr, holiday, weekday, workingday, weathersit, temp, hum, windspeed

Label is:

cnt

The data is split into training and test sets in an 80:20 ratio.
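The 80:20 split can be sketched with scikit-learn's `train_test_split` (the random arrays below are placeholders for the 11 feature columns and the cnt label; the fixed seed is an assumption for reproducibility):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 11 feature columns, one cnt-like label.
rng = np.random.default_rng(42)
X = rng.random((100, 11))
y = rng.integers(0, 500, size=100)

# 80:20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 80 20
```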

Data Visualization:

Correlation of each attribute with cnt:

season        0.178468
yr            0.256133
mnth          0.120683
hr            0.395196
holiday      -0.027901
weekday       0.024038
workingday    0.026984
weathersit   -0.135369
temp          0.400834
hum          -0.323585
windspeed     0.095034
cnt           1.000000

We explored the data to see how cnt correlates with the other features and found that cnt has a positive correlation with temp and a negative correlation with hum (humidity).
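A correlation check like the one above can be sketched with `DataFrame.corr()`. The data here is synthetic, built so that a cnt-like column rises with temp and falls with hum, mimicking the pattern observed in the real dataset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: cnt rises with temp and falls with hum.
rng = np.random.default_rng(0)
temp = rng.random(200)
hum = rng.random(200)
cnt = 300 * temp - 200 * hum + rng.normal(0, 10, 200)

df = pd.DataFrame({"temp": temp, "hum": hum, "cnt": cnt})

# Pearson correlation of every column with cnt.
corr = df.corr()["cnt"].sort_values(ascending=False)
print(corr)
```

On the real data the same call produces the correlation table shown above.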

During data preparation, missing values were filled with the median of each attribute.
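Median imputation can be sketched with scikit-learn's `SimpleImputer` (the small array is an illustrative assumption):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One missing value in the second column.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [5.0, 6.0]])

# Replace NaN with the median of each column.
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)
print(X_filled)  # NaN replaced by the column median 5.0
```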

Since all remaining attributes are numerical, there is no need to separate numerical and categorical features.

Standardization was used for feature scaling.
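Standardization rescales each feature to zero mean and unit variance; a minimal sketch with scikit-learn's `StandardScaler` (the toy column is an assumption):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single toy feature column.
X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Fit the scaler and transform: (x - mean) / std per column.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(), X_scaled.std())  # ~0 and ~1
```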

We trained Linear Regression, Random Forest, and Decision Tree models and computed the training RMSE, the cross-validation (CV) mean RMSE, and its standard deviation (SD) for each algorithm. The results are below:

Linear Regression:
  RMSE:    142.75546873823012
  CV Mean: 142.8232428746583
  SD:      3.547820274468247

Random Forest:
  RMSE:    16.20881230645367
  CV Mean: 44.14364187895045
  SD:      1.9809271591973698

Decision Tree:
  RMSE:    0.45367266614794205
  CV Mean: 61.14492822650442
  SD:      3.4314395049588122
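The comparison above can be sketched as follows, with `make_regression` standing in for the real bikes data (the dataset, fold count, and forest size here are assumptions, so the numbers will differ from the report's):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the prepared bikes features.
X, y = make_regression(n_samples=200, n_features=11, noise=10.0,
                       random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=30, random_state=42),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}

for name, model in models.items():
    # Training RMSE on the full training set.
    model.fit(X, y)
    rmse = np.sqrt(mean_squared_error(y, model.predict(X)))

    # 5-fold cross-validated RMSE: mean and standard deviation.
    scores = cross_val_score(model, X, y,
                             scoring="neg_mean_squared_error", cv=5)
    cv_rmse = np.sqrt(-scores)
    print(f"{name}: RMSE {rmse:.2f}, "
          f"CV Mean {cv_rmse.mean():.2f}, SD {cv_rmse.std():.2f}")
```

Note that a Decision Tree's near-zero training RMSE is exactly the overfitting symptom discussed below: the tree memorizes the training set, so its cross-validation RMSE is much higher.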

My Observations:

- Linear Regression (LR) has a much higher RMSE and CV mean than both Decision Tree and Random Forest, so LR is not a good model.
- Decision Tree might appear to be the best model based on its training RMSE, but its cross-validation mean is higher than Random Forest's.
- So, Random Forest seems to be the best model among the three. The gap between its training RMSE and cross-validation score indicates overfitting, which could be reduced with more training data.
- Another mark of a good model is a low standard deviation in error, and the SD is also lowest for Random Forest. Hence, Random Forest was selected as the best model.

BootML also selected the Random Forest model for fine-tuning. After fine-tuning, the final RMSE is 40.15593877206158.
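A fine-tuning step like BootML's can be sketched with `GridSearchCV` over a Random Forest. The parameter grid, fold count, and synthetic data below are assumptions for illustration, not BootML's actual search space:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training data.
X, y = make_regression(n_samples=150, n_features=11, noise=10.0,
                       random_state=42)

# Illustrative hyperparameter grid for the forest.
param_grid = {"n_estimators": [10, 30], "max_features": [4, 8]}

grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X, y)

# Best hyperparameters and the corresponding cross-validated RMSE.
best_rmse = np.sqrt(-grid.best_score_)
print(grid.best_params_, best_rmse)
```

The best estimator found this way would then be evaluated once on the held-out 20% test set to get the final RMSE.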

Thanks!