Predict Future Bike Demand
The objective is to build a model that predicts future bike demand using the existing dataset Bike_Data.
This is a supervised learning project, and we use regression algorithms to build the model.
The performance measure selected for this project is Root Mean Square Error (RMSE).
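As a reminder of what this measure computes, here is a minimal sketch of RMSE in plain Python. The function name rmse and the sample values are illustrative assumptions, not part of the project code.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # Square each error, average the squares, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only, not taken from bikes.csv.
example = rmse([10, 20, 30], [12, 18, 33])
print(example)
```

Because the errors are squared before averaging, large misses are penalised more heavily than small ones, and the result is in the same unit as cnt.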
The data file used is bikes.csv.
The bikes.csv file contains the following 17 attributes:
Data columns (total 17 columns):
instant       17379 non-null  int64
dteday        17379 non-null  object
season        17379 non-null  int64
yr            17379 non-null  int64
mnth          17379 non-null  int64
hr            17379 non-null  int64
holiday       17379 non-null  int64
weekday       17379 non-null  int64
workingday    17379 non-null  int64
weathersit    17379 non-null  int64
temp          17379 non-null  float64
atemp         17379 non-null  float64
hum           17379 non-null  float64
windspeed     17379 non-null  float64
casual        17379 non-null  int64
registered    17379 non-null  int64
cnt           17379 non-null  int64
The irrelevant attributes discarded from bikes.csv are:
instant, dteday, atemp, casual and registered
The relevant attributes in bikes.csv are:
season, yr, mnth, hr, holiday, weekday, workingday, weathersit, temp, hum, windspeed, cnt
Of these, cnt is the value to predict and the remaining attributes are the features:
season, yr, mnth, hr, holiday, weekday, workingday, weathersit, temp, hum, windspeed
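The column selection above could be sketched with pandas roughly as follows. The helper name select_columns and the two sample rows are assumptions made for illustration. In the real project the DataFrame would presumably come from pd.read_csv("bikes.csv").

```python
import pandas as pd

# Attributes the write-up lists as discarded, and the column to predict.
DISCARDED = ["instant", "dteday", "atemp", "casual", "registered"]
TARGET = "cnt"


def select_columns(df):
    """Drop the discarded attributes and split the features from the target."""
    kept = df.drop(columns=DISCARDED)
    features = kept.drop(columns=[TARGET])
    target = kept[TARGET]
    return features, target


# Two made-up rows with the same 17 columns as bikes.csv (values are illustrative).
sample = pd.DataFrame({
    "instant": [1, 2], "dteday": ["2011-01-01", "2011-01-01"],
    "season": [1, 1], "yr": [0, 0], "mnth": [1, 1], "hr": [0, 1],
    "holiday": [0, 0], "weekday": [6, 6], "workingday": [0, 0],
    "weathersit": [1, 2], "temp": [0.30, 0.25], "atemp": [0.31, 0.26],
    "hum": [0.70, 0.75], "windspeed": [0.10, 0.00],
    "casual": [3, 8], "registered": [13, 32], "cnt": [16, 40],
})

X, y = select_columns(sample)
print(list(X.columns))
```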
Split the data into training and test sets in a ratio of 80:20.
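One common way to get an 80:20 split is scikit-learn's train_test_split. The sketch below uses a synthetic stand-in for the data, and the random_state value is an assumed seed. The source does not say how the split was actually performed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared bikes.csv data (100 rows).
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "temp": rng.uniform(0, 1, 100),
    "hum": rng.uniform(0, 1, 100),
    "cnt": rng.integers(1, 500, 100),
})

# test_size=0.2 gives the 80:20 ratio; random_state is an assumed seed
# that makes the split reproducible.
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
print(len(train_set), len(test_set))
```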
Explored the data to see how cnt is correlated with the other features, and found that cnt has a positive correlation with temp and a negative correlation with hum (humidity).
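A correlation check of this kind can be sketched with pandas as below. The data is synthetic and deliberately built so that cnt rises with temp and falls with hum. It only illustrates the technique and does not reproduce the project's actual correlation values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
temp = rng.uniform(0, 1, n)
hum = rng.uniform(0, 1, n)
# Synthetic target: increases with temp, decreases with hum, plus noise.
cnt = 200 + 300 * temp - 150 * hum + rng.normal(0, 20, n)
data = pd.DataFrame({"temp": temp, "hum": hum, "cnt": cnt})

# Pearson correlation of every column with cnt, strongest positive first.
corr_with_cnt = data.corr()["cnt"].sort_values(ascending=False)
print(corr_with_cnt)
```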
Prepared the data by filling missing values with the median.
As all the data is numerical, there is no need to separate the numerical and categorical features.
Used standardization for feature scaling.
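These two preparation steps could be combined in a scikit-learn Pipeline, for example as sketched below. The name prep_pipeline and the tiny input array are assumptions for illustration. The source does not show how the preparation was implemented.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Median imputation followed by standardization (zero mean, unit variance).
prep_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Tiny illustrative matrix with one missing value in the second column.
X_raw = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
X_prepared = prep_pipeline.fit_transform(X_raw)
print(X_prepared)
```

Fitting the pipeline on the training set only, and then calling transform on the test set, keeps test-set statistics from leaking into the scaler.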
Used the Linear Regression, Random Forest and Decision Tree algorithms, and computed the RMSE, the cross-validation score and the standard deviation for each algorithm. The cross-validation means are as below:
Linear Regression CV Mean: 142.8232428746583
Random Forest CV Mean: 44.14364187895045
Decision Tree CV Mean: 61.14492822650442
- The Linear Regression (LR) RMSE and CV mean are much higher than those of both the Decision Tree and Random Forest algorithms, hence LR is not a suitable model.
- The Decision Tree algorithm might appear to be the best model, but its cross-validation mean is higher than that of the Random Forest algorithm.
- So the Random Forest algorithm seems to be the best-fit model among the 3 algorithms. The variation between its error on the training data and on the cross-validation folds shows overfitting, but this can be reduced by using more training data.
- Another measure of a good model is a low standard deviation in the error. The SD is also lowest for Random Forest. Hence, Random Forest was selected as the best model.
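The comparison described above can be sketched with scikit-learn's cross_val_score as follows. The data is synthetic, and the fold count and model settings are assumptions, so the printed numbers will not match the CV means reported for bikes.csv.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the prepared features and cnt.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 4))
y = 200 * X[:, 0] - 100 * X[:, 1] + 50 * np.sin(6 * X[:, 2]) + rng.normal(0, 10, 300)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=30, random_state=42),
}

results = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so MSE is returned negated.
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
    fold_rmse = np.sqrt(-scores)
    results[name] = (fold_rmse.mean(), fold_rmse.std())
    print(f"{name}: CV mean = {fold_rmse.mean():.2f}, SD = {fold_rmse.std():.2f}")
```

Comparing each model's training RMSE against its CV mean is what exposes overfitting: a model whose training error is far below its cross-validation error has memorised the training set.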
BootML also selected the Random Forest model for fine-tuning. After fine-tuning, the final RMSE is 40.15593877206158.
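The source does not describe how BootML performed the fine-tuning. As one plausible approach, the sketch below uses scikit-learn's GridSearchCV over a small, assumed parameter grid on synthetic data, then evaluates the best model on a held-out test set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the prepared bikes.csv features and cnt.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 4))
y = 200 * X[:, 0] - 100 * X[:, 1] + rng.normal(0, 10, 300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumed grid: the hyperparameters actually searched are not given in the source.
param_grid = {"n_estimators": [10, 30], "max_features": [2, 4]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the best model once on the held-out test set.
final_predictions = search.best_estimator_.predict(X_test)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
print(search.best_params_, final_rmse)
```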