Predict the bike demand in future

Build a model to predict future bike demand. This project is part of the “AI and Machine Learning for Managers” course. Find more details about the project here.

After you have trained the model, document the RMSE and your learnings here. Be as detailed as possible.

Your project might be graded in the course and we will publish a few solutions in our blog mentioning your name.

All the best and happy learning!

In response to Predict the bike demand in future task:

Following the checklist sequence of approaching ML projects, I used BootML to build, select and fine tune the model as below:

  1. I looked at the big picture and considered which factors could affect bike demand. I concluded that weather conditions such as temperature and rain could matter, as well as price, traffic and road safety.
  2. I gathered the data and saw that it has features that seem relevant, such as season and temperature, and fields that seem irrelevant, such as the record index.
  3. I explored the data and confirmed my assumption that demand correlates positively with temperature and season. I also found a negative correlation with humidity.
  4. I prepared the data for ML by replacing missing values with the median, stratifying the data by season and performing feature scaling.
  5. I built 3 ML models, using the Linear Regression, Decision Tree and Random Forest algorithms respectively. I received the following results for RMSE and cross-validation mean RMSE:

Linear Regression: RMSE 142; Cross Valid RMSE: Mean 142
Decision Tree: RMSE 0.45; Cross Valid RMSE: Mean 61
Random Forest: RMSE 19; Cross Valid RMSE: Mean 46

From these results, I concluded that the Linear Regression model didn’t perform well, as it had a high RMSE on both the training set and the cross-validation folds. The Decision Tree showed a very large gap between its training RMSE (0.45) and its cross-validation RMSE (61), which indicates overfitting: it failed to generalize during cross-validation. I therefore selected the Random Forest as the best of the 3 models in this case.

  6. BootML also selected this model for fine-tuning and performed hyperparameter tuning through Grid Search, as I had selected previously.
  7. The model reported the following feature importance scores:

[(0.587627766224623, 'hr'),
(0.12311873416414294, 'temp'),
(0.08328849280662341, 'yr'),
(0.06137417658730989, 'workingday'),
(0.03907266576921952, 'hum'),
(0.026561816940300353, 'season'),
(0.023726472869670133, 'weekday'),
(0.02209473943702688, 'mnth'),
(0.01769931667812533, 'weathersit'),
(0.013187712009321615, 'windspeed'),
(0.002248106513636926, 'holiday')]

which indicates that hr and temp contribute most to the predicted demand.
The final RMSE after tuning was 40.
  8. If this RMSE is satisfactory, the model can be deployed. Otherwise, more data can be added and the models re-run, or other algorithms can be tried on the same data in an effort to build a better-performing model.
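
The importance scores listed above are easier to read as a ranked table with a running total. The following is a minimal sketch in plain Python that uses the scores exactly as posted; the function and variable names are illustrative and are not part of the BootML output.

```python
# Feature importance scores as posted above: (score, feature name).
importances = [
    (0.587627766224623, "hr"),
    (0.12311873416414294, "temp"),
    (0.08328849280662341, "yr"),
    (0.06137417658730989, "workingday"),
    (0.03907266576921952, "hum"),
    (0.026561816940300353, "season"),
    (0.023726472869670133, "weekday"),
    (0.02209473943702688, "mnth"),
    (0.01769931667812533, "weathersit"),
    (0.013187712009321615, "windspeed"),
    (0.002248106513636926, "holiday"),
]


def cumulative_importance(pairs):
    """Return (name, score, running_total) rows sorted by descending score."""
    rows, running = [], 0.0
    for score, name in sorted(pairs, reverse=True):
        running += score
        rows.append((name, score, running))
    return rows


ranked = cumulative_importance(importances)
for name, score, running in ranked:
    print(f"{name:<11} {score:6.3f}  cumulative {running:6.3f}")
```

Read this way, hr and temp together account for about 71% of the total importance, and the scores sum to 1.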

A brief on the project for predicting future bike demand.
I used BootML for this exercise. The following are the results from BootML:

Linear Regression: - RMSE: 142.554 Cross Validation (Mean): 142.6 Std Dev: 3.951
Decision Tree: - RMSE: 0.598 Cross Validation (Mean): 60.442 Std Dev: 1.744
Random Forest: - RMSE: 19.501 Cross Validation (Mean): 46.173 Std Dev: 1.801

Random Forest Final RMSE: 41.307

The following is my assessment:

  1. Although the Decision Tree has a very low RMSE on the training data, its cross-validation RMSE is much higher, which is a sign of overfitting.
  2. Linear Regression has a very high RMSE on both the training data and cross-validation. Its standard deviation is also the highest at 3.9, so I feel this model is not well suited.
  3. Random Forest also shows a gap between the training RMSE and the cross-validation RMSE, which is a sign of overfitting, but of the three models it seems the best suited. The overfitting could be reduced by using more training data.
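
The overfitting argument above comes down to simple arithmetic on the reported numbers. As a rough sketch (the helper name `generalization_gap` is made up for illustration), the gap between cross-validation RMSE and training RMSE can be computed like this:

```python
# RMSE values as reported above: (training RMSE, cross-validation mean RMSE).
results = {
    "Linear Regression": (142.554, 142.6),
    "Decision Tree": (0.598, 60.442),
    "Random Forest": (19.501, 46.173),
}


def generalization_gap(train_rmse, cv_rmse):
    """Difference between cross-validation RMSE and training RMSE."""
    return cv_rmse - train_rmse


gaps = {name: generalization_gap(*scores) for name, scores in results.items()}
best_cv = min(results, key=lambda name: results[name][1])

for name, (train_rmse, cv_rmse) in results.items():
    print(f"{name:<18} train={train_rmse:8.3f} cv={cv_rmse:8.3f} gap={gaps[name]:7.3f}")
print("Lowest cross-validation RMSE:", best_cv)
```

The Decision Tree has the largest gap, Linear Regression has almost none but a high error overall, and Random Forest has the lowest cross-validation RMSE.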

Thank you!

Summary of experiment in AzureML

  1. There are three candidate labels in the data: cnt, registered and casual. The total bike demand (cnt) is the sum of the registered and casual demand.
  2. The data shows a positive correlation with temperature and a positive correlation with windspeed. It appears that linear regression may suit this problem.
  3. In Select Columns, the index and date columns were removed. Similarly, registered and casual were removed and only cnt was retained.
  4. I split the data 80:20 into training and test sets respectively.
  5. I am not sure whether normalization is required, because the values are not spread over very different ranges. I ran both scenarios, with and without normalization, and did not see much difference in the results.
  6. I tried two models, linear regression and boosted decision tree, and loaded the scored results into the evaluation:

Linear Regression: RMSE 973
Decision Tree: RMSE 732

The boosted Decision Tree is the better of the two models.

  7. I also ran two separate models to predict casual and registered demand. Predicting the two separately and then adding the predictions to get the total bike demand appears preferable, as it leads to a lower RMSE.
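
To illustrate the last point, here is a small sketch of combining two separate predictions into a total and scoring the result. All numbers are made-up toy values, not results from the Azure ML experiment, and `rmse` is a hand-written helper rather than an Azure ML function.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical predictions from two separately trained models (toy values).
casual_pred = [120.0, 300.0, 80.0]
registered_pred = [900.0, 1500.0, 700.0]

# Total demand is the element-wise sum, because cnt = casual + registered.
total_pred = [c + r for c, r in zip(casual_pred, registered_pred)]

# Hypothetical observed cnt values for the same three days.
actual_cnt = [1000.0, 1850.0, 790.0]
print(total_pred, round(rmse(actual_cnt, total_pred), 2))
```

Whether the summed prediction beats a single cnt model depends on the data; the RMSE of both approaches has to be compared on the same test set.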

RMSE:
Decision Tree - 0.4259104341371864
Random Forest Model - 19.044611139765895
Linear Regression Model - 141.24365716067499
Cross Validation Mean RMSE:
Decision Tree - 60.8874659713422
Random Forest Model - 45.27664533486017
Linear Regression Model - 141.30051634070458
Used the Random Forest model for hyperparameter tuning
Final RMSE : 40.913059158514464

Correlation:
season 0.184377
yr 0.255502
mnth 0.127409
hr 0.391871
holiday -0.034094
weekday 0.028801
workingday 0.027403
weathersit -0.144581
temp 0.403476
atemp 0.399118
hum -0.324475
windspeed 0.088802

  1. Linear Regression:
    RMSE: 142.45
    CV RMSE: 140.51

  2. Decision Tree:
    RMSE: 0.59
    CV RMSE: 59.58

  3. Random Forest:
    RMSE: 19.42
    CV RMSE: 46

Random Forest Model:
Final RMSE: 41.13

The objective of the Bikedemand project is to predict future bike demand by building a suitable model from existing data. We use a dataset that contains hourly rental bike demand. It is located at /cxldata/datasets/bootml/Bikes_Data_1

This is a supervised learning project, and we use regression algorithms to build the model. The performance measure we selected is Root Mean Square Error (RMSE). The data file used is bikes.csv.

We discarded the irrelevant features instant, dteday, atemp, casual and registered from the input file. Since we need to predict bike demand, we selected cnt as the label.

We then split the data into a training set and a test set in the ratio 80:20, with 42 as the random seed. The fixed seed ensures that the same rows end up in the training and test sets on every run. I used stratified sampling on the weathersit feature so that each weathersit value is well represented in both sets.

Next, we visualized the data: we selected the kind of visualization and generated the correlations and the scatter matrix to check how the features relate to one another.

We then prepared the data through data cleaning and feature scaling. Missing values are replaced with the median, and we selected standardization as the feature scaling technique.
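
The 80:20 stratified split with a fixed seed described above can be sketched in plain Python. This is only an illustration of the idea, not the code BootML generates; `stratified_split` and the toy rows are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, test_ratio=0.2, seed=42):
    """Split rows into (train, test), sampling test_ratio from each stratum."""
    rng = random.Random(seed)  # a fixed seed gives the same split on every run
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)

    train, test = [], []
    for group in strata.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_ratio)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Toy rows standing in for bikes.csv records (values are made up).
rows = [{"weathersit": w, "cnt": i} for i, w in enumerate([1] * 60 + [2] * 30 + [3] * 10)]
train, test = stratified_split(rows, key="weathersit")
print(len(train), len(test))
```

Because sampling happens inside each weathersit group, even the rarest value keeps its 20% share in the test set, which a purely random split does not guarantee.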

Then we focused on cross-validation: we created 10 folds and tested three algorithms, Linear Regression, Random Forest and Decision Tree. We then fine-tuned the hyperparameters by selecting grid search. Finally, we generated the machine learning code in a Jupyter notebook, ran it and analyzed the model performance from the results.
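
For readers unfamiliar with 10-fold cross-validation, the sketch below shows how the rows could be divided so that each row is used for validation exactly once. It is a simplified illustration (contiguous folds, no shuffling) and `k_fold_indices` is a hypothetical helper, not the notebook's actual code.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_samples=25, k=10))
for train_idx, val_idx in folds[:2]:
    print(len(train_idx), val_idx)
```

A model is trained on each training part and scored on the matching validation part; the mean and standard deviation of those 10 scores are what the results report.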

The RMSE for the linear regression model is 142.72311067462795

Mean: 142.77813742724223

Standard deviation: 3.7030212904152973

The RMSE for the Decision tree model is 0.5989453436724405

Mean: 59.78945853931843

Standard deviation: 3.251904916375031

The RMSE for the Random forest model is 19.581022508088243

Mean: 45.455970771203454

Standard deviation: 2.7452970010730895

The final RMSE for the Random Forest model after fine-tuning is 41.96929332427234

Comparing the models, the standard deviation is lowest for the Random Forest model, which is the model selected and fine-tuned here; its final RMSE is 41.969.
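
The Mean and Standard deviation figures above summarize the per-fold RMSE scores. A minimal sketch of that summary follows; the fold scores are invented for illustration, and the use of the population standard deviation is an assumption about how the notebook computes it.

```python
import statistics

# Hypothetical per-fold RMSE scores from 10-fold cross-validation (made-up values).
fold_rmse = [44.1, 47.3, 43.8, 49.0, 45.2, 41.9, 46.6, 48.4, 42.7, 45.0]

mean_rmse = statistics.mean(fold_rmse)
# Population standard deviation; assumed here, the notebook may use another convention.
std_rmse = statistics.pstdev(fold_rmse)

print(f"Mean: {mean_rmse:.3f}  Standard deviation: {std_rmse:.3f}")
```

A small standard deviation means the model scores similarly on every fold, so its mean RMSE is a more trustworthy estimate.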

If this performance is acceptable, we can present the solution to the customers; otherwise, we need to revisit the quality of the data, fine-tune further and come up with a better solution.

The final model and its results are in https://jupyter.e.cloudxlab.com/user/alagesan10p5162/notebooks/BootML/Projects/Bikedemand_1/Bikedemand_1_alagesan10p5162.ipynb

  1. The aim is to build a model that estimates future bike demand from the parameters observed in the past.
  2. Azure ML was used to build this model.
  3. Data was uploaded from ml/machine_learning/datasets/bike_sharing/day.csv.
  4. The first two features, instant and date, were dropped in Select Columns in Dataset. Registered and casual were also dropped, since the aim was to predict the combined value in cnt.
  5. cnt was declared as the label.
  6. The training and test datasets were split 80:20, stratified on humidity.
  7. Two models were used: Linear Regression and Decision Tree Regression.
  8. With Linear Regression the final Root Mean Squared Error was 1483.986895. The Decision Tree Regression model gave a final Root Mean Squared Error of 1369.039059.

Hi,

I used BootML for the project work.

  1. Analyzed the data.
  2. Discarded the irrelevant data fields.
  3. Used the 80:20 rule for the data split.
  4. Compared the RMSE of 3 models: Random Forest, Decision Tree and Linear Regression.
  5. Fine-tuned the Random Forest model.

RMSE in Random Forest model = 254.4944410022101
RMSE in Decision Tree model = 0.0
RMSE in Linear Regression Model = 878.4199081197479

final RMSE = 675.3891431118869.

Thanks
Deepak

Predict the Bike Demand in future

Using BootML:

URL: https://jupyter.e.cloudxlab.com/user/anapatel294777/notebooks/BootML/Projects/Bike_Rental_Assmnt_1/Bike_Rental_Assmnt_1_anapatel294777.ipynb

The objective is to build a model that predicts future bike demand using the existing dataset Bike_Data.

This is a supervised learning project, and we are using regression algorithms to build the model.
The performance measure we selected for this project is Root Mean Square Error.
The data file used is bikes.csv.

The following are the 17 attributes in the bikes.csv file:

Data columns (total 17 columns):
instant 17379 non-null int64
dteday 17379 non-null object
season 17379 non-null int64
yr 17379 non-null int64
mnth 17379 non-null int64
hr 17379 non-null int64
holiday 17379 non-null int64
weekday 17379 non-null int64
workingday 17379 non-null int64
weathersit 17379 non-null int64
temp 17379 non-null float64
atemp 17379 non-null float64
hum 17379 non-null float64
windspeed 17379 non-null float64
casual 17379 non-null int64
registered 17379 non-null int64
cnt 17379 non-null int64

The irrelevant attributes discarded from bikes.csv are:

instant, dteday, atemp, casual and registered

The relevant attributes in bikes.csv are:

season, yr, mnth, hr, holiday, weekday, workingday, weathersit, temp, hum, windspeed, cnt

The features are:

season, yr, mnth, hr, holiday, weekday, workingday, weathersit, temp, hum, windspeed

The label is:

cnt
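
The column selection above (drop five attributes, keep eleven features, use cnt as the label) could look roughly like this with Python's standard `csv` module. The two sample rows are illustrative stand-ins for bikes.csv, and reading from an in-memory string instead of the real file is an assumption made to keep the sketch self-contained.

```python
import csv
import io

DISCARDED = {"instant", "dteday", "atemp", "casual", "registered"}
LABEL = "cnt"

# Two illustrative rows with the same 17 column names as bikes.csv.
sample_csv = io.StringIO(
    "instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,"
    "temp,atemp,hum,windspeed,casual,registered,cnt\n"
    "1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16\n"
    "2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.80,0.0,8,32,40\n"
)

features, labels = [], []
for row in csv.DictReader(sample_csv):
    labels.append(int(row[LABEL]))
    # Keep every column that is neither discarded nor the label.
    features.append({k: float(v) for k, v in row.items()
                     if k not in DISCARDED and k != LABEL})

print(sorted(features[0]), labels)
```

Dropping casual and registered matters most: they add up to cnt, so keeping them as features would leak the answer into the model.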

Split the data into training and test sets in the ratio 80:20.

Data Visualization (correlation of each feature with cnt):
season 0.178468
yr 0.256133
mnth 0.120683
hr 0.395196
holiday -0.027901
weekday 0.024038
workingday 0.026984
weathersit -0.135369
temp 0.400834
hum -0.323585
windspeed 0.095034
cnt 1.000000

I explored the data to see how cnt is correlated with the other features, and found that cnt has a positive correlation with temp and a negative correlation with humidity.
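
The correlation values listed above are Pearson correlation coefficients between each feature and cnt. As a rough illustration of what that number measures, here is a hand-written version applied to made-up values (the `pearson` helper and the toy data are not taken from the notebook):

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: demand rises with temperature and falls with humidity.
temp = [0.2, 0.4, 0.5, 0.7, 0.8]
hum = [0.9, 0.7, 0.8, 0.5, 0.4]
cnt = [40, 110, 150, 260, 300]

print(round(pearson(temp, cnt), 3), round(pearson(hum, cnt), 3))
```

A value near +1 means the feature and the demand rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.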

Prepared the data and filled the missing values with the median.
As all the data is numerical, there is no need to separate the numerical and categorical features.
Used standardization for feature scaling.
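
The two preparation steps above, median imputation and standardization, can be sketched for a single column as follows. This is plain Python for illustration only; the helper names and the sample humidity values are made up.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Made-up humidity column with one missing entry.
hum = [0.80, None, 0.60, 0.40, 0.70]
filled = impute_median(hum)
scaled = standardize(filled)
print(filled, [round(v, 3) for v in scaled])
```

In a real pipeline the median, mean and standard deviation should be computed on the training set only and then reused for the test set.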

Used the Linear Regression, Random Forest and Decision Tree algorithms and computed the RMSE, the cross-validation mean and the standard deviation for each. The results are below:

Linear Regression:
RMSE: 142.75546873823012
CV Mean: 142.8232428746583
SD: 3.547820274468247

Random Forest:
RMSE: 16.20881230645367
CV Mean: 44.14364187895045
SD: 1.9809271591973698

Decision Tree:
RMSE: 0.45367266614794205
CV Mean: 61.14492822650442
SD: 3.4314395049588122

My Assumptions:

  1. The Linear Regression (LR) RMSE and CV mean are much higher than those of both the Decision Tree and Random Forest algorithms, so LR is not a good model here.
  2. The Decision Tree has the lowest training RMSE and might look like the best model, but its cross-validation mean is higher than that of the Random Forest.
  3. So the Random Forest seems to be the best-fitting model of the 3 algorithms. The gap between its training RMSE and its cross-validation RMSE does indicate overfitting, but this can be reduced by using more training data.
  4. The other measure of a good model is a low standard deviation of the error. The SD is also lowest for the Random Forest. Hence, I selected the Random Forest as the best model.

BootML also selected the Random Forest model for fine-tuning. After fine-tuning, the final RMSE is 40.15593877206158

Thanks!

FINAL RMSE: 41.43
If the obtained results meet expectations, publish the model; otherwise, add more input data.

[(0.5940383211759476, 'hr'),
(0.1314648370277314, 'temp'),
(0.08115055651157858, 'yr'),
(0.054600523654828356, 'workingday'),
(0.038212596573476344, 'hum'),
(0.024857598473005206, 'season'),
(0.023493412411162557, 'weekday'),
(0.02033753184618983, 'mnth'),
(0.016669847446518997, 'weathersit'),
(0.012790932124135795, 'windspeed'),
(0.0023838427554254754, 'holiday')]

hr and temp are the most influential contributors to the result (the label).

RMSE in Random Forest model = 16.070058811417443
Cross Validation in Random Forest model =
Mean: 43.73715385071471
Standard deviation: 1.6048984827096304

RMSE in Decision Tree model = 0.5989453436724405
K-fold Cross Validation for Decision Tree Model =
Mean: 60.40019917027024
Standard deviation: 1.9342512865620638

RMSE in Linear Regression Model = 142.55471466066635
K-fold Cross Validation for Linear Regression =
Mean: 142.60048335718665
Standard deviation: 3.9518468729854765

The Random Forest model seems the best choice, based on the RMSE values obtained.

Project Name :- Bike Assessment

Objective: - Predict the Bike Demand in future

Goal:- Develop a suitable model from the parameters observed in the past, using a dataset that contains hourly rental bike demand.

Model identified:- Supervised learning, with regression algorithms used to build the model. Root Mean Square Error was selected as the performance measure. The data file used is bikes.csv.

Data Clean up:-

Irrelevant features were discarded: 'instant', 'dteday', 'atemp', 'casual', 'registered'

Label :- “cnt”

Features :- 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'temp', 'hum', 'windspeed'.

The dataset was split into a training set and a test set in the ratio 80:20, with 42 as the random seed.

Stratified sampling was used on the weathersit feature to ensure that each weathersit value is well represented in both sets.

The correlations between the features were visualized by selecting the scatter matrix and generating the correlations.

Data was imputed as part of data cleaning: missing values were replaced by the median. Standardization was selected as the feature scaling technique.

Using cross-validation with 10 folds, the RMSE of the 3 algorithms (Linear Regression, Random Forest and Decision Tree) was tested. The final model was then built by fine-tuning the hyperparameters with Grid Search.
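
Grid Search, as mentioned above, simply tries every combination of the candidate hyperparameter values and keeps the one with the best score. The sketch below shows the idea only: the parameter grid is hypothetical, and `toy_cv_rmse` is a stand-in formula, not a trained Random Forest.

```python
import itertools


def grid_search(param_grid, score_fn):
    """Try every parameter combination and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # lower is better, e.g. a cross-validation RMSE
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid and a stand-in scoring function (not a real model).
param_grid = {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]}


def toy_cv_rmse(params):
    return abs(params["n_estimators"] - 30) + abs(params["max_features"] - 6) + 40.0


best_params, best_score = grid_search(param_grid, toy_cv_rmse)
print(best_params, best_score)
```

With 3 and 4 candidate values the grid has 12 combinations; in a real run each one is scored with cross-validation, so the cost grows quickly as parameters are added.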

Linear regression model

RMSE = 142.72311067462795

Mean: 142.77813742724223

Standard deviation: 3.7030212904152973

Decision Tree

RMSE = 0.5989453436724405

Mean: 59.78945853931843

Standard deviation: 3.251904916375031

Random Forest

RMSE = 19.581022508088243

Mean: 45.455970771203454

Standard deviation: 2.7452970010730895

The final RMSE for the Random Forest model after fine-tuning is 41.96929332427234

Project Name : Predict Bike demand in future

Step 1. Added the dataset day.csv to the workspace
Step 2. Select Columns in Dataset -
season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed cnt
Step 3. Split Data - Ratio 80:20 and Random Seed 42, stratified sampling on the weathersit column
Step 4. Clean Missing Data -
Columns to be cleaned - All selected columns in Step 2, Missing values are replaced by Median
Step 5. Normalize Data - Transformation Method used - ZScore
Columns Selected - temp atemp hum windspeed
Step 6. Trained 3 different models
Step 7. Cross-validated the models with 10 folds; below are the results for each type of model.

Model 1: Linear Regression
RMSE (Cross Validation) - 904.595091
Standard deviation - 124.904566
Mean - 839.6196

Model 2: Boosted Decision Tree
RMSE (Cross Validation) - 662.472817
Standard deviation - 139.772036
Mean - 618.9144

Model 3: Decision Forest Regression
RMSE (Cross Validation) - 720.395626
Standard deviation - 188.713745
Mean - 676.0888

Step 8. Also improved the model by fine-tuning its hyperparameters
using Random Sweep, with Mean absolute error as the metric for measuring regression performance
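
Step 8 tunes on mean absolute error while Step 7 reports RMSE, and the two metrics treat large errors differently. A short sketch with made-up daily counts (the helper functions are written by hand for illustration and are not Azure ML calls):

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up daily counts: three small errors and one large miss.
actual = [1000, 1200, 900, 1500]
predicted = [990, 1210, 910, 1100]

print(mae(actual, predicted), round(rmse(actual, predicted), 2))
```

The single large miss pushes the RMSE to almost twice the MAE, so a model tuned on MAE is not necessarily the one with the lowest RMSE.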

Project Name - Bike Demand

Model identified- Supervised learning, with regression algorithms used to build the model. Root Mean Square Error was selected as the performance measure. The data file used is bikes.csv.

Data Clean up- Discarded Features = instant, dteday, atemp, casual, registered

Label- cnt

The dataset is split into a training set and a test set in the ratio 80:20, with random seed = 42

Stratified sampling- weathersit

Three algorithms (Linear Regression, Random Forest and Decision Tree) were chosen and their RMSE calculated. The final model was then built by fine-tuning the hyperparameters with Grid Search.

Linear Regression

RMSE = 142.75546873823012
Mean = 142.8232428746583
Standard deviation = 3.5478202744682483

Decision Tree

RMSE = 0.45367266614794205
Mean = 61.14492822650442
Standard deviation = 3.4314395049588122

Random Forest

RMSE = 16.20881230645367
Mean = 44.14364187895045
Standard deviation = 1.9809271591973698

Based on the RMSE and the cross-validation mean, the Random Forest model is chosen.
After further fine-tuning, the final RMSE is:

RMSE = 40.15593877206158