Predict the bike demand in future

Project Name - Bike Demand

Model identified - supervised learning with regression algorithms. Root Mean Square Error (RMSE) was selected as the performance measure. The data file used is bikes.csv.

Data Clean up- Discarded Features = instant, dteday, atemp, casual, registered

Label- cnt

The dataset is split into a training set and a test set in the ratio 80:20, with random seed = 42

Stratified sampling- weathersit

Three algorithms (Linear Regression, Random Forest and Decision Tree) are chosen and compared by RMSE. The final model is then built by fine-tuning hyperparameters with Grid Search.
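The 80:20 stratified split described above can be sketched without any libraries. This is only an illustration of the idea, not the code BootML generates: the function name stratified_split and the synthetic rows are made up for the example, and only the seed 42, the 20% test share and the weathersit column come from the post.

```python
import random
from collections import Counter, defaultdict

def stratified_split(rows, key, test_ratio=0.2, seed=42):
    """Split rows into (train, test), keeping the share of each `key` value."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    train, test = [], []
    for value in sorted(groups):
        members = groups[value][:]
        rng.shuffle(members)  # shuffle inside each stratum only
        n_test = round(len(members) * test_ratio)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Synthetic stand-in for bikes.csv (hypothetical values, not real data).
rows = ([{"weathersit": 1, "cnt": i} for i in range(70)]
        + [{"weathersit": 2, "cnt": i} for i in range(20)]
        + [{"weathersit": 3, "cnt": i} for i in range(10)])

train, test = stratified_split(rows, "weathersit")
test_counts = Counter(r["weathersit"] for r in test)
print(len(train), len(test), dict(test_counts))
```

Each weathersit value keeps the same 70/20/10 proportion in the test set as in the full data, which is the point of stratifying.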

Linear regression model

RMSE = 142.75546873823012
Mean=142.8232428746583
Standard deviation: 3.5478202744682483

Decision Tree

RMSE = 0.45367266614794205
Mean= 61.14492822650442
Standard deviation= 3.4314395049588122

Random Forest

RMSE = 16.20881230645367
Mean= 44.14364187895045
Standard deviation=1.9809271591973698

Based on the RMSE and the cross-validation mean, the Random Forest model is chosen.
On further fine tuning we get the final RMSE

RMSE= 40.15593877206158

Tool: BootML
Type of Project: Supervised Learning- Regression
Performance Measure: RMSE
Dataset: Bikes Data, type CSV
Discarded: Casual and Registered
Label: Cnt
Split:80 20, Seed:42
Stratified:Season
Visualize: Season/Count
Generated Correlation
Date: Categorical
Scaling: Standardisation
algo: all three
Hyper Parameter: Grid Search
Stratified Sampling Suggest, equal demand in all seasons
Negative Correlation with weather, humidity and Holiday
RMSE:
Linear Regression: 133
Decision Tree:0
Random Forest: 14

Preferred Model: Random Forest

My findings on the project for predicting the future demand of Bikes.
I used BootML for this exercise and following are the results from BootML:

RMSE selected as the performance measure for the supervised learning algorithm.

Linear Regression: RMSE: 142.554 Cross Validation (Mean): 142.6 Std Dev: 3.951
Decision Tree: RMSE: 0.598 Cross Validation (Mean): 60.398 Std Dev: 1.844
Random Forest: - RMSE: 16.07 Cross Validation (Mean): 43.761 Std Dev: 1.60

Random Forest Final RMSE: 41.4055

The following process was used to train and refine the model:

  1. Discarded the columns dteday, casual, instant and registered. Used cnt as the label, while the rest of the columns were used as features.
  2. No stratified sampling applied.
  3. All features were standardised for feature scaling, and the median was used to impute the missing values.
  4. Grid search was used for hyperparameter fine tuning.
  5. Random seed 42, data split was 80-20 and 10 CV folds were used.
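For anyone wondering what the median imputation and standardisation in step 3 actually do to a column, here is a small standard-library sketch. The notebook BootML generates uses scikit-learn for this (other posts in this thread mention its Imputer class), so the helper names impute_median and standardize and the sample humidity values are assumptions made purely for illustration.

```python
from statistics import fmean, median, pstdev

def impute_median(values):
    """Replace missing entries (None) with the median of the known entries."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up 'hum' column with two missing readings.
hum = [0.81, None, 0.75, 0.80, None, 0.43]

filled = impute_median(hum)    # the gaps become 0.775
scaled = standardize(filled)   # mean ~0, standard deviation ~1
print(filled)
print([round(v, 3) for v in scaled])
```

In a real pipeline the median, mean and standard deviation would be computed on the training set only and then reused for the test set.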

Random Forest showed the best performance and was therefore fine-tuned to get the final RMSE. It has a lower cross-validation mean and standard deviation than the other two models.

I used BootML to predict future bike demand. A few of the attributes seemed irrelevant to the cnt variable, which is the label: instant, dteday, atemp, casual and registered did not seem to have a big impact on it.
I used an 80-20% division into training and test data. I did not use stratified sampling on any attribute.
I used 3 models - Linear Regression, Decision Tree and Random Forest - on the training data. I found the following -
Linear Regression:
RMSE - 142.55471466066635; Standard Deviation - 3.9518468729854312
Decision Tree:
RMSE - 0.5989453436724405; Standard deviation: 1.8183706101648907
Random Forest:
RMSE - 16.062107233125182; Standard deviation: 1.589774803493087

Here, Linear Regression has a very high RMSE on the training set and is thus probably underfitting.
Decision Tree, on the other hand, has a much lower RMSE and a low standard deviation. It is overfitting the training data.
Random Forest sits between the two, with a moderate RMSE and standard deviation. The final RMSE after validation comes to 41.379972879722175. So, Random Forest is the model that best fits the data.

Name of the Project : Predict the bike demand in future

Technique Used: Supervised learning technique is used. The following are used for creation of the Machine Learning Model:

Columns Selected: All the columns of the dataset are selected
Splitting : The ratio of training to testing data is 80:20

The following Models are used to develop the ML algorithms:

** Linear Regression : RMSE :**

Median 0.0054
Standard Deviation 0.0018

After Hyperparameter Tuning

Root Mean Squared Error 823.150968

** Boosted Decision Tree Regression **

Median 114.6348
Standard Deviation 33.4545

After Hyperparameter Tuning

Root Mean Squared Error 804.039516

** Decision Forest Regression **

Median:142.8445
Standard Deviation:46.265

After Hyperparameter Tuning

Median 638.7702
Standard Deviation NaN

Analysis:
For Linear Regression, the initial RMSE = 0.0054, but after hyperparameter tuning it’s 823.150968. This shows that the model is highly overfitting.
For Boosted Decision Tree Regression, the RMSE = 114.6348, but after hyperparameter tuning it’s 804.039516. This is lower than Linear Regression, but the RMSE is still high, so we can check one more model.
For Decision Forest Regression, the RMSE = 142.8445, and after hyperparameter tuning it’s 638.7702. This looks to be the least overfitting.

On the basis of the above results, it is inferred that Decision Forest Regression is the best algorithm to use, as its RMSE after hyperparameter tuning is 638.7702, which is the lowest among the three algorithms.

Experiment in AzureML

  • The total bike demand (cnt) is sum of registered and casual bike demand. Label “cnt”
  • The index, registered and casual columns have been removed
  • Splitting data using 80-20 rule for training and test data respectively
  • Normalize only: temp; atemp; hum; windspeed
  • Data shows positive correlation to temp and atemp

Linear regression model
Mean Absolute Error 666.2228
Root Mean Squared Error 958.536074
Relative Absolute Error 0.386401
Relative Squared Error 0.225061
Coefficient of Determination 0.774939

Decision tree regression model
Mean Absolute Error 509.538286
Root Mean Squared Error 762.644702
Relative Absolute Error 0.295526
Relative Squared Error 0.142471
Coefficient of Determination 0.857529

Objective: To predict the bike demand in future, taking into consideration the available historical data.

Process: Supervised Learning using Regression

Label: CNT

Split: Training and Test set is 4: 1

Regression Model:

  • Linear Regression
    • Value: 142.55
    • Error: 3.95
    • Cross Validation Value: 142.6
  • Decision Tree
    • Value: 0.60
    • Error: 1.94
    • Cross Validation Value: 60.4
  • Random Forest
    • Value: 16.07
    • Error: 1.60
    • Cross Validation Value: 43.74

Feature Scaling: Min_Max

Observation: Overfitting in Decision Tree Model.

Best Estimator: Random Forest

Final Value: 41.43

This supervised learning regression model aims to predict future bike demand. The project aims to build a model that can accurately predict bike demand from the previous data provided in a csv file.
RMSE is used as the measure of model performance. The data is preprocessed to get rid of irrelevant values, and “cnt” is used as the label with the features relevant for the model. The data was divided into training and test sets, with stratified sampling on weathersit to include all seasons equally. Correlations were observed and visualized. All three models, i.e. linear regression, decision tree and random forest, were tried, and random forest was selected as the preferred model with the best RMSE value among them.

In response to Predict the bike demand in future task:

Following the checklist sequence of approaching ML projects, I used BootML to build, select and fine tune the model as below:

  1. I looked at the big picture and thought what could be factors affecting bike demand like weather, day of the week, temperature, humidity etc. and included all the numeric and categorical variables in our training model like:
    ‘season’, ‘yr’, ‘mnth’, ‘hr’, ‘holiday’, ‘weekday’, ‘workingday’, ‘weathersit’, ‘temp’, ‘atemp’, ‘hum’, ‘windspeed’, ‘casual’, ‘registered’
  2. I excluded few features which seemed not relevant such as instant, dteday.
  3. I explored the data and I confirmed my assumptions that there is a positive correlation with temperature and season and found out negative correlation with humidity
  4. I prepared the data for ML by replacing the missing values with median value, stratified the data based on cnt, performed feature scaling like MinMax and onehot encoding.
  5. I built 3 ML models, using Linear Regression, Decision Tree and Random Forest algorithms respectively. I received the following results for the 10-fold cross-validation mean RMSE and standard deviation:

Linear Regression: Mean: 2.746549874518094e-13
Standard deviation: 7.264839151676053e-14
Decision Tree: Mean: 4.963098305737335
Standard deviation: 0.4520360574099402
Random Forest: Mean: 2.720417962335797
Standard deviation: 0.5882269421315144

From these results, I concluded that the Linear Regression model didn’t perform well as it had high RMSE. The Decision Tree algorithm showed very significant difference which indicates overfitting of the model as it failed to generalize well during the cross validation. I evaluated and selected the Random Forest as the best model from the 3 in this case.
6) BootML also selected this model for fine-tuning and performed hyperparameter tuning through Grid Search, as I had selected previously.
7) The model showed that the feature importance score for prediction is as follows:

[(0.5122454630914193, ‘registered’),
(0.20204306899303562, ‘casual’),
(0.03446798497460259, ‘atemp’),
(0.022371327661383614, ‘hum’),
(0.0223399114385595, ‘temp’),
(0.015778073788105217, ‘holiday’),
(0.015391716766457888, ‘weekday’),
(0.009194736994804135, ‘season’),
(0.004040220539153114, ‘windspeed’),
(0.002167437802887816, ‘hr’),
(0.0009007268616543127, ‘mnth’),
(0.0008471652652526795, ‘yr’),
(0.0008217685569335782, ‘weathersit’),
(0.0005468359549292747, ‘workingday’)]

which indicates that the registered and casual users, which together make up the total demand, along with temperature and humidity, are the highest contributors to total demand.
8) In this case a few of the features are almost the same in nature, so we can drop those that are highly correlated with one another to build a good model without bias.
The final RMSE after tuning was 15.015772087397394
9) If this RMSE is satisfactory, the model could be deployed. Otherwise, further data can be added and the models re-run, or other algorithms can be used with the same data in an effort to build a better performing model.
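A side note on the near-zero Linear Regression error and the dominance of registered and casual in the importance scores above: another post in this thread states that cnt is the sum of registered and casual. If that holds, keeping both columns as features lets a model reproduce the label exactly. The rows below are made up for illustration (they are not taken from bikes.csv), and the check is only a sketch of how one might confirm this on the real file.

```python
import math

# Hypothetical rows shaped like the dataset; values are invented.
rows = [
    {"casual": 3, "registered": 13, "cnt": 16},
    {"casual": 8, "registered": 32, "cnt": 40},
    {"casual": 5, "registered": 27, "cnt": 32},
    {"casual": 0, "registered": 1, "cnt": 1},
]

# Does the label equal the sum of the two columns on every row?
label_is_sum = all(r["casual"] + r["registered"] == r["cnt"] for r in rows)

# A "model" that just adds the two columns then has zero error.
predictions = [r["casual"] + r["registered"] for r in rows]
squared_errors = [(p - r["cnt"]) ** 2 for p, r in zip(predictions, rows)]
leak_rmse = math.sqrt(sum(squared_errors) / len(rows))

print(label_is_sum, leak_rmse)
```

If the same check passes on the real data, the low RMSE values in this post reflect the leaked label rather than real predictive skill, which would explain why they differ so much from posts that discarded casual and registered.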

I used BootML for the project work and followed these steps.

Process: Supervised Learning using Regression

Discarded irrelevant attributes from bikes.csv are:

instant, dteday, casual and registered

The relevant attributes in bikes.csv are :

season, yr, mnth, hr, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed, cnt

Features are:

season, yr, mnth, hr, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed

Label is:

cnt

Split Train data and Test data in ratio of 80:20

Linear Regression:
RMSE: 142.59348949582764
CV Mean: 142.6810496457205
SD: 3.5551087849018335

Decision Tree:
RMSE: 0.45367266614794205
CV Mean: 60.54662580921443
SD: 2.8581256218893256

Random Forest:
RMSE: 16.182445908017026
CV Mean: 43.98548586547771
SD: 1.988039554058191

Thus the best estimator is Random Forest, and this model is selected for fine-tuning.

After fine tuning the Final RMSE (Random Forest): 40.306552845037864

Objective: Predict the bike demand in the future by creating a suitable model based on the dataset bikes.csv (Bikes_Data_1)

Used BootML.
Type of project: Supervised Learning
Type of Supervised Learning: Regression
Performance Measure: Mean Squared Error

Discarded features that are not relevant for analysis and prediction:
instant, dteday, casual, registered

Here Label is ‘cnt’ - ie the count of bike demand

Data has been split into two sets - Training and test. The split ratio is 80:20
Random seed used is 42.

Stratified sampling is not used.

All the fields are numerical fields.

Data preparation consists of data cleaning and feature scaling. Missing values are replaced with the ‘mean’, and feature scaling is done using the ‘standardization’ method

Number of Folds for cross validation is 10.

Algorithms - Linear Regression, Random Forest and Decision Tree are used to train the models
Hyper parameter fine tuning is done using GridSearch

After the above configurations, jupyter code is generated.

After the code is executed, the following results are obtained.

Linear Regression Model:
RMSE: 142.45226586527895
Mean: 142.50186756941565
Std Dev: 4.00132932767643

Decision Tree model:
RMSE: 0.5989453436724405
Mean: 60.47993567869496
Stddev: 2.2981704508600087

Random Forest Regressor Model:
RMSE: 15.983715666257496
Mean: 43.532095532879495
Stddev: 1.7073796475529766

Fine-tune the Random forest Regressor model.
Final RMSE after fine-tuning: 41.13326718948123

From the above results, we can infer the following:
For the Linear Regression model, the RMSE is very high compared to Decision Tree and Random Forest. So, we drop this model.
For the Decision Tree model, the RMSE is low (the lowest among the 3 models), but the cross-validation mean is much higher, indicating overfitting.
For the Random Forest regressor, the RMSE is higher than for Decision Tree, and the gap to the cross-validation mean still indicates overfitting, but it seems reasonable when compared to the Decision Tree model. The overfitting can be reduced by using additional data. So, we select this as the final model for fine-tuning.

The final RMSE for RandomForestRegressor model after fine-tuning is 41.13326718948123

If this performance looks okay, then this model can be deployed in production.

Otherwise, review quality of data, and use additional data and re-train and fine-tune the model for better accuracy.

Project Name - Predict the bike demand in future

Data Set - ml/machine_learning/datasets/bike_sharing at master · cloudxlab/ml · GitHub

Selected BootML for the bike project

Discarded instant, dteday

Label cnt

Training set and test set split as 80:20

Used median as the imputer

All data are numerical. So didn’t move any data to categorical

Used standardization for feature scaling

Used algorithms - Linear Regression, Decision Tree and Random Forest

Bikes_Assessment_1_ntsreejith1762 - Jupyter Notebook (cloudxlab.com)

season 0.184377

yr 0.255502

mnth 0.127409

hr 0.391871

holiday -0.034094

weekday 0.028801

workingday 0.027403

weathersit -0.144581

temp 0.403476

atemp 0.399118

hum -0.324475

windspeed 0.088802

casual 0.693962

registered 0.971979

cnt 1.000000

Humidity, weathersit and holiday have negative correlation

Temperature positive correlation
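The correlation figures above are, as far as I can tell, Pearson coefficients of each column against cnt. A minimal sketch of that calculation follows; the helper name pearson and the four-row sample columns are invented for illustration (chosen to be perfectly linear so the sign is obvious), so the outputs are not comparable to the real values listed in this post.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up columns: demand rises with temp and falls with hum.
temp = [0.2, 0.4, 0.6, 0.8]
hum = [0.9, 0.7, 0.5, 0.3]
cnt = [20, 40, 60, 80]

corr_temp = pearson(temp, cnt)  # close to +1
corr_hum = pearson(hum, cnt)    # close to -1
print(round(corr_temp, 3), round(corr_hum, 3))
```

A value near +1 or -1 means a strong linear relationship, and the sign gives the direction, which is how the positive (temp) and negative (hum, weathersit, holiday) relationships above are read.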

Linear Regression

RMSE 142.55471466066638

Mean: 142.60048335718665

Standard deviation: 3.951846872985428

Decision Tree

RMSE 0.5989453436724405

Mean: 60.389922919513765

Standard deviation: 1.8442561348658812

Random Forest

RMSE 16.070527909994066

Mean: 43.761209948247476

Standard deviation: 1.601296445040747

importance score of each attribute in GridSearchCV

(0.5940383211759476, ‘hr’),

(0.1314648370277314, ‘temp’),

(0.08115055651157858, ‘yr’),

(0.054600523654828356, ‘workingday’),

(0.038212596573476344, ‘hum’),

(0.024857598473005206, ‘season’),

(0.023493412411162557, ‘weekday’),

(0.02033753184618983, ‘mnth’),

(0.016669847446518997, ‘weathersit’),

(0.012790932124135795, ‘windspeed’),

(0.0023838427554254754, ‘holiday’)

Final RMSE

41.40554866848994

Analysis

Best performance is from the Random Forest algorithm, which was fine-tuned to get the final RMSE.

Random Forest has the lowest mean and standard deviation compared to Linear Regression and Decision Tree

Used BootML for this project.

Chose the following values:-

Type of project: Supervised Learning
Type of Supervised Learning: Regression
Performance Measure: Mean Squared Error
Dataset selection: Used existing bike dataset
Discarded fields: ‘instant’, ‘dteday’, ‘atemp’, ‘casual’, ‘registered’
Features: ‘season’, ‘yr’, ‘mnth’, ‘hr’, ‘holiday’, ‘weekday’, ‘workingday’, ‘weathersit’, ‘temp’, ‘hum’, ‘windspeed’
Label - ‘cnt’
Split Data - Used the recommended 80:20 ratio for training and test; random seed was prefilled as 42 so left it as is; used stratified sampling for weathersit to include all seasons

Fixed missing values using an imputer with the median, and selected standardization as the feature scaling technique.

Entered number of folds for cross validation as 10 and selected all 3 algorithms - Linear Regression, Random Forest and Decision Tree.

Ran the generated Jupyter notebook and tested the models using the 3 algorithms.

Objective of Bike Demand project is to predict the bike demand in the future by creating a suitable model based on the already existing data. We are using the dataset which contains the hourly rental bike demand data.

This prediction system falls under Supervised learning and regression algorithms are used to train and build the model.

The performance measure is Root Mean Square Error. The data file used is bikes.csv.

Irrelevant features like instant, dteday, atemp, casual and registered are discarded from the input file.

For Prediction for the bike demand, Cnt is selected as the label here.

Then we split the data into train:test as 80:20 with Seed = 42.

Visualize the data by checking the correlation between different features by selecting the kind of visualization, and then generate the correlations and scatter matrix.

Data preparation is done through data cleaning and feature scaling. The missing values are replaced with the median, and we selected standardization as the feature scaling technique.

Cross validation (10 folds) is carried out with the Linear Regression, Random Forest and Decision Tree algorithms.
Hyperparameters are fine-tuned by selecting the grid search.

Finally, generate the machine learning code in the Jupyter notebook.

Then run the code and analyze the model performance from the result.

The RMSE for the linear regression model is 142.55471466066638

Mean: 142.60048335718665

Standard deviation: 3.951846872985485

The RMSE for the Decision tree model is 0.5989453436724405

Mean: 60.40019917027024

Standard deviation: 1.9342512865620638

The RMSE for the Random forest model is 16.070058811417443

Mean: 43.73715385071471

Standard deviation: 1.6048984827096304

The final RMSE for the Random forest model after the fine tuning is

41.430448042559405

By comparing the various models we can see that the standard deviation is lowest for the Random Forest model, which is also the one selected here and fine-tuned; the final RMSE is 41.43.

Used BootML to build the model

  • Dropped Columns - ‘instant’, ‘dteday’, ‘year’
  • Label is cnt
    · There are no categorical fields in the dataset

Model Performance:

Linear Regression:
RMSE: 1.5917716337858432e-13
Cross validation Scores:
Mean: 2.1201973299296864e-13
Standard deviation: 6.532873981272535e-14

Decision Tree:
RMSE: 0.0
Cross validation Scores:
Mean: 5.384855596737983
Standard deviation: 0.6401427427741454

Random Forest:
RMSE: 1.0364170995450157
Cross validation Scores:
Mean: 2.7974040434406398
Standard deviation: 0.6818323846906849

Random Forest Regressor is the top performer, so we select RF for fine-tuning

After hyperparameter tuning, the final RMSE was 2.6717032775073686

The goal is to develop a model to estimate the bike demand in future given the parameters as observed in the past. The dataset contains the hourly rental bike demand data.
Steps followed :

  1. Import the data. Analyze the data.
  2. Drop the irrelevant fields.
  3. Understand the data.
  4. Split the data into train and test.
  5. Analyze the data through visualizations.
  6. Preprocess the data for modelling (Data Cleaning, Feature Scaling).
  7. Train the model.
  8. Fine-tune the model.
  9. Validate the models using a measure such as RMSE and select the best model.

Project topic: Predict the bike demand in future

To train a ML model that could predict the demand of the bike in future, a dataset was needed which was available at ‘/cxldata/datasets/bootml/Bikes_Data_1’. In this project, we used BootML to train a model that would use three different algorithms, namely Linear regression, Decision Tree and random forest for the prediction and the goal is to select the best algorithm which is more accurate. The accuracy of the algorithm will depend upon the Root Mean Square Error (RMSE). The algorithm with least RMSE will be the best one for the model.

This project can be completed using supervised regression, a type of training used to develop ML models. Why supervised regression? Because we are predicting a numeric value (the bike count) from examples whose labels are already known. After selecting the training type, we clean our data by discarding unwanted fields to reduce the time and space complexity of training. In this project, we discarded the fields ‘instant’, ‘dteday’, ‘atemp’, ‘casual’ and ‘registered’. When we use supervised learning, we need to define “Features”, which the algorithm uses to find patterns, and a “Label”, which is the field that we want to predict. Here, our features were ‘season’, ‘yr’, ‘mnth’, ‘hr’, ‘holiday’, ‘weekday’, ‘workingday’, ‘weathersit’, ‘temp’, ‘hum’, ‘windspeed’, while our label was ‘cnt’. After choosing the features and label, if the dataset is huge we need to split the data into smaller chunks so that we can manage it within the system’s RAM, ROM and processor limits; otherwise we feed the data directly into memory. The data is then divided in the ratio 80:20, where 80% of the data is used for training and the remaining 20% is used to validate the model.

Then, we plot graphs to visualize the data so that we can understand the contributing factors on a deeper level. In this project, after visualizing the data, it was found that the factors negatively correlated with bike demand were ‘holidays’, ‘weathersit’ and ‘humidity’. After this step, we use the ‘Scikit-Learn Imputer class’ to fill in missing values and clean our data. Any missing value is replaced with the median of its column. We then remove the text attributes from the dataset because the imputer works on numerical attributes only and not on categorical values. We use ‘standard_scaling’ from scikit-learn to scale our data, and all the numerical values like ‘yr’, ‘hr’, ‘temp’, ‘hum’, ‘windspeed’ are scaled.

Since our data is ready, we can feed it to the ML models and check which of the models has least RMSE. First, we train a linear regression model. After training we obtain the following results:
RMSE: 140.42543515709906
Mean: 140.66747951182725
Standard deviation: 4.073381490301536

Now we train a Decision Tree model and obtain the following results:
RMSE: 0.5989453436724405
Mean: 61.75452780128436
Standard deviation: 1.9890246206300253

Lastly, we train a Random Forest model and obtain the following results:
RMSE: 16.136436478906223
Mean: 43.85423021350333
Standard deviation: 1.6025089829841799

After observation, we find that the RMSE of the Decision Tree model is 0.5989…, which is very low compared to Random Forest (16.1364…) and Linear Regression (140.4254…). One might say that the Decision Tree model is the best one for having the lowest RMSE, but that’s not true. While the RMSE for Decision Tree is 0.5989…, its cross-validation mean is 61.7545…, which is far from the RMSE value and indicates that this model is overfitting. Hence we discard the Decision Tree model. Now we have the Linear Regression and Random Forest models to choose from. Random Forest is also overfitting, because it has a large difference between its RMSE and its mean, while Linear Regression has RMSE and mean values close to each other. Although the Linear Regression model is consistent, it has a huge RMSE, while the Random Forest model can be improved by training it with more data.

It can be concluded that although Linear Regression had close RMSE and mean values, the Random forest appears to be more promising in providing better predictions. The final RMSE for the Random Forest model was 52.01065305006697
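The reasoning above (compare each model's training RMSE with its cross-validation mean) can be laid out as a small table in code. The numbers are the ones reported in this post; the ratio used as an "overfitting gap" is just one convenient way to express the comparison and is my own assumption, not something BootML outputs.

```python
# (training RMSE, cross-validation mean RMSE) as reported in this post.
results = {
    "Linear Regression": (140.42543515709906, 140.66747951182725),
    "Decision Tree": (0.5989453436724405, 61.75452780128436),
    "Random Forest": (16.136436478906223, 43.85423021350333),
}

# Ratio of CV error to training error: ~1 means consistent,
# a large value means the model memorised the training set.
gaps = {name: cv / train for name, (train, cv) in results.items()}

# The model with the lowest cross-validation mean generalises best here.
best_by_cv = min(results, key=lambda name: results[name][1])

for name, (train, cv) in results.items():
    print(f"{name:18s} train={train:8.3f} cv={cv:8.3f} ratio={gaps[name]:7.2f}")
print("Lowest CV mean:", best_by_cv)
```

Linear Regression has a ratio close to 1 but a high error, Decision Tree has a ratio above 100, and Random Forest has the lowest cross-validation mean, which matches the conclusion drawn in the post.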

Predict the bike demand in future

Here we are required to build the model which estimates the bike demand in future given the parameters as observed in the past.
First we observe the data provided as a CSV file.
Second we try to visualize the data.
Later we train the model using

  • LinearRegression - RMSE = 142.55471466066638 and cross validation
    Scores: [141.73385995 137.17408611 146.54823227 140.01681714 139.5934433
    140.75983645 147.21587309 146.94230424 148.04977578 137.97060524]
    Mean: 142.60048335718665
    Standard deviation: 3.951846872985485
  • DecisionTreeRegressor - RMSE = 0.5989453436724405 and cross validation
    Scores: [61.83845429 61.87772316 58.57910773 58.97797571 60.57058431 59.66936478
    58.00785112 60.66513375 58.99932935 64.81646751]
    Mean: 60.40019917027024
    Standard deviation: 1.9342512865620638
  • RandomForest - RMSE = 16.070058811417443 and cross validation
    Scores: [41.39109047 44.91729434 45.62312915 43.93183516 43.34900343 41.97845154
    43.13046448 44.21152565 42.1283239 46.71042038]
    Mean: 43.73715385071471
    Standard deviation: 1.6048984827096304
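The Mean and Standard deviation lines above can be reproduced from the ten fold scores. Below is a short sketch using only the standard library. It uses the population standard deviation (pstdev), which is what matches the reported linear regression figure; that this is what the generated notebook does internally is my assumption.

```python
from statistics import fmean, pstdev

# Ten-fold cross-validation RMSE scores copied from the post above.
cv_scores = {
    "LinearRegression": [141.73385995, 137.17408611, 146.54823227, 140.01681714,
                         139.5934433, 140.75983645, 147.21587309, 146.94230424,
                         148.04977578, 137.97060524],
    "DecisionTreeRegressor": [61.83845429, 61.87772316, 58.57910773, 58.97797571,
                              60.57058431, 59.66936478, 58.00785112, 60.66513375,
                              58.99932935, 64.81646751],
    "RandomForest": [41.39109047, 44.91729434, 45.62312915, 43.93183516,
                     43.34900343, 41.97845154, 43.13046448, 44.21152565,
                     42.1283239, 46.71042038],
}

# Mean and population standard deviation per model.
summary = {name: (fmean(s), pstdev(s)) for name, s in cv_scores.items()}

for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.4f} std={std:.4f}")
```

The mean says how large the error is on unseen folds, and the standard deviation says how much that error varies from fold to fold.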

Finally we fine tune the Model using grid search
and got
Best Param - {'max_features': 8, 'n_estimators': 30}
Best estimation: Random Forest

And we receive Final RMSE= 41.430448042559405

Project Objective : Predict the bike demand in future

First I tried looking at the bigger picture by

A)Understanding the business objective and how the created solution will help the business.
B) We will be using supervised regression, as we need to predict bike demand (cnt), which is a continuous variable, and we already have the input features and the expected labels in the dataset, which means it is a supervised problem.
I figured out that I need to build different models and choose the best one based on the performance metric chosen.
C) I chose root mean squared error as the performance metric. A typical performance measure for regression problems is the Root Mean Square Error (RMSE), which is the square root of the Mean Squared Error.
The mean squared error is the average of the squared errors across all predictions.
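To make the metric concrete, here is a minimal RMSE calculation in plain Python. The function name rmse and the three sample values are made up for the example; the notebook itself would normally compute this with a library call.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the average squared prediction error."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    mse = sum(squared_errors) / len(squared_errors)  # mean squared error
    return math.sqrt(mse)

# Invented hourly counts and predictions: errors are -2, +2 and -3.
actual = [10, 20, 30]
predicted = [12, 18, 33]

example_rmse = rmse(actual, predicted)  # sqrt((4 + 4 + 9) / 3)
print(round(example_rmse, 4))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, and the result is in the same units as cnt.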

In this project, I used BootML to train a model that would use three different algorithms, namely Linear regression, Decision Tree and random forest for the prediction and the goal is to select the best algorithm which is more accurate.

  1. Import the data :

    • Load the dataset from the specified path ‘/cxldata/datasets/bootml/Bikes_Data_1’.
  2. Analyze the data :

    • Explore the dataset to understand its structure and features.
  3. Drop the irrelevant fields:

    • Discard unwanted fields such as ‘instant’, ‘dteday’, ‘atemp’, ‘casual’, ‘registered’ to reduce complexity.
  4. Understand the data

    • Identify the features and the target variable(label).
    • Here, our features were ‘season’, ‘yr’, ‘mnth’, ‘hr’, ‘holiday’, ‘weekday’, ‘workingday’, ‘weathersit’,
      ‘temp’, ‘hum’, ‘windspeed’ while our label was ‘cnt’.
    • Understand the distribution and relationship between variables.
  5. Split the data into train and test

    • Divide the dataset into training and testing sets, typically using an 80:20 ratio.
    • I stratified the data based on weathersit.
  6. Analyze the data through visualizations

    • Plot graphs to visualize the relationships between different variables to gain deeper understanding
      of the cofactors.
    • Identify any patterns or trends in the data.
    • After visualizing the data, it was found that the negative cofactors affecting the bike demand were
      ‘holidays’, ‘weathersit’ and ‘humidity’.
  7. Data Cleaning & Preprocessing

    • Clean the data by handling missing values using techniques like imputation.
    • Use the ‘Scikit-Learn Imputer class’ to fill in missing values. Any missing value is replaced with the mean, median or zero. If there are too many missing values, we delete the specific rows or the entire column.
    • We then remove the text features from the dataset because the class imputer works on numerical attributes only and not categorical features.
    • I selected “Median” as the imputer.
    • Scale the numerical features using standard feature scaling to bring them to a similar scale. All the numerical values like ‘yr’, ‘hr’, ‘temp’, ‘hum’, ‘windspeed’ are scaled.

Our Data is ready to be fed to the ML Algorithms. We will feed the data to different models and figure out which model scores best in terms of chosen performance metric.

  8. Train the model
    • Train machine learning models using different algorithms like Linear Regression, Decision Tree, and Random Forest.

After training different ML Models we obtain the following results:

Linear regression model.
RMSE: 142.72311067462795
Mean: 142.77813742724223
Standard deviation: 3.7030212904153004

Decision tree model :
RMSE: 0.5989453436724405
Mean: 59.78945853931843
Standard deviation: 3.251904916375031

Random Forest model :
RMSE: 15.984077447144026
Mean: 43.49698996838114
Standard deviation: 2.4391843492444463

  9. Fine-tune the model
    • Tune hyperparameters of the models using techniques like GridSearchCV or RandomizedSearchCV to improve performance.

The best hyperparameter combinations

grid_search.best_params_ : {'max_features': 8, 'n_estimators': 30}
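For readers new to grid search, the idea behind the best_params_ value above is simply "try every combination and keep the one with the lowest cross-validated error". The sketch below shows that loop in plain Python. The grid values and the scoring function toy_cv_rmse are hypothetical stand-ins (a made-up formula, not a trained model), so only the mechanics carry over to the real GridSearchCV run.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination; return the one with the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid; the grid BootML actually searched is not shown in the post.
param_grid = {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]}

def toy_cv_rmse(params):
    # Stand-in for "train with these params and return the CV RMSE".
    # This formula is invented purely so the example runs.
    return 100.0 / params["n_estimators"] + abs(8 - params["max_features"])

best_params, best_score = grid_search(param_grid, toy_cv_rmse)
print(best_params, round(best_score, 3))
```

With 3 x 4 values the loop evaluates 12 combinations; in the real run each evaluation is a full cross-validation, which is why grid search gets slow as the grid grows.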

Importance score of each attribute:
[(0.5844748088424341, ‘hr’),
(0.12946274868458668, ‘temp’),
(0.07775038801366478, ‘yr’),
(0.05910890006087216, ‘workingday’),
(0.039755394080211844, ‘hum’),
(0.028163876532795794, ‘season’),
(0.027393454181294748, ‘weekday’),
(0.01926425477094257, ‘mnth’),
(0.01890966484897769, ‘weathersit’),
(0.013190030433310636, ‘windspeed’),
(0.0025264795509088818, ‘holiday’)]

This indicates that the hr and temp are the highest contributors to affecting demand.
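The ranking behind that statement can be checked by sorting the scores and accumulating them. The values below are the importance scores listed in this post; the cumulative-share view is an extra I added for illustration.

```python
# (importance, feature) pairs as listed above.
importances = [
    (0.5844748088424341, "hr"),
    (0.12946274868458668, "temp"),
    (0.07775038801366478, "yr"),
    (0.05910890006087216, "workingday"),
    (0.039755394080211844, "hum"),
    (0.028163876532795794, "season"),
    (0.027393454181294748, "weekday"),
    (0.01926425477094257, "mnth"),
    (0.01890966484897769, "weathersit"),
    (0.013190030433310636, "windspeed"),
    (0.0025264795509088818, "holiday"),
]

# Rank from most to least important and track the running total.
ranked = sorted(importances, reverse=True)
cumulative = []
running = 0.0
for score, name in ranked:
    running += score
    cumulative.append((name, running))

for name, share in cumulative:
    print(f"{name:11s} cumulative share = {share:.3f}")
```

The scores sum to 1, and hr and temp together account for roughly 71% of the total importance.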

Upon analysis, it’s clear that while the Decision Tree model initially appears favorable due to its low RMSE, further examination reveals significant overfitting indicated by the large disparity between its RMSE and mean values. Consequently, we discard this model.

Turning to the Linear Regression and Random Forest models, while the former exhibits a closer alignment between RMSE and mean values, it suffers from a high RMSE. In contrast, although the Random Forest model displays overfitting, its potential for improvement through additional data training makes it a more promising option.

In summary, despite the Linear Regression model’s closer alignment between RMSE and mean values, the Random Forest model holds greater potential for delivering better predictions.

Best Estimator : Random Forest

  10. Evaluating the model on Test Set
    • Finally I validated the model using test data to ensure its generalization ability.

Final RMSE : 41.96929332427234

We employed Linear Regression, Decision Tree, and Random Forest algorithms to predict bike demand. After evaluating their performance, the following insights were gathered:

  1. Linear Regression Model:
  • RMSE: 142.45227
  • Mean: 142.50187
  • Std Dev: 4.00133
  The RMSE is notably high compared to the other models, indicating lower predictive accuracy. Therefore, this model was disregarded for further consideration.
  2. Decision Tree Model:
  • RMSE: 0.59895
  • Mean: 60.47994
  • Stddev: 2.29817
  While the RMSE is the lowest among the three models, the cross-validation mean is relatively high, suggesting potential overfitting issues. Hence, the suitability of this model for deployment in production is questioned.
  3. Random Forest Regressor Model:
  • RMSE: 15.98372
  • Mean: 43.53210
  • Stddev: 1.70738
  Although the RMSE is higher compared to the Decision Tree model, the variation in cross-validation mean, albeit indicative of overfitting, appears reasonable. Additionally, the overfitting can be mitigated by incorporating additional data. Hence, this model was selected for fine-tuning.

After fine-tuning the Random Forest Regressor model, the final RMSE was 41.13327. If this level of performance is acceptable, the model can proceed to deployment in production. However, if higher accuracy is desired, it is recommended to reassess the quality of the data, consider incorporating additional data, and retrain the model for improved performance.