Predict the bike demand in future

The goal is to develop a model that estimates future bike demand from parameters observed in the past. The dataset contains hourly bike rental demand data.
Steps followed:

  1. Import the data. Analyze the data.
  2. Drop the irrelevant fields.
  3. Understand the data.
  4. Split the data into train and test.
  5. Analyze the data through visualizations.
  6. Preprocess the data for modelling (Data Cleaning, Feature Scaling).
  7. Train the model.
  8. Fine-tune the model.
  9. Validate the models using a metric such as RMSE and select the best model.

Project topic: Predict the bike demand in future

To train an ML model that predicts future bike demand, we used the dataset available at ‘/cxldata/datasets/bootml/Bikes_Data_1’. In this project, we used BootML to train models with three different algorithms, namely Linear Regression, Decision Tree and Random Forest, with the goal of selecting the most accurate one. Accuracy is measured by the Root Mean Square Error (RMSE): the algorithm with the lowest RMSE is considered the best one for the model.

This project can be framed as Supervised Regression, a type of training used to develop ML models. Why Supervised Regression? Because the dataset already contains the value we want to predict (supervised), and that value is a continuous number (regression). After selecting the training type, we clean the data by discarding unwanted fields to reduce the time and space complexity of training. In this project, we discarded the fields ‘instant’, ‘dteday’, ‘atemp’, ‘casual’ and ‘registered’. In supervised learning, we need to define the “Features”, which the algorithm uses to find patterns, and the “Label”, which is the field we want to predict. Here, our features were ‘season’, ‘yr’, ‘mnth’, ‘hr’, ‘holiday’, ‘weekday’, ‘workingday’, ‘weathersit’, ‘temp’, ‘hum’ and ‘windspeed’, while our label was ‘cnt’. After defining the features and label, if the dataset is huge, we need to split it into smaller chunks that fit within the system’s memory and processing capacity; otherwise we can load the data into memory directly. The data is then split in an 80:20 ratio, where 80% of the data is used for training and the remaining 20% is used to validate the model.
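As a rough illustration of the field selection and the 80:20 split described above, here is a minimal sketch using pandas and Scikit-Learn. The DataFrame is a small synthetic stand-in with made-up values (not the real Bikes_Data_1 file), only a few of the feature columns are included, and the random seed 42 is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the hourly bikes data; all values are made up.
rng = np.random.default_rng(42)
n = 200
bikes = pd.DataFrame({
    "instant": np.arange(1, n + 1),
    "dteday": "2011-01-01",
    "season": rng.integers(1, 5, n),
    "hr": rng.integers(0, 24, n),
    "temp": rng.random(n),
    "atemp": rng.random(n),
    "hum": rng.random(n),
    "casual": rng.integers(0, 50, n),
    "registered": rng.integers(0, 200, n),
})
bikes["cnt"] = bikes["casual"] + bikes["registered"]

# Drop the fields the post discards, then separate the features from the label.
discard = ["instant", "dteday", "atemp", "casual", "registered"]
data = bikes.drop(columns=discard)
X = data.drop(columns=["cnt"])
y = data["cnt"]

# 80:20 split; the seed only makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

With the real dataset, the same two drop calls would leave the eleven features listed above in X and ‘cnt’ in y.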

Then, we plot graphs to visualize the data and understand more deeply how the features relate to demand. In this project, visualizing the data showed that the features negatively correlated with bike demand were ‘holiday’, ‘weathersit’ and ‘hum’ (humidity). After this step, we use the Imputer class from Scikit-Learn to fill in missing values: any missing value is replaced with the median of its column. We remove the text attributes from the dataset first, because the imputer works only on numerical attributes, not categorical ones. Finally, we use standard scaling (standardization) from Scikit-Learn to scale the numerical attributes such as ‘yr’, ‘hr’, ‘temp’, ‘hum’ and ‘windspeed’.
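The cleaning and scaling steps can be sketched roughly as follows. This assumes a recent Scikit-Learn release, where median imputation is provided by SimpleImputer (older releases exposed an Imputer class instead); the tiny array is made-up data, not the bikes dataset, and the pipeline is only one plausible way to chain the two steps.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numerical data with two missing values (np.nan).
X = np.array([
    [0.2, 0.8],
    [0.4, np.nan],
    [0.6, 0.5],
    [np.nan, 0.2],
])

# Step 1 fills each missing value with its column median,
# step 2 rescales every column to zero mean and unit variance.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
X_prepared = num_pipeline.fit_transform(X)

print(num_pipeline.named_steps["imputer"].statistics_)  # column medians used
print(X_prepared)
```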

Since our data is ready, we can feed it to the ML models and check which model has the lowest RMSE. First, we train a Linear Regression model. After training, we obtain the following results (Mean and Standard deviation refer to the cross-validation RMSE scores):
RMSE: 140.42543515709906
Mean: 140.66747951182725
Standard deviation: 4.073381490301536

Now we train a Decision Tree model and obtain the following results:
RMSE: 0.5989453436724405
Mean: 61.75452780128436
Standard deviation: 1.9890246206300253

Lastly, we train a Random Forest model and obtain the following results:
RMSE: 16.136436478906223
Mean: 43.85423021350333
Standard deviation: 1.6025089829841799

Comparing the results, the RMSE of the Decision Tree model (0.5989…) is far lower than that of Random Forest (16.1364…) and Linear Regression (140.4254…). One might conclude that the Decision Tree model is the best because it has the lowest RMSE, but that is not the case. Its RMSE is 0.5989… while its cross-validation mean is 61.7545…; such a large gap indicates that the model is overfitting, so we discard the Decision Tree model. That leaves the Linear Regression and Random Forest models. Random Forest also overfits, since there is a large gap between its RMSE (16.1364…) and its mean (43.8542…), whereas Linear Regression has RMSE and mean values close to each other. However, although the Linear Regression model generalizes consistently, its RMSE is very high, while the Random Forest model can be improved by tuning it and training it with more data.
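To make the overfitting argument concrete, here is a small sketch that compares a Decision Tree's RMSE on its own training data with its 10-fold cross-validation RMSE. It assumes the "Mean" and "Standard deviation" above are computed over cross-validation RMSE scores; the data is synthetic (a noisy daily cycle over the hour of day), so the numbers will not match the ones reported here.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: demand follows a daily cycle plus noise (made-up values).
rng = np.random.default_rng(0)
X = rng.uniform(0, 24, size=(500, 1))
y = 100 * np.sin(X[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 30, size=500)

# RMSE on the data the tree was trained on: an unpruned tree memorizes it.
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X, y)
train_rmse = np.sqrt(mean_squared_error(y, tree.predict(X)))

# RMSE on held-out folds; the scorer returns negative MSE, hence the sign flip.
scores = cross_val_score(tree, X, y, scoring="neg_mean_squared_error", cv=10)
cv_rmse = np.sqrt(-scores)

print("Training RMSE:", train_rmse)
print("CV mean:", cv_rmse.mean())
print("CV standard deviation:", cv_rmse.std())
```

A training RMSE near zero next to a much larger cross-validation mean is the same pattern as the Decision Tree results above.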

It can be concluded that although Linear Regression had close RMSE and mean values, Random Forest appears more promising for providing better predictions. The final RMSE for the Random Forest model was 52.01065305006697.


Predict the bike demand in future

Here we are required to build a model that estimates future bike demand from parameters observed in the past.
First, we observe the data provided as a CSV file.
Second, we visualize the data.
Then we train the model using:

  • LinearRegression: RMSE = 142.55471466066638, and cross validation
    Scores: [141.73385995 137.17408611 146.54823227 140.01681714 139.5934433
    140.75983645 147.21587309 146.94230424 148.04977578 137.97060524]
    Mean: 142.60048335718665
    Standard deviation: 3.951846872985485
  • DecisionTreeRegressor: RMSE = 0.5989453436724405, and cross validation
    Scores: [61.83845429 61.87772316 58.57910773 58.97797571 60.57058431 59.66936478
    58.00785112 60.66513375 58.99932935 64.81646751]
    Mean: 60.40019917027024
    Standard deviation: 1.9342512865620638
  • RandomForest: RMSE = 16.070058811417443, and cross validation
    Scores: [41.39109047 44.91729434 45.62312915 43.93183516 43.34900343 41.97845154
    43.13046448 44.21152565 42.1283239 46.71042038]
    Mean: 43.73715385071471
    Standard deviation: 1.6048984827096304

Finally, we fine-tuned the model using grid search and got:
Best params - {‘max_features’: 8, ‘n_estimators’: 30}
Best estimator: Random Forest

The final RMSE was 41.430448042559405
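For readers who have not seen grid search before, the fine-tuning step can be sketched roughly like this with Scikit-Learn's GridSearchCV. The data is synthetic and the parameter grid is an assumption chosen to include the best parameters reported above; the grid that BootML actually generates may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data with 8 features (made-up values, not the bikes dataset).
rng = np.random.default_rng(42)
X = rng.random((300, 8))
y = 200 * X[:, 0] + 50 * X[:, 1] + rng.normal(0, 10, size=300)

# Hypothetical grid: every combination is trained and scored with 5-fold CV.
param_grid = {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
grid_search.fit(X, y)

# best_score_ is a negative MSE, so negate it before taking the square root.
best_cv_rmse = np.sqrt(-grid_search.best_score_)
print("Best params:", grid_search.best_params_)
print("Best CV RMSE:", best_cv_rmse)
```

The final RMSE would then be measured by predicting on the held-out test set with grid_search.best_estimator_.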

Project Objective : Predict the bike demand in future

First, I looked at the bigger picture:

A) Understanding the business objective and how the solution will help the business.
B) We will be using Supervised Regression: we need to predict bike demand (cnt), which is a continuous variable, and the dataset already contains the input features and the expected labels, which makes it a supervised problem.
I figured out that I need to build different models and choose the best one based on the chosen performance metric.
C) I chose Root Mean Square Error (RMSE) as the performance metric. RMSE is a typical performance measure for regression problems and is the square root of the Mean Squared Error (MSE).
The MSE is the average of the squared errors over all predictions.
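As a small worked example of the metric, the sketch below computes RMSE by hand: square each error, average the squares (MSE), then take the square root. The actual and predicted counts are made-up numbers for illustration only.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean of the squared errors."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    mse = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mse)

# Made-up hourly counts and predictions.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 40]

# Errors are -2, 2, -3, 0; squared errors sum to 17, so MSE = 17 / 4 = 4.25.
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.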

In this project, I used BootML to train models with three different algorithms, namely Linear Regression, Decision Tree and Random Forest, with the goal of selecting the most accurate one.

  1. Import the data :

    • Load the dataset from the specified path ‘/cxldata/datasets/bootml/Bikes_Data_1’.
  2. Analyze the data :

    • Explore the dataset to understand its structure and features.
  3. Drop the irrelevant fields:

    • Discard unwanted fields such as ‘instant’, ‘dteday’, ‘atemp’, ‘casual’, ‘registered’ to reduce complexity.
  4. Understand the data

    • Identify the features and the target variable(label).
    • Here, our features were ‘season’, ‘yr’, ‘mnth’, ‘hr’, ‘holiday’, ‘weekday’, ‘workingday’, ‘weathersit’,
      ‘temp’, ‘hum’, ‘windspeed’ while our label was ‘cnt’.
    • Understand the distribution and relationship between variables.
  5. Split the data into train and test

    • Divide the dataset into training and testing sets, typically using an 80:20 ratio.
    • I stratified the data based on weathersit.
  6. Analyze the data through visualizations

    • Plot graphs to visualize the relationships between different variables and gain a deeper understanding
      of how the features relate to demand.
    • Identify any patterns or trends in the data.
    • After visualizing the data, it was found that the features negatively correlated with bike demand were
      ‘holiday’, ‘weathersit’ and ‘hum’ (humidity).
  7. Data Cleaning & Preprocessing

    • Clean the data by handling missing values using techniques like imputation.
    • Use the Imputer class from Scikit-Learn to fill in missing values, substituting each missing value with the mean, the median or zero. If there are too many missing values, we delete the specific rows or the entire column.
    • We then remove the text features from the dataset, because the imputer works only on numerical features, not categorical ones.
    • I selected “Median” as the imputer strategy.
    • Scale the numerical features using standard feature scaling to bring them to a similar scale. All the numerical features such as ‘yr’, ‘hr’, ‘temp’, ‘hum’ and ‘windspeed’ are scaled.

Our Data is ready to be fed to the ML Algorithms. We will feed the data to different models and figure out which model scores best in terms of chosen performance metric.

  8. Train the model
    • Train machine learning models using different algorithms like Linear Regression, Decision Tree, and Random Forest.

After training different ML Models we obtain the following results:

Linear regression model.
RMSE: 142.72311067462795
Mean: 142.77813742724223
Standard deviation: 3.7030212904153004

Decision tree model :
RMSE: 0.5989453436724405
Mean: 59.78945853931843
Standard deviation: 3.251904916375031

Random Forest model :
RMSE: 15.984077447144026
Mean: 43.49698996838114
Standard deviation: 2.4391843492444463

  9. Fine-tune the model
    • Tune hyperparameters of the models using techniques like GridSearchCV or RandomizedSearchCV to improve performance.

The best hyperparameter combinations

grid_search.best_params_ :{‘max_features’: 8, ‘n_estimators’: 30}

Importance score of each attribute:
[(0.5844748088424341, ‘hr’),
(0.12946274868458668, ‘temp’),
(0.07775038801366478, ‘yr’),
(0.05910890006087216, ‘workingday’),
(0.039755394080211844, ‘hum’),
(0.028163876532795794, ‘season’),
(0.027393454181294748, ‘weekday’),
(0.01926425477094257, ‘mnth’),
(0.01890966484897769, ‘weathersit’),
(0.013190030433310636, ‘windspeed’),
(0.0025264795509088818, ‘holiday’)]

This indicates that hr and temp are the biggest contributors to predicting demand.
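A ranking like the one above can be produced from a fitted Random Forest's feature_importances_ attribute. The sketch below uses synthetic data in which "hr" is deliberately given the strongest effect; the three column names are borrowed from the dataset for readability, but the values and coefficients are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: column 0 ("hr") drives the label most, column 2 not at all.
rng = np.random.default_rng(42)
attributes = ["hr", "temp", "windspeed"]
X = rng.random((400, 3))
y = 300 * X[:, 0] + 100 * X[:, 1] + rng.normal(0, 5, size=400)

forest = RandomForestRegressor(n_estimators=30, random_state=42).fit(X, y)

# Pair each importance with its attribute name and sort from high to low.
ranked = sorted(zip(forest.feature_importances_, attributes), reverse=True)
for importance, name in ranked:
    print(f"{name}: {importance:.4f}")
```

The importances sum to 1, so each score can be read as that feature's share of the forest's total impurity reduction.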

Upon analysis, it’s clear that while the Decision Tree model initially appears favorable due to its low RMSE, further examination reveals significant overfitting indicated by the large disparity between its RMSE and mean values. Consequently, we discard this model.

Turning to the Linear Regression and Random Forest models, while the former exhibits a closer alignment between RMSE and mean values, it suffers from a high RMSE. In contrast, although the Random Forest model displays overfitting, its potential for improvement through additional data training makes it a more promising option.

In summary, despite the Linear Regression model’s closer alignment between RMSE and mean values, the Random Forest model holds greater potential for delivering better predictions.

Best Estimator : Random Forest

  10. Evaluating the model on Test Set
    • Finally I validated the model using test data to ensure its generalization ability.

Final RMSE : 41.96929332427234

Objective: Predict the bike demand in the future by creating a suitable model based on the dataset bikes.csv (Bikes_Data_1)

Used BootML.
Type of project: Supervised Learning
Type of Supervised Learning: Regression
Performance Measure: Mean Squared Error

Discarded features that are not relevant for analysis and prediction:
instant, dteday, casual, registered

Here Label is ‘cnt’ - ie the count of bike demand

Data has been split into two sets - Training and test. The split ratio is 80:20
Random seed used is 42.

Stratified sampling is not used.

All the fields are numerical fields.

Data preparation consists of data cleaning and feature scaling. Missing values are replaced with the ‘mean’, and feature scaling is done using the ‘standardization’ method.

Number of Folds for cross validation is 10.

Algorithms - Linear Regression, Random Forest and Decision Tree are used to train the models
Hyper parameter fine tuning is done using GridSearch

After the above configurations, jupyter code is generated.

After the code is executed, the following results are obtained.

We employed Linear Regression, Decision Tree, and Random Forest algorithms to predict bike demand. After evaluating their performance, the following insights were gathered:

  1. Linear Regression Model:
  • RMSE: 142.45227
  • Mean: 142.50187
  • Std Dev: 4.00133
  The RMSE is notably high compared to the other models, indicating lower predictive accuracy. Therefore, this model was disregarded for further consideration.
  2. Decision Tree Model:
  • RMSE: 0.59895
  • Mean: 60.47994
  • Std Dev: 2.29817
  While the RMSE is the lowest among the three models, the cross-validation mean is relatively high, suggesting potential overfitting issues. Hence, the suitability of this model for deployment in production is questioned.
  3. Random Forest Regressor Model:
  • RMSE: 15.98372
  • Mean: 43.53210
  • Std Dev: 1.70738
  Although the RMSE is higher compared to the Decision Tree model, the variation in cross-validation mean, albeit indicative of overfitting, appears reasonable. Additionally, the overfitting can be mitigated by incorporating additional data. Hence, this model was selected for fine-tuning.

After fine-tuning the Random Forest Regressor model, the final RMSE improved to 41.13327. If this level of performance is acceptable, the model can proceed to deployment in production. However, if higher accuracy is desired, it is recommended to reassess the quality of the data, consider incorporating additional data, and retrain the model for improved performance.

Data Set - ml/machine_learning/datasets/bike_sharing at master · cloudxlab/ml · GitHub
Selected BootML for this Project

Discarded instant, dteday

Label was (cnt)

The data was split into training and test sets in an 80:20 ratio (80% for training and 20% for testing).

Used Median as the imputer

Moved nothing to the Categorical Data.

Used algorithms - Linear Regression, Decision Tree and Random Forest

Linear Regression

RMSE - 142.55471466066638

Mean - 142.60048335718665

Standard Deviation - 3.951846872985428

Decision Tree

RMSE 0.5989453436724405

Mean: 60.389922919513765

Standard deviation: 1.8442561348658812

Random Forest

RMSE 16.070527909994066

Mean: 43.761209948247476

Standard deviation: 1.601296445040747

importance score of each attribute in GridSearchCV

What I analysed:

Best performance is from the Random Forest algorithm, which was fine-tuned to get the final RMSE.

Random Forest has the lowest mean and standard deviation compared to Linear Regression and Decision Tree.

  • Objective:
    • The goal is to predict bike rental demand (cnt) using historical data and evaluate the model’s performance using RMSE.
  • Dataset:
    • Used bikes.csv from the BOOTML processor.
    • Features included weather conditions, temporal data, and user behavior metrics.
  • Data Cleaning:
    • Dropped Index and Date columns as they did not add predictive value.
    • Checked for missing values and applied median imputation.
  • Feature Engineering:
    • Retained all other columns as numerical features.
    • No additional transformations or encoding were performed for categorical variables.
  • Target Variable:
    • Chose cnt (total rental count) as the label since it represents the desired prediction.
  • Data Splitting:
    • Split the dataset into 80% training and 20% testing data.
    • Used a random seed (42) for reproducibility.
  • Preprocessing:
    • Standardized all numerical features using StandardScaler.
  • Model Selection:
    • Evaluated the following algorithms:
      • Linear Regression: Baseline model to test linear relationships.
      • Decision Tree: To capture non-linear patterns.
      • Random Forest: To address overfitting issues with Decision Trees.
    • Fine-tuned Random Forest using GridSearchCV.
  • Performance Metric:
    • Selected RMSE as the primary metric for evaluating prediction accuracy.

The results showed overfitting for most models, so Random Forest was used to balance this.

  • Results:
    • Final Model: Random Forest with fine-tuned parameters.
    • Test RMSE: 3.71.

For the exercise to predict bike usage, I used BootML with the existing Bikes Assessment Project.

Supervised learning using regression, measured by Mean Squared Error.

Dataset used: “Bikes Data” in the repository. Dataset file: bikes.csv

Discarded columns by default in existing project were:

Instant

Dteday

Atemp

Casual

Registered

The label, i.e. the data we want to predict, is “cnt”

Training and test sets split 80:20, with 42 as the random seed

No categorical fields were identified among the remaining columns of the data

Feature scaling using Standardization

Correlations:

temp (0.4) and hr (0.39) have the highest positive correlations with cnt, while hum (-0.32) and weathersit (-0.14) have the highest negative correlations.
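Correlations like these are usually read off a pandas correlation matrix. The following is a minimal sketch on synthetic data in which temp is built to push cnt up and hum to pull it down; the coefficients are made up, so the printed values will differ from the ones quoted above.

```python
import numpy as np
import pandas as pd

# Synthetic data: cnt rises with temp and falls with hum (made-up coefficients).
rng = np.random.default_rng(42)
n = 500
temp = rng.random(n)
hum = rng.random(n)
cnt = 300 * temp - 200 * hum + rng.normal(0, 50, size=n)
bikes = pd.DataFrame({"temp": temp, "hum": hum, "cnt": cnt})

# Pearson correlation of every column with the label, highest first.
corr_with_cnt = bikes.corr()["cnt"].sort_values(ascending=False)
print(corr_with_cnt)
```

Note that a correlation coefficient only captures linear relationships, which may be why a feature such as hr can rank higher in a Random Forest's importance scores than its correlation suggests.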

Models:

Linear Regression Model

RMSE 142.55

Cross Validation

Mean: 142.60048335718665

Standard deviation: 3.951846872985428

Decision Tree

RMSE:0.59

Cross Validation

Mean: 60.389922919513765

Standard deviation: 1.8442561348658812

Random Forests

RMSE: 16.07

Cross Validation

Mean: 43.761209948247476

Standard deviation: 1.601296445040747

Based on these results, Random Forest seems to be the best model: it has the lowest cross-validation mean RMSE of the three models, plus a low standard deviation.

In response to Predict the bike demand in future task:

Following the checklist sequence of approaching ML projects, I used BootML to build, select and fine tune the model as below:

  1. I looked at the big picture and thought about which factors could affect bike demand, and concluded that climate conditions such as temperature and rain could, as well as price, traffic and road safety.
  2. I gathered the data and saw that it has features which seemed relevant, such as season and temperature, and also fields which seemed irrelevant, such as the record index.
  3. I explored the data and confirmed my assumption that there is a positive correlation with temperature and season, and found a negative correlation with humidity.
  4. I prepared the data for ML by replacing the missing values with median value, stratified the data based on season and performed feature scaling.
  5. I built 3 ML models, using Linear Regression, Decision Tree and Random Forest algorithms respectively. I received the following results for RMSE and cross-validation mean RMSE:

Linear Regression: RMSE 142; Cross Valid RMSE: Mean 142
Decision Tree: RMSE 0.59; Cross Valid RMSE: Mean 60
Random Forest: RMSE 16; Cross Valid RMSE: Mean 43.8

From these results, I concluded that the Linear Regression model didn’t perform well as it had high RMSE for both the initial set and the cross validation folds. The Decision Tree algorithm showed very significant difference which indicates overfitting of the model as it failed to generalize well during the cross validation. I evaluated and selected the Random Forest as the best model from the 3 in this case.
  6. BootML also selected this model for fine-tuning and performed hyperparameter tuning through Grid Search, as I had selected previously.
  7. The model showed that the feature importance scores for prediction are as follows:

See the importance score of each attribute in GridSearchCV

feature_importances = grid_search.best_estimator_.feature_importances_
sorted(zip(feature_importances, attributes), reverse=True)

[(0.5940383211759476, ‘hr’),
(0.1314648370277314, ‘temp’),
(0.08115055651157858, ‘yr’),
(0.054600523654828356, ‘workingday’),
(0.038212596573476344, ‘hum’),
(0.024857598473005206, ‘season’),
(0.023493412411162557, ‘weekday’),
(0.02033753184618983, ‘mnth’),
(0.016669847446518997, ‘weathersit’),
(0.012790932124135795, ‘windspeed’),
(0.0023838427554254754, ‘holiday’)]

which indicates that hr and temp are the biggest contributors to predicting demand.
The final RMSE after tuning was 41.43.

  8. If this RMSE is satisfactory, the model could be deployed. Otherwise, further data can be added and the models re-run, or other algorithms can be tried with the same data in an effort to build a better-performing model.

RMSE in Linear Regression Model - 140.43520202989316
Cross Validation for Linear Regression Mean: 180287643610.20673

RMSE in Decision Tree model - 0.453
Cross Validation for Decision Tree Model Mean: 61.3380

RMSE in Random Forest model - 16.234
Cross Validation in Random Forest model Mean: 44.18

final_rmse 51.509

Best Model: Random Forest model

After training the model, we evaluated its performance using RMSE (Root Mean Squared Error). The RMSE for our Linear Regression model was calculated, highlighting its predictive accuracy. Key learnings from this project include effective data preprocessing, feature selection, and handling categorical and numerical attributes using pipelines. We also explored feature scaling techniques such as standardization. Cross-validation helped assess model reliability. Insights from correlation analysis and scatter plots guided feature engineering. Future improvements could include hyperparameter tuning and testing alternative models like Decision Trees or Random Forest for better accuracy. This hands-on approach strengthened our understanding of end-to-end ML workflows.

Correlation:
season 0.180080
yr 0.249280
mnth 0.124786
hr 0.395162
holiday -0.031149
weekday 0.027396
workingday 0.031818
weathersit -0.140356
temp 0.400696
hum -0.320436
windspeed 0.087335
cnt 1.000000

Linear Regression:
RMSE: 140.69
CV RMSE: 140.92

Decision Tree:
RMSE: 0.598
CV RMSE: 60.30

Random Forest:
RMSE: 15.964
CV RMSE: 43.51

Final RMSE: 49.57

Features:
yr, mnth, hr, holiday, weekday, workingday, weathersit, temp, hum, windspeed

Label:
cnt

Used stratified sampling on season
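For anyone wondering what stratifying on season does, here is a minimal sketch with Scikit-Learn's train_test_split. The DataFrame is synthetic with deliberately uneven season frequencies, and the 80:20 ratio and seed 42 are assumptions; the point is that the test set keeps roughly the same season proportions as the full data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data with uneven season frequencies (made-up values).
rng = np.random.default_rng(42)
n = 1000
bikes = pd.DataFrame({
    "season": rng.choice([1, 2, 3, 4], size=n, p=[0.1, 0.2, 0.3, 0.4]),
    "temp": rng.random(n),
    "cnt": rng.integers(0, 500, size=n),
})

# stratify= makes each split mirror the season proportions of the full data.
train_set, test_set = train_test_split(
    bikes, test_size=0.2, random_state=42, stratify=bikes["season"]
)

overall = bikes["season"].value_counts(normalize=True).sort_index()
in_test = test_set["season"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "test": in_test}))
```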

Used Standardization for feature scaling

Selected Linear Regression, random forest, Decision Tree algorithms

RMSE:
Decision Tree - 0.5989453436724405
Random Forest Model - 16.136436478906223
Linear Regression Model - 140.42528931467888
Cross Validation Mean RMSE:
Decision Tree - 61.75452780128436
Random Forest Model - 43.85423021350333
Linear Regression Model - 140.66534762287745

Random forest is chosen as best model
Final RMSE : 52.01065305006697

Aim is to predict the bike demand in the future

BootML was used to build the model.

Observation:

Linear Regression:

RMSE: 749.263

Cross Validation Mean: 791.954

Cross Validation SD: 115.856

Decision Tree:

RMSE: 0.0

Cross Validation Mean: 967.165

Cross Validation SD: 195.063

Random Forest:

RMSE: 259.807

Cross Validation Mean: 724.681

Cross Validation SD: 150.572

Final RMSE : 719.986

Conclusion

    1. Decision tree is overfitting.
    2. Based on the RMSE value, random forest is the best fit. After fine tuning, the final RMSE is 719.986.

Objective :: Predict the bike demand in the future

I used BootML to build, select and fine-tune the model, using the day.csv file for training. The details of the activity are highlighted below:

    1. Discarded the following columns: instant, dteday, atemp, casual & registered
    2. Used cnt as the label and the rest as features
    3. The data was split into training and test sets in the ratio 80:20
    4. Kept the following fields as categorical fields: season, yr, mnth, holiday, weekday, workingday & weathersit
    5. Used feature scaling on the features temp, hum & windspeed, using the standardization method
    6. The selected algorithms are Linear Regression, Random Forest & Decision Tree
    7. Hyper-parameter tuning code using grid search was enabled

Observation:

Linear Regression:

RMSE: 749.688

Cross Validation Mean: 788.058

Cross Validation SD: 119.255

Decision Tree:

RMSE: 0.0

Cross Validation Mean: 999

Cross Validation SD: 193.095

Random Forest:

RMSE: 259.848

Cross Validation Mean: 730.428

Cross Validation SD: 161.052

Final RMSE : 687.968

Findings:

    1. Random Forest has a lower RMSE than Linear Regression
    2. Cross-validation mean for Random Forest is lower than Linear Regression, suggesting better generalization
    3. Decision tree is overfitting
    4. Based on the RMSE, the Random Forest model is the better model and was chosen for fine tuning, with a final RMSE of 687.968

Bike Demand Prediction – Project used Bootml

  1. Objective
    To build a machine learning model that predicts the number of rented bikes (cnt) for a given day using historical features such as weather conditions, date-related parameters, and other contextual data.

  2. Dataset Overview
    The input is a .csv file which includes both numerical and categorical features. The target variable is:
    • cnt: total count of rental bikes, including both casual and registered users.
    You can drop or ignore:
    • instant (just an index)
    • dteday (used to derive date parts if needed)
    • casual and registered (they sum up to cnt, so they leak the target)
    • atemp
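The leakage point about casual and registered can be demonstrated with a small sketch. It uses synthetic data in which cnt is built as casual + registered (as stated above for the real dataset), and compares a Linear Regression that sees those two columns with one that only sees temp; all values are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: cnt is exactly casual + registered; temp is unrelated noise.
rng = np.random.default_rng(42)
n = 300
casual = rng.integers(0, 100, size=n)
registered = rng.integers(0, 400, size=n)
temp = rng.random(n)
cnt = casual + registered

def train_rmse(X, y):
    """Fit a linear model and return its RMSE on the same data."""
    model = LinearRegression().fit(X, y)
    return np.sqrt(mean_squared_error(y, model.predict(X)))

# With the leaking columns the model can reconstruct cnt almost perfectly.
leaky_rmse = train_rmse(np.column_stack([temp, casual, registered]), cnt)
clean_rmse = train_rmse(temp.reshape(-1, 1), cnt)
print("With casual/registered:", leaky_rmse)
print("Without them:", clean_rmse)
```

A near-zero error from a plain linear model is a useful warning sign that some feature is encoding the label itself.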

  3. Features to Use
    • Categorical: season, yr, mnth, holiday, weekday, workingday, weathersit
    • Numerical: temp, hum, windspeed, hr

  4. Modeling Pipeline (using BootML)
    Step-by-step:
    1. Import Data: Upload or select
    2. Preprocessing:
      • Drop irrelevant columns: instant, dteday, casual, registered, atemp
      • Encode categorical variables (Label Encoding)
      • Scale numerical features (Min-Max Scaling or Standardization)
    3. Split Data: Train-Test Split (e.g., 80-20)
    4. Model Selection:
      • Try multiple algorithms: Linear Regression, Decision Trees, Random Forest
      • Evaluate using RMSE (Root Mean Squared Error)
    5. Hyperparameter Tuning: Grid Search
    6. Model Evaluation:
      • Predict on test set
      • Calculate RMSE
  5. RMSE and Evaluation
    Once you’ve trained the model, document:
    • RMSE value
    • Which algorithm performed best

  6. Learnings & Observations
    • Feature Impact: Hour (hr) and temperature (temp) were among the most important features influencing demand:
      [(0.5155808715451919, ‘hr’),
      (0.12356670238434536, ‘temp’),
      (0.08205459307754168, ‘hum’),
      (0.04248235547022842, ‘weekday’),
      (0.03417051461685536, ‘windspeed’),
      (0.031871171228578375, ‘workingday’),
      (0.02814935635160167, ‘season’),
      (0.006250378546369198, ‘holiday’),
      (0.005733652331872402, ‘weathersit’),
      (0.004468366144323507, ‘mnth’),
      (0.003447713284487389, ‘yr’)]
    • Seasonality: Demand varies significantly with season and month, as expected.
    • Model Choice: Random Forest performed better than Linear Regression due to its ability to handle non-linearity and feature interactions.
    • RMSE Achieved: RMSE = 51.478727
    • Challenge: correctly encoding categorical variables.

Project Name => Predict Bike Demand in Future

Tool => BootML

Model => supervised learning with regression algorithms to build the model. Performance measure is mean squared error.

DataSet => created the dataset bikeDataSet and uploaded the day.csv data file (type CSV). It is located at “/home/ranjitpandey834075/BootML/Datasets/bikeDataSet_1/”

Data Clean up- Column Discarded => instant, dteday, atemp, casual and registered

Label=> cnt field

The dataset is split into a training set and a test set in the ratio 80:20, with random seed = 42

Stratified sampling => weathersit

Categorical Field => mnth, yr

Used all three algorithms => Linear Regression, Random Forest and Decision Tree were chosen and their RMSE calculated. The final model was then built by fine-tuning hyperparameters with Grid Search.

bikeDataSet_prepared.shape => (584, 22)

Train a Linear Regression model =>

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Calculate the RMSE in Linear Regression Model => 768.5944724413282

K-fold Cross Validation for Linear Regression =>

Scores: [696.03321432 750.45535426 960.59584706 695.0034799 981.00959193
830.59905649 825.37748752 975.82091261 640.40318235 607.21225263]

Mean: 796.251037906218

Standard deviation: 133.3546238995723

K-fold Cross Validation for Decision Tree Model =>

Scores: [ 763.21421122 889.43738645 1014.79785914 901.05760835 1306.80820691
857.39525348 1036.17299015 1255.78573425 1175.76949858 848.94624337]

Mean: 1004.9384991896061

Standard deviation: 176.909767307545

Calculate RMSE in Random Forest model => 256.55570549443365

Cross Validation in Random Forest model =>

Scores: [ 524.46036498 663.81704947 848.49415914 571.27245884 1077.40721374
676.34221889 695.49285665 890.15660905 627.14292476 645.30993013]

Mean: 721.9895785645234

Standard deviation: 159.18044852283458

the score of each hyperparameter combination tested during the grid search =>

714.9833781202965 {‘max_features’: 6, ‘n_estimators’: 10}

687.7435814930792 {‘max_features’: 6, ‘n_estimators’: 30}

724.7979163572528 {‘max_features’: 8, ‘n_estimators’: 10}

687.4134063083247 {‘max_features’: 8, ‘n_estimators’: 30}

686.3799212049184 {‘bootstrap’: False, ‘max_features’: 7, ‘n_estimators’: 20}

677.6198719863118 {‘bootstrap’: False, ‘max_features’: 7, ‘n_estimators’: 35}

717.6562284609751 {‘bootstrap’: False, ‘max_features’: 9, ‘n_estimators’: 20}

707.9118750457284 {‘bootstrap’: False, ‘max_features’: 9, ‘n_estimators’: 35}

score of each attribute in GridSearchCV =>

[(0.2902867422499126, ‘temp’),

(0.19624915015754796, ‘yr’),

(0.17332577979020708, ‘season’),

(0.09894612274579834, ‘mnth’),

(0.05776152697143375, ‘hum’),

(0.037425915375431, ‘windspeed’),

(0.03600668156127903, ‘weathersit’),

(0.020480467635953403, ‘weekday’),

(0.006266844439114171, ‘workingday’),

(0.0033767127366916465, ‘holiday’)]

final_rmse => 674.170774876122

  • The objective of the project was to develop a machine learning model to estimate daily bike rental demand (cnt ) using historical data.
  • The model was built using Azure Machine Learning (Azure ML) as the development and deployment platform.
  • The dataset used was day.csv , uploaded from the path ml/machine_learning/datasets/bike_sharing/day.csv .
  • The features instant , dteday , casual , and registered were dropped during the preprocessing phase.
  • The cnt column, which represents the total count of bike rentals, was selected as the label for the model.
  • The dataset was split into training and testing sets with an 80:20 ratio.
  • Stratification during the data split was based on the hum (humidity) feature to ensure a balanced distribution.
  • Two machine learning models were trained and evaluated: Linear Regression and Decision Tree Regression.
  • The Linear Regression model achieved a Root Mean Squared Error (RMSE) of 1483.99.
  • The Decision Tree Regression model performed better with a lower RMSE of 1369.04.
  • RMSE was used as the evaluation metric to measure model performance.
  • The Decision Tree model’s better performance suggests that the relationship between features and demand is non-linear.
  • Important features influencing demand include season, temperature, humidity, working day status, and weather conditions.
  • Future improvements can include hyperparameter tuning, advanced models like Random Forest or XGBoost, and use of temporal feature engineering.
  • Using the hour.csv dataset can help build more granular models that capture hourly demand fluctuations.

I used BootML for predicting the bike demand in the future.

This is a supervised learning project which is solved using the regression technique.

The performance measure selected is mean squared error.

The columns I discarded were instant, dteday (since they were similar to indexes) and casual, registered (as they are already included in count).

The label is cnt.

I used stratified sampling on season.

I selected categorical variables as: season, year, month, holiday, workingday, weathersit and the rest as numerical variables.

Then applied standardization for feature scaling.

Selected three algorithms to train the model: linear regression, random forest and decision tree.

Tuned hyperparameters using grid search.

RMSE for linear regression model: 752.3000677817149 and after cross-validation it is: 796.9627183639338

RMSE for decision tree model: 0.0 (this model is overfitting) and after cross-validation it is: 959.366275671601

RMSE for random forest model: 262.31828420140647 and after cross-validation it is: 711.5148130640334

Hence the most accurate model here would be random forest.

After tuning the hyperparameters, the final RMSE for the random forest model is 693.9363282988497

Project Env: BootML

Hyperlink: BootML

Project Name: PredictBikeDemandInFuture

Type of Project: Supervised Learning

Type of Supervised Learning: ‘Regression’, as it is about predicting bike sales numbers and not a discrete prediction like a ‘yes’ or a ‘no’

Performance Measure Selected: “Mean Squared Error” as MSE is one of the best performance measures available for “Supervised Learning” of type regression.

Selecting Dataset file: Dataset Selection → Create your own machine learning data set – Uploading “day.csv”

Discard Irrelevant Fields: Discarding column ‘instant’ as it does not assist with prediction directly.

Features & Labels: Column ‘cnt’ is the label and the rest are features.

Split data: Splitting data into 80(training) & 20(test) set and “Random Seed” = 42 & not using stratified sampling

Visualize data: Scatter Plot: X-Axis – ‘mnth’ column & Y-Axis – ‘cnt’ column, this will help visualize monthly bike sales distribution. Transparency = 0.1 to manage visibility of overlapped data. Also select “Generate Correlations” & “Generate Scatter-Matrix”.

Categorical Field Selection: “Fix missing values using” - ‘imputer’ with strategy ‘median’, though there are no missing values in the data. Also, most of the fields are numerical except the date field ‘dteday’, which is therefore treated as categorical.

Feature Scaling: Moving only the ‘casual’ & ‘registered’ columns to ‘Standardization’, as most of the other columns are already normalized and hence belong to the “Min-max” category.

Algorithm Selection: Maintaining the number of folds for cross validations as ‘10’ and select all 3 algorithms, i.e. “Linear Regression”, “Random Forest” & “Decision Tree”. For “Hyper-parameter Tuning” using “Grid search” since the range of values is not that high as seen from the available data set.

Analyzing the jupyter notebook generated:

Running on 15 columns (excluded the instant column) and 731 rows.

Count of bikes sold has positive correlation with “count of registered users”, “count of casual users”, temperature (temp, atemp), etc.

Count of bikes sold has a negative correlation with ‘weathersit’, ‘windspeed’, etc.

forest_rmse = 49.725 & tree_rmse = 0 (overfitting??) & lin_rmse = 1.923

Cross Validation Mean RMSE (Random Forest) = 130.994 (high)

Cross Validation Mean RMSE (Decision Tree) = 248.507

Cross Validation Mean RMSE (Linear Regression) = 36.283 (since cross validation mean RMSE is much higher than RMSE, i.e. more than 18 times hence it means its overfitting the training set, so either regularize or get more training data.)
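The "more than 18 times" figure can be checked with a few lines of arithmetic. The sketch below takes the RMSE and cross-validation mean values reported in this post and computes the ratio for each model; treating the first number as the training RMSE is my reading of the notebook output, not something the post states explicitly.

```python
# (RMSE, cross-validation mean RMSE) as reported in the post above.
results = {
    "Linear Regression": (1.923, 36.283),
    "Decision Tree": (0.0, 248.507),
    "Random Forest": (49.725, 130.994),
}

ratios = {}
for name, (train_rmse, cv_mean) in results.items():
    # An RMSE of exactly 0 makes the ratio undefined, so report infinity.
    ratios[name] = cv_mean / train_rmse if train_rmse > 0 else float("inf")
    print(f"{name}: CV mean is {ratios[name]:.2f}x the RMSE")
```

One plausible reason for the very low Linear Regression RMSE in this run is that ‘casual’ and ‘registered’ were kept as features, and together they add up to ‘cnt’.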

Suggestive Inference: Get more training data or regularize; the nearest fit is the “Random Forest” algorithm