Predict the bike demand in the future

Correlation:
season 0.180080
yr 0.249280
mnth 0.124786
hr 0.395162
holiday -0.031149
weekday 0.027396
workingday 0.031818
weathersit -0.140356
temp 0.400696
hum -0.320436
windspeed 0.087335
cnt 1.000000
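
As a sketch, a correlation column like the one above can be produced with pandas' DataFrame.corr; the frame below is a tiny synthetic stand-in for the real dataset (in the actual notebook the frame would come from pd.read_csv on the dataset file):

```python
import pandas as pd

# Tiny stand-in frame; the real project reads the bike-sharing CSV instead
df = pd.DataFrame({
    "temp": [0.2, 0.4, 0.6, 0.8],
    "hum":  [0.9, 0.7, 0.5, 0.3],
    "cnt":  [10, 40, 80, 120],
})

# Correlation of every column with the label cnt, as in the table above
corr_with_cnt = df.corr()["cnt"].sort_values(ascending=False)
print(corr_with_cnt)
```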

Linear Regression:
RMSE: 140.69
CV RMSE: 140.92

Decision Tree:
RMSE: 0.598
CV RMSE: 60.30

Random Forest:
RMSE: 15.964
CV RMSE: 43.51

Final RMSE: 49.57

Features:
yr, mnth, hr, holiday, weekday, workingday, weathersit, temp, hum, windspeed

Label:
cnt

Used stratified sampling on season
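
A minimal sketch of stratified sampling on season with scikit-learn's train_test_split (synthetic stand-in data; the real split is generated by BootML):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: ten rows per season
df = pd.DataFrame({
    "season": [1, 2, 3, 4] * 10,
    "cnt": range(40),
})

# Stratify on season so both splits keep the same season proportions
train_set, test_set = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["season"]
)
print(test_set["season"].value_counts())
```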

Used Standardization for feature scaling
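
Standardization rescales each feature to zero mean and unit variance; a minimal scikit-learn sketch on stand-in data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in training features (e.g. temp and hum)
X_train = np.array([[0.2, 0.8], [0.4, 0.6], [0.6, 0.4], [0.8, 0.2]])

# Fit on the training set only; the same fitted scaler is reused on test data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```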

Selected the Linear Regression, Random Forest, and Decision Tree algorithms

RMSE:
Decision Tree - 0.5989453436724405
Random Forest Model - 16.136436478906223
Linear Regression Model - 140.42528931467888
Cross Validation Mean RMSE:
Decision Tree - 61.75452780128436
Random Forest Model - 43.85423021350333
Linear Regression Model - 140.66534762287745

Random Forest is chosen as the best model
Final RMSE : 52.01065305006697

The aim is to predict the bike demand in the future

BootML was used to build the model.

Observation:

Linear Regression:

RMSE: 749.263

Cross Validation Mean: 791.954

Cross Validation SD: 115.856

Decision Tree:

RMSE: 0.0

Cross Validation Mean: 967.165

Cross Validation SD: 195.063

Random Forest:

RMSE: 259.807

Cross Validation Mean: 724.681

Cross Validation SD: 150.572

Final RMSE : 719.986

Conclusion

    1. Decision tree is overfitting.
    2. Based on the RMSE values, Random Forest is the best fit. After fine-tuning, the final RMSE is 719.986.

Objective :: Predict the bike demand in the future

I used BootML to build, select, and fine-tune the model, using the day.csv file for training. The details of the activity are highlighted below:

    1. Discarded the following columns: instant, dteday, atemp, casual & registered
    2. Used cnt as the label and the rest as features
    3. Split the data into training and test sets in the ratio 80:20
    4. Kept the following fields as categorical: season, yr, mnth, holiday, weekday, workingday & weathersit
    5. Applied feature scaling (standardization) to the following features: temp, hum & windspeed
    6. The selected algorithms are Linear Regression, Random Forest & Decision Tree
    7. Hyper-parameter tuning using grid search was enabled
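
A hedged sketch of the train-and-compare step for the three selected algorithms, on synthetic data (the real notebook is generated by BootML); it also reproduces the near-zero training RMSE that makes the decision tree look deceptively good:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.random((200, 5))                                  # synthetic features
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 4.0]) + rng.normal(0, 0.1, 200)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=30, random_state=42),
}

rmse = {}
for name, model in models.items():
    model.fit(X, y)
    pred = model.predict(X)
    rmse[name] = np.sqrt(mean_squared_error(y, pred))
    print(name, "training RMSE =", round(rmse[name], 4))
# The fully grown tree memorizes the training set (RMSE ~ 0), which is why
# only cross-validation reveals its true error.
```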

Observation:

Linear Regression:

RMSE: 749.688

Cross Validation Mean: 788.058

Cross Validation SD: 119.255

Decision Tree:

RMSE: 0.0

Cross Validation Mean: 999

Cross Validation SD: 193.095

Random Forest:

RMSE: 259.848

Cross Validation Mean: 730.428

Cross Validation SD: 161.052

Final RMSE : 687.968

Findings :

    1. Random Forest has the lowest RMSE
    2. The cross-validation mean for Random Forest is lower than Linear Regression's, suggesting better generalization
    3. Decision Tree is overfitting
    4. Based on the RMSE, the Random Forest model is the better model and was chosen for fine-tuning, with a final RMSE of 687.968

Bike Demand Prediction – Project using BootML

  1. Objective
    To build a machine learning model that predicts the number of rented bikes (cnt) for a given day using historical features such as weather conditions, date-related parameters, and other contextual data.

  2. Dataset Overview
    The input .csv file includes both numerical and categorical features. The target variable is:
    • cnt: total count of rented bikes, including both casual and registered users.
    The following columns can be dropped or ignored:
    • instant (just an index)
    • dteday (used to derive date parts if needed)
    • casual and registered (they sum up to cnt, so they leak the target)
    • atemp

  3. Features to Use
    • Categorical: season, yr, mnth, holiday, weekday, workingday, weathersit
    • Numerical: temp, hum, windspeed, hr

  4. Modeling Pipeline (using BootML)
    Step-by-step:
    1. Import Data: upload or select the dataset
    2. Preprocessing:
    o Drop irrelevant columns: instant, dteday, casual, registered, atemp
    o Encode categorical variables (label encoding)
    o Scale numerical features (min-max scaling or standardization)
    3. Split Data: train-test split (e.g., 80:20)
    4. Model Selection:
    o Try multiple algorithms: Linear Regression, Decision Tree, Random Forest
    o Evaluate using RMSE (Root Mean Squared Error)
    5. Hyperparameter Tuning: grid search
    6. Model Evaluation:
    o Predict on the test set
    o Calculate RMSE
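
The preprocessing step above could be sketched with scikit-learn's ColumnTransformer (one-hot encoding is used here as an illustration in place of the label encoding mentioned; the frame is a synthetic stand-in):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Stand-in frame with one categorical and two numerical columns
df = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "temp":   [0.2, 0.5, 0.7, 0.3],
    "hum":    [0.8, 0.6, 0.4, 0.9],
})

# Encode the categorical column, standardize the numerical ones
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(), ["season"]),
    ("num", StandardScaler(), ["temp", "hum"]),
])
X = preprocess.fit_transform(df)
print(X.shape)  # 4 one-hot columns + 2 scaled columns
```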

  5. RMSE and Evaluation
    Once you’ve trained the model, document:
    • RMSE value
    • Which algorithm performed best

  6. Learnings & Observations
    • Feature Impact: hour (hr) and temperature (temp) were among the most important features influencing demand.
      [(0.5155808715451919, 'hr'),
       (0.12356670238434536, 'temp'),
       (0.08205459307754168, 'hum'),
       (0.04248235547022842, 'weekday'),
       (0.03417051461685536, 'windspeed'),
       (0.031871171228578375, 'workingday'),
       (0.02814935635160167, 'season'),
       (0.006250378546369198, 'holiday'),
       (0.005733652331872402, 'weathersit'),
       (0.004468366144323507, 'mnth'),
       (0.003447713284487389, 'yr')]
    • Seasonality: Demand varies significantly with season and month, as expected.
    • Model Choice: Random Forest performed better than Linear Regression due to its ability to handle non-linearity and feature interactions.
    • RMSE Achieved: RMSE = 51.478727
    • Challenge: correctly encoding categorical variables.
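
A ranking like the one listed above comes from a fitted forest's feature_importances_ attribute (real scikit-learn API); a toy sketch where the target depends mostly on the first feature:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
features = ["hr", "temp", "hum", "windspeed"]
X = rng.random((300, len(features)))
# The target depends mostly on the first column, so "hr" should rank first
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
ranking = sorted(zip(forest.feature_importances_, features), reverse=True)
print(ranking)
```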

Project Name => Predict Bike Demand in Future

Tool => BootML

Model => supervised learning with regression algorithms. The performance measure is mean squared error.

DataSet => created the dataset bikeDataSet and uploaded the day.csv data file (type: CSV). It is located at "/home/ranjitpandey834075/BootML/Datasets/bikeDataSet_1/"

Data Clean up- Column Discarded => instant, dteday, atemp, casual and registered

Label=> cnt field

The dataset is split into a training set and a test set in the ratio 80:20, with random seed = 42

Stratified sampling => weathersit

Categorical Field => mnth, yr

Used all three algorithms => Linear Regression, Random Forest, and Decision Tree were chosen to calculate RMSE. The final model was then built by fine-tuning hyperparameters with grid search.

bikeDataSet_prepared.shape => (584, 22)

Train a Linear Regression model =>

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Calculate the RMSE in Linear Regression Model => 768.5944724413282

K-fold Cross Validation for Linear Regression =>

Scores: [696.03321432 750.45535426 960.59584706 695.0034799 981.00959193 830.59905649 825.37748752 975.82091261 640.40318235 607.21225263]

Mean: 796.251037906218

Standard deviation: 133.3546238995723
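
The K-fold output above follows scikit-learn's convention; a sketch on synthetic data, noting that cross_val_score returns negated MSE for the "neg_mean_squared_error" scorer:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.random((100, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(0, 0.2, 100)

# cross_val_score returns negated MSE, so negate it before taking the root
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
print("Scores:", rmse_scores)
print("Mean:", rmse_scores.mean())
print("Standard deviation:", rmse_scores.std())
```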

K-fold Cross Validation for Decision Tree Model =>

Scores: [ 763.21421122 889.43738645 1014.79785914 901.05760835 1306.80820691 857.39525348 1036.17299015 1255.78573425 1175.76949858 848.94624337]

Mean: 1004.9384991896061

Standard deviation: 176.909767307545

Calculate RMSE in Random Forest model => 256.55570549443365

Cross Validation in Random Forest model =>

Scores: [ 524.46036498 663.81704947 848.49415914 571.27245884 1077.40721374 676.34221889 695.49285665 890.15660905 627.14292476 645.30993013]

Mean: 721.9895785645234

Standard deviation: 159.18044852283458

The score of each hyperparameter combination tested during the grid search =>

714.9833781202965 {'max_features': 6, 'n_estimators': 10}

687.7435814930792 {'max_features': 6, 'n_estimators': 30}

724.7979163572528 {'max_features': 8, 'n_estimators': 10}

687.4134063083247 {'max_features': 8, 'n_estimators': 30}

686.3799212049184 {'bootstrap': False, 'max_features': 7, 'n_estimators': 20}

677.6198719863118 {'bootstrap': False, 'max_features': 7, 'n_estimators': 35}

717.6562284609751 {'bootstrap': False, 'max_features': 9, 'n_estimators': 20}

707.9118750457284 {'bootstrap': False, 'max_features': 9, 'n_estimators': 35}
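
The eight combinations printed above are consistent with a two-part parameter grid passed to GridSearchCV; a scaled-down sketch (the max_features values are reduced to fit the four-feature toy data, so the grid itself is an assumption):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Two sub-grids: 2x2 with bootstrap on, and 2x2 with bootstrap off,
# giving eight combinations as in the printed output above
param_grid = [
    {"max_features": [2, 3], "n_estimators": [10, 30]},
    {"bootstrap": [False], "max_features": [2, 3], "n_estimators": [20, 35]},
]

rng = np.random.default_rng(42)
X = rng.random((120, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(0, 0.2, 120)

grid = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X, y)

# Print the CV RMSE of every combination, as in the output above
for mean_score, params in zip(grid.cv_results_["mean_test_score"],
                              grid.cv_results_["params"]):
    print(np.sqrt(-mean_score), params)
```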

Score of each attribute in GridSearchCV =>

[(0.2902867422499126, 'temp'),
 (0.19624915015754796, 'yr'),
 (0.17332577979020708, 'season'),
 (0.09894612274579834, 'mnth'),
 (0.05776152697143375, 'hum'),
 (0.037425915375431, 'windspeed'),
 (0.03600668156127903, 'weathersit'),
 (0.020480467635953403, 'weekday'),
 (0.006266844439114171, 'workingday'),
 (0.0033767127366916465, 'holiday')]

final_rmse => 674.170774876122
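
The final_rmse step typically evaluates the best grid-search model on the held-out test set; a synthetic-data sketch with hypothetical hyperparameter values standing in for the grid-search winner:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.random((200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(0, 0.2, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Hypothetical "best" hyperparameters standing in for the grid-search result
final_model = RandomForestRegressor(n_estimators=35, max_features=3,
                                    bootstrap=False, random_state=42)
final_model.fit(X_train, y_train)

final_predictions = final_model.predict(X_test)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
print("final_rmse =>", final_rmse)
```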

  • The objective of the project was to develop a machine learning model to estimate daily bike rental demand (cnt ) using historical data.
  • The model was built using Azure Machine Learning (Azure ML) as the development and deployment platform.
  • The dataset used was day.csv , uploaded from the path ml/machine_learning/datasets/bike_sharing/day.csv .
  • The features instant , dteday , casual , and registered were dropped during the preprocessing phase.
  • The cnt column, which represents the total count of bike rentals, was selected as the label for the model.
  • The dataset was split into training and testing sets with an 80:20 ratio.
  • Stratification during the data split was based on the hum (humidity) feature to ensure a balanced distribution.
  • Two machine learning models were trained and evaluated: Linear Regression and Decision Tree Regression.
  • The Linear Regression model achieved a Root Mean Squared Error (RMSE) of 1483.99.
  • The Decision Tree Regression model performed better with a lower RMSE of 1369.04.
  • RMSE was used as the evaluation metric to measure model performance.
  • The Decision Tree model’s better performance suggests that the relationship between features and demand is non-linear.
  • Important features influencing demand include season, temperature, humidity, working day status, and weather conditions.
  • Future improvements can include hyperparameter tuning, advanced models like Random Forest or XGBoost, and use of temporal feature engineering.
  • Using the hour.csv dataset can help build more granular models that capture hourly demand fluctuations.

I used BootML for predicting the bike demand in the future.

This is a supervised learning project, solved using the regression technique.

The performance measure selected is mean squared error.

The columns I discarded were instant and dteday (since they are similar to indexes), and casual and registered (as they are already included in cnt).

The label is cnt.

I used stratified sampling on season.

I selected season, yr, mnth, holiday, workingday, and weathersit as categorical variables, and the rest as numerical variables.

Then applied standardization for feature scaling.

Selected three algorithms to train the model: linear regression, random forest and decision tree.

Tuned hyperparameters using grid search.

RMSE for the linear regression model: 752.3000677817149; after cross-validation it is 796.9627183639338.

RMSE for the decision tree model: 0.0 (this model is overfitting); after cross-validation it is 959.366275671601.

RMSE for the random forest model: 262.31828420140647; after cross-validation it is 711.5148130640334.

Hence the most accurate model here is random forest.

After tuning the hyperparameters, the final RMSE for the random forest model is 693.9363282988497.

Project Env: BootML

Hyperlink: BootML

Project Name: PredictBikeDemandInFuture

Type of Project: Supervised Learning

Type of Supervised Learning: ‘Regression’, as it is about predicting bike demand numbers and not a discrete prediction like a ‘yes’ or a ‘no’

Performance Measure Selected: “Mean Squared Error” as MSE is one of the best performance measures available for “Supervised Learning” of type regression.

Selecting Dataset file: Dataset Selection → Create your own machine learning data set – Uploading “day.csv”

Discard Irrelevant Fields: Discarding column ‘instant’ as it does not assist with prediction directly.

Features & Labels: Column ‘cnt’ is the label and the rest are features.

Split data: Splitting data into an 80 (training) / 20 (test) split with “Random Seed” = 42; not using stratified sampling

Visualize data: Scatter Plot: X-Axis – ‘mnth’ column & Y-Axis – ‘cnt’ column; this helps visualize the monthly bike demand distribution. Transparency = 0.1 to manage visibility of overlapped data. Also select “Generate Correlations” & “Generate Scatter-Matrix”.
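
The scatter plot described here could be produced with matplotlib along these lines (the column values are synthetic stand-ins and the output filename is hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe to run headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
mnth = rng.integers(1, 13, size=300)               # stand-in for df["mnth"]
cnt = 2000 + 200 * mnth + rng.normal(0, 400, 300)  # stand-in for df["cnt"]

fig, ax = plt.subplots()
ax.scatter(mnth, cnt, alpha=0.1)  # low alpha reveals overlapping points
ax.set_xlabel("mnth")
ax.set_ylabel("cnt")
fig.savefig("monthly_bike_demand.png")  # hypothetical output filename
```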

Categorical Field Selection: “Fix missing values using” - ‘imputer’ with strategy ‘median’, though there are no missing values in the data. Also, most of the fields are numerical except the date field ‘dteday’, hence considering it as categorical.

Feature Scaling: Moving only the ‘casual’ & ‘registered’ columns to ‘Standardization’, as most of the other columns are already normalized and hence belong to the “Min-max” category.

Algorithm Selection: Maintaining the number of folds for cross validations as ‘10’ and select all 3 algorithms, i.e. “Linear Regression”, “Random Forest” & “Decision Tree”. For “Hyper-parameter Tuning” using “Grid search” since the range of values is not that high as seen from the available data set.

Analyzing the jupyter notebook generated:

Running on 15 columns (excluded the instant column) and 731 rows.

Count of bikes rented has a positive correlation with “count of registered users”, “count of casual users”, temperature (temp, atemp), etc.

Count of bikes rented has a negative correlation with ‘weathersit’, ‘windspeed’, etc.

forest_rmse = 49.725 & tree_rmse = 0 (overfitting??) & lin_rmse = 1.923

Cross Validation Mean RMSE (Random Forest) = 130.994 (high)

Cross Validation Mean RMSE (Decision Tree) = 248.507

Cross Validation Mean RMSE (Linear Regression) = 36.283 (the cross-validation mean RMSE is much higher than the training RMSE, more than 18 times, which means the model is overfitting the training set; so either regularize or get more training data.)

Suggested inference: get more training data or regularize; the nearest fit is the “Random Forest” algorithm

RMSE for models:
Linear Regression - 142.55471466066635
Decision Tree - 0.5989453436724405
Random Forest - 16.070527909994066

Decision Tree has the lowest RMSE. However, after cross-validation the RMSE values are:
Linear Regression - 142.60048335718665
Decision Tree - 60.389922919513765
Random Forest - 43.761209948247476

That means the Decision Tree model is overfitting. The Random Forest model has the lowest RMSE after cross-validation and is hence the best of the three.