Predict the bike demand in the future

Correlation with cnt:
season 0.180080
yr 0.249280
mnth 0.124786
hr 0.395162
holiday -0.031149
weekday 0.027396
workingday 0.031818
weathersit -0.140356
temp 0.400696
hum -0.320436
windspeed 0.087335
cnt 1.000000
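For reference, a correlation column like the one above can be computed with pandas. This is a minimal sketch on a tiny synthetic frame; the real input would be the bike-sharing CSV loaded into `df`:

```python
import pandas as pd

# Tiny synthetic stand-in for the bike-sharing data (the real file is day.csv/hour.csv).
df = pd.DataFrame({
    "temp": [0.2, 0.4, 0.6, 0.8],
    "hum":  [0.9, 0.7, 0.5, 0.3],
    "cnt":  [40, 90, 160, 250],
})

# Pearson correlation of every column with the label cnt, highest first.
corr_with_cnt = df.corr()["cnt"].sort_values(ascending=False)
print(corr_with_cnt)
```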

Linear Regression:
RMSE: 140.69
CV RMSE: 140.92

Decision Tree:
RMSE: 0.598
CV RMSE: 60.30

Random Forest:
RMSE: 15.964
CV RMSE: 43.51

Final RMSE: 49.57

Features:
yr, mnth, hr, holiday, weekday, workingday, weathersit, temp, hum, windspeed

Label:
cnt

Used stratified sampling on season

Used Standardization for feature scaling

Selected the Linear Regression, Random Forest, and Decision Tree algorithms
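A minimal scikit-learn sketch of the setup described above (stratified 80:20 split on season, then standardization), using synthetic stand-in data rather than the actual dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 20 rows, four seasons appearing five times each.
df = pd.DataFrame({
    "season": [1, 2, 3, 4] * 5,
    "temp":   [0.05 * i for i in range(20)],
    "cnt":    [10 * i for i in range(20)],
})

# 80:20 split, stratified on season so each season keeps its share in both sets.
train, test = train_test_split(df, test_size=0.2, stratify=df["season"], random_state=42)

# Standardization (zero mean, unit variance), fitted on the training set only.
scaler = StandardScaler()
train_temp = scaler.fit_transform(train[["temp"]])
test_temp = scaler.transform(test[["temp"]])
print(train.shape, test.shape)
```

Fitting the scaler on the training set only (and merely transforming the test set) avoids leaking test-set statistics into training.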

RMSE:
Decision Tree - 0.599
Random Forest Model - 16.136
Linear Regression Model - 140.425
Cross Validation Mean RMSE:
Decision Tree - 61.755
Random Forest Model - 43.854
Linear Regression Model - 140.665

Random Forest is chosen as the best model.
Final RMSE: 52.011
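The pattern behind these numbers (a Decision Tree training RMSE near zero but a cross-validated RMSE around 60) is the classic overfitting signature. A hedged sketch of how both figures are obtained with scikit-learn, on synthetic data:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.random((200, 3))
y = 100 * X[:, 0] + 20 * rng.standard_normal(200)  # noisy target

tree = DecisionTreeRegressor(random_state=42).fit(X, y)

# Training RMSE is near zero: an unpruned tree memorizes the training data.
train_rmse = mean_squared_error(y, tree.predict(X)) ** 0.5

# Cross-validated RMSE is much larger: memorization does not help on held-out folds.
neg_mse = cross_val_score(tree, X, y, scoring="neg_mean_squared_error", cv=10)
cv_rmse = np.sqrt(-neg_mse).mean()
print(train_rmse, cv_rmse)
```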

The aim is to predict the bike demand in the future.

BootML was used to build the model.

Observation:

Linear Regression:

RMSE: 749.263

Cross Validation Mean: 791.954

Cross Validation SD: 115.856

Decision Tree:

RMSE: 0.0

Cross Validation Mean: 967.165

Cross Validation SD: 195.063

Random Forest:

RMSE: 259.807

Cross Validation Mean: 724.681

Cross Validation SD: 150.572

Final RMSE: 719.986

Conclusion

    1. The Decision Tree is overfitting.
    2. Based on the RMSE values, Random Forest is the best fit. After fine-tuning, the final RMSE is 719.986.

Objective: Predict the bike demand in the future

I used BootML to build, select, and fine-tune the model, using the day.csv file for training. The details of the activity are highlighted below:

    1. Discarded the following columns: instant, dteday, atemp, casual & registered
    2. Used cnt as the label and the rest as features
    3. Split the data into training and test sets in the ratio 80:20
    4. Kept the following fields as categorical: season, yr, mnth, holiday, weekday, workingday & weathersit
    5. Applied feature scaling (standardization) to temp, hum & windspeed
    6. Selected the Linear Regression, Random Forest & Decision Tree algorithms
    7. Enabled hyperparameter tuning using grid search
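The preprocessing steps above map roughly onto the following scikit-learn sketch. This uses synthetic stand-in data and a simplified column set; BootML generates its own equivalent code:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in with a few day.csv-style columns.
df = pd.DataFrame({
    "instant": range(10),
    "season": [1, 2] * 5,
    "weathersit": [1, 2] * 5,
    "temp": [0.1 * i for i in range(10)],
    "hum": [0.4 + 0.05 * i for i in range(10)],
    "windspeed": [0.2] * 10,
    "cnt": [100 * i for i in range(10)],
})

# Step 1: discard identifier-style columns (the report also drops dteday, atemp, casual, registered).
df = df.drop(columns=["instant"])

# Steps 2-3: cnt is the label; 80:20 train/test split.
X, y = df.drop(columns=["cnt"]), df["cnt"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Steps 4-5: encode the categorical fields, standardize the numeric ones.
prep = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["season", "weathersit"]),
    ("num", StandardScaler(), ["temp", "hum", "windspeed"]),
])
X_train_prep = prep.fit_transform(X_train)
print(X_train_prep.shape)
```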

Observation:

Linear Regression:

RMSE: 749.688

Cross Validation Mean: 788.058

Cross Validation SD: 119.255

Decision Tree:

RMSE: 0.0

Cross Validation Mean: 999

Cross Validation SD: 193.095

Random Forest:

RMSE: 259.848

Cross Validation Mean: 730.428

Cross Validation SD: 161.052

Final RMSE: 687.968

Findings :

    1. Random Forest has the lowest RMSE.
    2. The cross-validation mean for Random Forest is lower than for Linear Regression, suggesting better generalization.
    3. The Decision Tree is overfitting.
    4. Based on the RMSE, the Random Forest model is the better model and was chosen for fine-tuning, with a final RMSE of 687.968.

Bike Demand Prediction – Project using BootML

  1. Objective
    To build a machine learning model that predicts the number of rented bikes (cnt) for a given day using historical features such as weather conditions, date-related parameters, and other contextual data.

  2. Dataset Overview
    The input .csv file includes both numerical and categorical features. The target variable is:
    • cnt: total count of rental bikes, including both casual and registered.
    You can drop or ignore:
    • instant (just an index)
    • dteday (used to derive date parts if needed)
    • casual and registered (they sum up to cnt, so they leak the target)
    • atemp

  3. Features to Use
    • Categorical: season, yr, mnth, holiday, weekday, workingday, weathersit
    • Numerical: temp, hum, windspeed, hr

  4. Modeling Pipeline (using BootML)
    Step-by-step:
    1. Import Data: Upload or select the dataset
    2. Preprocessing:
    o Drop irrelevant columns: instant, dteday, casual, registered, atemp
    o Encode categorical variables (Label Encoding)
    o Scale numerical features (Min-Max Scaling or Standardization)
    3. Split Data: Train-Test Split (e.g., 80:20)
    4. Model Selection:
    o Try multiple algorithms: Linear Regression, Decision Trees, Random Forest
    o Evaluate using RMSE (Root Mean Squared Error)
    5. Hyperparameter Tuning: Grid Search
    6. Model Evaluation:
    o Predict on the test set
    o Calculate RMSE
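The encoding and scaling choices in step 2 can be sketched with scikit-learn's OrdinalEncoder (a label-style encoder) and MinMaxScaler; the values here are synthetic:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

df = pd.DataFrame({
    "weathersit": [1, 2, 3, 2],           # categorical
    "temp":       [0.2, 0.5, 0.8, 0.35],  # numerical
})

# Label-style encoding: each category becomes an integer code (1 -> 0, 2 -> 1, 3 -> 2).
df["weathersit"] = OrdinalEncoder().fit_transform(df[["weathersit"]])

# Min-max scaling: rescale the numeric feature into [0, 1].
df["temp"] = MinMaxScaler().fit_transform(df[["temp"]])
print(df)
```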

  5. RMSE and Evaluation
    Once you’ve trained the model, document:
    • RMSE value
    • Which algorithm performed best

  6. Learnings & Observations
    • Feature Impact: Hour (hr) and temperature (temp) were the most important features influencing demand.
    • [(0.5156, 'hr'),
      (0.1236, 'temp'),
      (0.0821, 'hum'),
      (0.0425, 'weekday'),
      (0.0342, 'windspeed'),
      (0.0319, 'workingday'),
      (0.0281, 'season'),
      (0.0063, 'holiday'),
      (0.0057, 'weathersit'),
      (0.0045, 'mnth'),
      (0.0034, 'yr')]
    • Seasonality: Demand varies significantly with season and month, as expected.
    • Model Choice: Random Forest performed better than Linear Regression due to its ability to handle non-linearity and feature interactions.
    • RMSE Achieved: 51.479
    • Challenge: correctly encoding categorical variables.
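The importance ranking above is what you get by pairing a fitted Random Forest's feature_importances_ with the feature names. A minimal sketch on synthetic data (the feature names and coefficients are assumptions for illustration only):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
features = ["hr", "temp", "hum"]  # hypothetical names for illustration
X = rng.random((300, 3))
y = 10 * X[:, 0] + 2 * X[:, 1] + rng.standard_normal(300)  # "hr" dominates by design

forest = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

# Pair each importance with its feature name and sort, highest first.
ranked = sorted(zip(forest.feature_importances_, features), reverse=True)
print(ranked)
```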

Project Name => Predict Bike Demand in Future

Tool => BootML

Model => supervised learning with regression algorithms. The performance measure is root mean squared error (RMSE).

DataSet => created the dataset bikeDataSet and uploaded the day.csv data file (type: CSV). It is located at "/home/ranjitpandey834075/BootML/Datasets/bikeDataSet_1/"

Data Clean-up – Columns Discarded => instant, dteday, atemp, casual and registered

Label=> cnt field

The dataset is split into a training set and a test set in the ratio 80:20 with random seed = 42

Stratified sampling => weathersit

Categorical Field => mnth, yr

Used all three algorithms => Linear Regression, Random Forest and Decision Tree were chosen and their RMSE calculated. Then, by fine-tuning the hyperparameters with Grid Search, the final model was built.

bikeDataSet_prepared.shape => (584, 22)

Train a Linear Regression model =>

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Calculate the RMSE in Linear Regression Model => 768.5944724413282

K-fold Cross Validation for Linear Regression =>

Scores: [696.03321432 750.45535426 960.59584706 695.0034799 981.00959193 830.59905649 825.37748752 975.82091261 640.40318235 607.21225263]

Mean: 796.251037906218

Standard deviation: 133.3546238995723
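The Scores/Mean/Standard deviation triples above come from 10-fold cross-validation. A hedged sketch of the computation on synthetic data, with the per-fold RMSE recovered from scikit-learn's negated MSE scores:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.random((100, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + 0.1 * rng.standard_normal(100)

# scikit-learn returns negative MSE; negate and take the square root for per-fold RMSE.
neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-neg_mse)
print("Scores:", rmse_scores)
print("Mean:", rmse_scores.mean())
print("Standard deviation:", rmse_scores.std())
```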

K-fold Cross Validation for Decision Tree Model =>

Scores: [763.21421122 889.43738645 1014.79785914 901.05760835 1306.80820691 857.39525348 1036.17299015 1255.78573425 1175.76949858 848.94624337]

Mean: 1004.9384991896061

Standard deviation: 176.909767307545

Calculate RMSE in Random Forest model => 256.55570549443365

Cross Validation in Random Forest model =>

Scores: [524.46036498 663.81704947 848.49415914 571.27245884 1077.40721374 676.34221889 695.49285665 890.15660905 627.14292476 645.30993013]

Mean: 721.9895785645234

Standard deviation: 159.18044852283458

The score of each hyperparameter combination tested during the grid search =>

714.9833781202965 {'max_features': 6, 'n_estimators': 10}

687.7435814930792 {'max_features': 6, 'n_estimators': 30}

724.7979163572528 {'max_features': 8, 'n_estimators': 10}

687.4134063083247 {'max_features': 8, 'n_estimators': 30}

686.3799212049184 {'bootstrap': False, 'max_features': 7, 'n_estimators': 20}

677.6198719863118 {'bootstrap': False, 'max_features': 7, 'n_estimators': 35}

717.6562284609751 {'bootstrap': False, 'max_features': 9, 'n_estimators': 20}

707.9118750457284 {'bootstrap': False, 'max_features': 9, 'n_estimators': 35}
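The eight score/parameter pairs above are consistent with a two-part parameter grid (four bootstrapped combinations plus four with bootstrap disabled). A sketch of such a GridSearchCV setup on synthetic data; the grid itself is inferred from the listed combinations:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X = rng.random((80, 9))  # 9 features so max_features=9 is valid
y = X.sum(axis=1) + 0.1 * rng.standard_normal(80)

# Two sub-grids: 2x2 bootstrapped combinations plus 2x2 with bootstrap=False,
# giving the eight candidates scored above.
param_grid = [
    {"max_features": [6, 8], "n_estimators": [10, 30]},
    {"bootstrap": [False], "max_features": [7, 9], "n_estimators": [20, 35]},
]
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      scoring="neg_mean_squared_error", cv=3)
search.fit(X, y)
for mean_score, params in zip(search.cv_results_["mean_test_score"],
                              search.cv_results_["params"]):
    print(np.sqrt(-mean_score), params)
```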

Score of each attribute (feature importance) from the best GridSearchCV model =>

[(0.2902867422499126, 'temp'),

(0.19624915015754796, 'yr'),

(0.17332577979020708, 'season'),

(0.09894612274579834, 'mnth'),

(0.05776152697143375, 'hum'),

(0.037425915375431, 'windspeed'),

(0.03600668156127903, 'weathersit'),

(0.020480467635953403, 'weekday'),

(0.006266844439114171, 'workingday'),

(0.0033767127366916465, 'holiday')]

final_rmse => 674.170774876122
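A final_rmse figure like this is typically obtained by evaluating the grid search's best estimator on the held-out test set. A hedged sketch with a stand-in model and synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.random((200, 5))
y = 50 * X[:, 0] + 5 * rng.standard_normal(200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Stand-in for grid_search.best_estimator_ (in BootML this is the tuned forest).
final_model = RandomForestRegressor(n_estimators=35, max_features=3,
                                    random_state=42).fit(X_train, y_train)
final_predictions = final_model.predict(X_test)
final_rmse = mean_squared_error(y_test, final_predictions) ** 0.5
print("final_rmse =>", final_rmse)
```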

  • The objective of the project was to develop a machine learning model to estimate daily bike rental demand (cnt) using historical data.
  • The model was built using Azure Machine Learning (Azure ML) as the development and deployment platform.
  • The dataset used was day.csv, uploaded from the path ml/machine_learning/datasets/bike_sharing/day.csv.
  • The features instant, dteday, casual, and registered were dropped during the preprocessing phase.
  • The cnt column, which represents the total count of bike rentals, was selected as the label for the model.
  • The dataset was split into training and testing sets with an 80:20 ratio.
  • Stratification during the data split was based on the hum (humidity) feature to ensure a balanced distribution.
  • Two machine learning models were trained and evaluated: Linear Regression and Decision Tree Regression.
  • The Linear Regression model achieved a Root Mean Squared Error (RMSE) of 1483.99.
  • The Decision Tree Regression model performed better with a lower RMSE of 1369.04.
  • RMSE was used as the evaluation metric to measure model performance.
  • The Decision Tree model’s better performance suggests that the relationship between features and demand is non-linear.
  • Important features influencing demand include season, temperature, humidity, working day status, and weather conditions.
  • Future improvements can include hyperparameter tuning, advanced models like Random Forest or XGBoost, and use of temporal feature engineering.
  • Using the hour.csv dataset can help build more granular models that capture hourly demand fluctuations.
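One detail worth noting about stratifying on hum: stratified sampling needs discrete classes, so a continuous feature has to be binned first. A sketch of that step on synthetic data (the bin edges are arbitrary, chosen here for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
df = pd.DataFrame({"hum": rng.random(100), "cnt": rng.integers(0, 5000, 100)})

# Bin the continuous humidity values into discrete categories (arbitrary quartile edges).
df["hum_cat"] = pd.cut(df["hum"], bins=[0.0, 0.25, 0.5, 0.75, 1.0], labels=[1, 2, 3, 4])

# The binned column can now drive a stratified 80:20 split.
train, test = train_test_split(df, test_size=0.2, stratify=df["hum_cat"], random_state=42)
print(train["hum_cat"].value_counts(normalize=True).sort_index())
```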