Project Objective: Predict future bike demand.
First I tried looking at the bigger picture by
A) Understanding the business objective and how the created solution will help the business.
B) Framing the problem: we will be using supervised regression, as we need to predict the bike count (‘cnt’), which is a continuous variable, and we already have the input features and the expected labels in the dataset, which makes it a supervised problem.
I figured out that I needed to build different models and choose the best one based on the chosen performance metric.
C) I chose Root Mean Square Error (RMSE) as the performance metric. RMSE is a typical performance measure for regression problems: it is the square root of the Mean Squared Error (MSE), which is the average of the squared errors of the predictions.
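As a quick worked example (with made-up numbers, not the project's data), RMSE can be computed directly with NumPy:

```python
import numpy as np

# RMSE: square root of the average of the squared prediction errors.
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

mse = np.mean((y_true - y_pred) ** 2)   # (4 + 4 + 9) / 3
rmse = np.sqrt(mse)
print(round(rmse, 4))  # 2.3805
```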
In this project, I used BootML to train models with three different algorithms, namely Linear Regression, Decision Tree and Random Forest, with the goal of selecting the most accurate one.
Import the data:
- Load the dataset from the specified path ‘/cxldata/datasets/bootml/Bikes_Data_1’.
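A minimal loading sketch with pandas, assuming the dataset is a CSV with the standard bike-sharing columns; a small in-memory CSV stands in for the file at that path so the snippet runs anywhere:

```python
import io
import pandas as pd

# In the project the data lives at '/cxldata/datasets/bootml/Bikes_Data_1';
# here two sample rows stand in for it.
csv_text = """instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.80,0.0,8,32,40
"""
bikes = pd.read_csv(io.StringIO(csv_text))
print(bikes.shape)  # (2, 17)
```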
Analyze the data:
- Explore the dataset to understand its structure and features.
Drop the irrelevant fields:
- Discard unwanted fields such as ‘instant’, ‘dteday’, ‘atemp’, ‘casual’, ‘registered’ to reduce complexity.
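A sketch of the drop step with pandas, using a toy frame in place of the loaded dataset. Note that ‘casual’ and ‘registered’ sum to the label ‘cnt’, so keeping them would leak the answer into the features:

```python
import pandas as pd

# Toy frame with the raw columns; in the project this comes from the loaded dataset.
bikes = pd.DataFrame({
    'instant': [1, 2], 'dteday': ['2011-01-01', '2011-01-01'],
    'atemp': [0.2879, 0.2727], 'casual': [3, 8], 'registered': [13, 32],
    'hr': [0, 1], 'temp': [0.24, 0.22], 'cnt': [16, 40],
})

# Identifiers, the redundant 'atemp', and the leakage-prone
# 'casual'/'registered' columns are discarded.
bikes = bikes.drop(columns=['instant', 'dteday', 'atemp', 'casual', 'registered'])
print(list(bikes.columns))  # ['hr', 'temp', 'cnt']
```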
Understand the data:
- Identify the features and the target variable(label).
- Here, our features were ‘season’, ‘yr’, ‘mnth’, ‘hr’, ‘holiday’, ‘weekday’, ‘workingday’, ‘weathersit’,
‘temp’, ‘hum’, ‘windspeed’ while our label was ‘cnt’.
- Understand the distribution and relationship between variables.
Split the data into train and test:
- Divide the dataset into training and testing sets, typically using an 80:20 ratio.
- I stratified the data based on weathersit.
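One way to sketch this split, assuming scikit-learn's `train_test_split` with its `stratify` parameter (synthetic data stands in for the real frame):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 100
bikes = pd.DataFrame({
    'weathersit': rng.integers(1, 4, n),
    'temp': rng.random(n),
    'cnt': rng.integers(0, 500, n),
})

# 80:20 split, stratified on 'weathersit' so both sets see the
# same mix of weather conditions.
train_set, test_set = train_test_split(
    bikes, test_size=0.2, random_state=42, stratify=bikes['weathersit'])
print(len(train_set), len(test_set))  # 80 20
```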
Analyze the data through visualizations:
- Plot graphs to visualize the relationships between different variables and gain a deeper understanding of the contributing factors.
- Identify any patterns or trends in the data.
- After visualizing the data, it was found that the factors negatively affecting bike demand were ‘holiday’, ‘weathersit’ and ‘hum’.
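The sign of each feature's correlation with ‘cnt’ gives a quick numerical check of what the plots show. A sketch on synthetic data with a built-in positive temperature effect and negative humidity effect:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
temp = rng.random(n)
hum = rng.random(n)
# Synthetic demand: rises with temperature, falls with humidity.
cnt = 200 * temp - 100 * hum + rng.normal(0, 10, n)
bikes = pd.DataFrame({'temp': temp, 'hum': hum, 'cnt': cnt})

# Correlation of each feature with the label mirrors the scatter-plot
# reading: positive for temp, negative for hum.
corr = bikes.corr()['cnt'].drop('cnt')
print(corr['temp'] > 0, corr['hum'] < 0)  # True True
```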
Data Cleaning & Preprocessing:
- Clean the data by handling missing values using techniques like imputation.
- Scikit-Learn’s Imputer class (SimpleImputer in current versions) is used to fill missing values: each missing entry is replaced with a statistic such as the mean or median of its column. If a field has too many missing values, the affected rows or the entire column can be dropped instead.
- The text (categorical) features are set aside first, because the imputer works on numerical attributes only.
- I selected “median” as the imputation strategy.
- Scale the numerical features using standard feature scaling (standardization) to bring them to a similar scale. All the numerical values, such as ‘yr’, ‘hr’, ‘temp’, ‘hum’ and ‘windspeed’, are scaled.
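These two preprocessing steps can be chained in a scikit-learn Pipeline; a minimal sketch using SimpleImputer (the current name of the Imputer class) on a toy numeric array:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric columns (hr, temp, hum) with one missing humidity value.
X = np.array([
    [0, 0.24, 0.81],
    [1, 0.22, np.nan],
    [2, 0.30, 0.60],
])

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # fill NaN with column median
    ('scaler', StandardScaler()),                   # zero mean, unit variance
])
X_prepared = num_pipeline.fit_transform(X)
print(np.isnan(X_prepared).any())  # False
```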
Our data is now ready to be fed to the ML algorithms. We will feed it to the different models and figure out which one scores best in terms of the chosen performance metric.
- Train the model
- Train machine learning models using different algorithms like Linear Regression, Decision Tree, and Random Forest.
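A sketch of this comparison using cross-validated RMSE on synthetic data (the hyperparameters here are illustrative, not the project's exact configuration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
X = rng.random((120, 3))
y = 100 * X[:, 0] + 50 * X[:, 1] + rng.normal(0, 5, 120)

models = {
    'linear': LinearRegression(),
    'tree': DecisionTreeRegressor(random_state=42),
    'forest': RandomForestRegressor(n_estimators=30, random_state=42),
}
for name, model in models.items():
    # Negated MSE is scikit-learn's convention; flip the sign before sqrt.
    scores = cross_val_score(model, X, y,
                             scoring='neg_mean_squared_error', cv=5)
    rmse_scores = np.sqrt(-scores)
    print(name, rmse_scores.mean().round(2), rmse_scores.std().round(2))
```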
After training the different ML models, we obtain the following results (the training-set RMSE, plus the mean and standard deviation of the cross-validation RMSE scores):
Linear Regression model:
RMSE: 142.72311067462795
Mean: 142.77813742724223
Standard deviation: 3.7030212904153004
Decision Tree model:
RMSE: 0.5989453436724405
Mean: 59.78945853931843
Standard deviation: 3.251904916375031
Random Forest model:
RMSE: 15.984077447144026
Mean: 43.49698996838114
Standard deviation: 2.4391843492444463
- Fine-tune the model
- Tune hyperparameters of the models using techniques like GridSearchCV or RandomizedSearchCV to improve performance.
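A grid search along these lines can be sketched as follows (the parameter grid is illustrative, and synthetic data stands in for the prepared training set):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(7)
X = rng.random((100, 3))
y = 100 * X[:, 0] + 50 * X[:, 1] + rng.normal(0, 5, 100)

# Try every combination of these hyperparameters with 3-fold CV.
param_grid = {'n_estimators': [10, 30], 'max_features': [1, 2, 3]}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42), param_grid,
    scoring='neg_mean_squared_error', cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)
```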
The best hyperparameter combination:
grid_search.best_params_ : {‘max_features’: 8, ‘n_estimators’: 30}
Importance score of each attribute:
[(0.5844748088424341, ‘hr’),
(0.12946274868458668, ‘temp’),
(0.07775038801366478, ‘yr’),
(0.05910890006087216, ‘workingday’),
(0.039755394080211844, ‘hum’),
(0.028163876532795794, ‘season’),
(0.027393454181294748, ‘weekday’),
(0.01926425477094257, ‘mnth’),
(0.01890966484897769, ‘weathersit’),
(0.013190030433310636, ‘windspeed’),
(0.0025264795509088818, ‘holiday’)]
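The ranking above comes from pairing the fitted forest's `feature_importances_` with the column names; a sketch on synthetic data where the first feature (standing in for ‘hr’) dominates the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.random((150, 3))
# The first column drives the target far more strongly than the others.
y = 10 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 150)

forest = RandomForestRegressor(n_estimators=30, random_state=42).fit(X, y)
features = ['hr', 'temp', 'windspeed']
# Sort (importance, name) pairs from most to least important.
ranking = sorted(zip(forest.feature_importances_, features), reverse=True)
print(ranking[0][1])  # hr
```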
This indicates that ‘hr’ and ‘temp’ are the strongest contributors to bike demand.
Upon analysis, it’s clear that while the Decision Tree model initially appears favorable due to its low training RMSE, further examination reveals significant overfitting, indicated by the large gap between its training RMSE and its cross-validation mean. Consequently, we discard this model.
Turning to the Linear Regression and Random Forest models: the former shows close agreement between its training RMSE and cross-validation mean, but it suffers from a high error overall. The Random Forest model, although it overfits to some degree, achieves a much lower cross-validation error and can be improved further with more training data and tuning.
In summary, despite the Linear Regression model’s closer agreement between training and cross-validation error, the Random Forest model holds the greater potential for delivering accurate predictions.
Best Estimator : Random Forest
- Evaluate the model on the test set
- Finally, I validated the model on the test data to confirm its ability to generalize.
Final RMSE : 41.96929332427234
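The final evaluation step can be sketched as follows, again on synthetic data rather than the project's test set (so the RMSE printed here is not comparable to the value above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.random((200, 3))
y = 100 * X[:, 0] + 50 * X[:, 1] + rng.normal(0, 5, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Refit the best estimator on the full training set, then score it once
# on held-out data to estimate generalization error.
final_model = RandomForestRegressor(
    n_estimators=30, max_features=2, random_state=42).fit(X_train, y_train)
final_rmse = np.sqrt(mean_squared_error(y_test, final_model.predict(X_test)))
print(final_rmse > 0)  # True
```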