Product Demand Forecasting

I was unable to run the project on my CloudxLab account, so I did it in my local Anaconda environment instead.

The input data for Product Demand Forecasting is the historical sales file (location: /cxldata/datasets/project/demand_sales_orders_2014_15.csv).

The features in the dataset were CustomerID, OrderDate, SalesOrderNumber, TerritoryID,
ProductID, UnitPrice, and OrderQty.

Some additional features that I created are year, month, week_day, week_of_month, Season,
is_holiday, Product_cat, Product_sub_category, and year_cat.

The year, month, week_day, and week_of_month features were extracted from the OrderDate feature after first converting it to a datetime object.
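As a rough sketch of that extraction (not the project's actual code; the mini-frame below is made up, but the column names follow the dataset):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the sales CSV.
df = pd.DataFrame({"OrderDate": ["2014-07-04", "2015-01-15", "2015-12-31"]})

# Parse to datetime first, then derive the calendar features.
df["OrderDate"] = pd.to_datetime(df["OrderDate"])
df["year"] = df["OrderDate"].dt.year
df["month"] = df["OrderDate"].dt.month
df["week_day"] = df["OrderDate"].dt.dayofweek                 # Monday = 0
df["week_of_month"] = (df["OrderDate"].dt.day - 1) // 7 + 1   # 1-based week within the month
```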

For the Season feature I wrote a function called getseason, and for is_holiday I used the holidays module.
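A minimal sketch of what getseason might look like; the exact month-to-season split is my assumption, not taken from the project. The commented lines show how the third-party `holidays` package is typically used for an is_holiday flag.

```python
def getseason(month):
    """Map a calendar month to a season label (Northern-hemisphere split; assumption)."""
    if month in (12, 1, 2):
        return "WINTER"
    if month in (3, 4, 5):
        return "SPRING"
    if month in (6, 7, 8):
        return "SUMMER"
    return "FALL"

# Typical is_holiday usage with the `holidays` package (requires `pip install holidays`):
#   import holidays
#   us_holidays = holidays.US()
#   df["is_holiday"] = df["OrderDate"].dt.date.apply(lambda d: int(d in us_holidays))
```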

After performing exploratory data analysis, I found the following important facts:

1. Products vary widely in how often they are ordered: ProductID 870 is ordered the most (9,210 times), while ProductID 897 is ordered the least (2 times). Thus every product has been ordered more than once.

2. The most orders come from TerritoryID 4 and 1, while TerritoryID 2 has the fewest.

3. Some customers order a single product in large quantities: the highest order quantity by a customer for a single product is 72, while the lowest is 1.

4. Many CustomerIDs are repeated, which means the same customer is ordering different products.

5. A large number of products are ordered from Territory 4 and 1, while Territory 2 has the least number of orders.

6. Products sell most in the FALL and SUMMER seasons, and least in the SPRING season.

7. The ProductIDs are just random numbers; they are not assigned any meaning or ordered by the demand for the product. However, ProductID is an important feature for predicting demand. There are 266 unique ProductIDs and they could be treated as categorical variables, but directly one-hot encoding them would create 265 extra features, which would slow down the ML algorithm. The solution is to categorize ProductIDs based on their demand range and then encode the categories.

Sampling Process

I defined a function to categorize each product based on the total number of times it has been ordered, and then created a dictionary from DataFrame Z with ProductID as the key and Product_cat as the value. This dictionary is later used to map the ProductID column of the original DataFrame into the Product_cat feature.
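A sketch of this count-then-bin idea on toy data (the bin edges and category names here are my assumptions, not the project's):

```python
import pandas as pd

# Toy orders frame; ProductID frequencies stand in for real demand counts.
orders = pd.DataFrame({"ProductID": [870] * 6 + [871] * 3 + [897]})

# Count how often each product was ordered and bin the counts into demand categories.
counts = orders["ProductID"].value_counts().reset_index()
counts.columns = ["ProductID", "n_orders"]
counts["Product_cat"] = pd.cut(
    counts["n_orders"], bins=[0, 2, 5, float("inf")],
    labels=["Category_low", "Category_mid", "Category_high"],
)

# Dictionary {ProductID: Product_cat}, then map back onto the original frame.
cat_map = dict(zip(counts["ProductID"], counts["Product_cat"]))
orders["Product_cat"] = orders["ProductID"].map(cat_map)
```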

After creating the Product_cat feature, I found that some categories are ordered far less than others, which is expected. The problem is that a plain train/test split could miss some of the rare categories, such as Category_9 and Category_10. So it is better to create another feature, Product_sub_category, which is later used in sampling.

I defined a function to create the Product_sub_category feature, which helps ensure a significant amount of every category in both the train and test sets.

Sales in 2016 account for less than 1% of the total and sales in 2014 for 27.23%, while around 72.4% of total sales took place in 2015.

Since sales in 2016 are extremely low, a plain random split could miss some of the unique products sold in 2016. We therefore need stratified sampling, but before that I created another feature, year_cat.

Finally, sampling is done based on the year_cat, Product_sub_category, and is_holiday features. I added is_holiday because some unique products may sell more on holidays than on normal days, and since the number of holidays is extremely small, it should be used during sampling along with year_cat and Product_sub_category.
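One common way to stratify on several columns at once is to concatenate them into a single label and pass it to scikit-learn's `stratify=` argument. A minimal sketch on toy data (each stratum below has at least two rows, as the splitter requires):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the three stratification keys (values are made up).
df = pd.DataFrame({
    "year_cat":             ["2014", "2014", "2015", "2015", "2014", "2014", "2015", "2015"],
    "Product_sub_category": ["A",    "A",    "B",    "B",    "B",    "B",    "A",    "A"],
    "is_holiday":           [0,      0,      0,      0,      1,      1,      0,      0],
    "OrderQty":             [1, 2, 3, 4, 5, 6, 7, 8],
})

# Combine the three keys into one stratification label, then split on it.
strata = (df["year_cat"] + "_" + df["Product_sub_category"]
          + "_" + df["is_holiday"].astype(str))
train, test = train_test_split(df, test_size=0.5, stratify=strata, random_state=42)
```

With two rows per stratum and a 50/50 split, every stratum lands in both the train and test sets.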

After sampling, the year_cat and Product_sub_category features were dropped from both sets, as they are of no further use.

Preprocessing Data

One-hot encoding is applied to TerritoryID, year, month, week_day, week_of_month, Season, is_holiday, and Product_cat, and scaling to UnitPrice.

After OneHotEncoding we finally got 41 features.
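A sketch of this preprocessing with a `ColumnTransformer` (the toy values are made up; the real data produces 41 features, the toy frame far fewer):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with the columns the write-up preprocesses (values are made up).
X = pd.DataFrame({
    "TerritoryID": [1, 2, 4],
    "year": [2014, 2015, 2015],
    "month": [1, 6, 12],
    "week_day": [0, 3, 5],
    "week_of_month": [1, 2, 4],
    "Season": ["WINTER", "SUMMER", "WINTER"],
    "is_holiday": [0, 1, 0],
    "Product_cat": ["Category_1", "Category_2", "Category_1"],
    "UnitPrice": [9.99, 24.5, 3.0],
})

cat_cols = ["TerritoryID", "year", "month", "week_day",
            "week_of_month", "Season", "is_holiday", "Product_cat"]

# One-hot encode the categoricals, scale the numeric price column.
pre = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("scale", StandardScaler(), ["UnitPrice"]),
])
X_t = pre.fit_transform(X)
```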

Training Models

I trained three models: LinearRegression, DecisionTreeRegressor, and RandomForestRegressor.
Of the three, the LinearRegression model has the largest RMSE (2.5809), the RandomForestRegressor has an RMSE of 1.0185, and the DecisionTreeRegressor has the lowest RMSE (0.9578).
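The train-and-compare loop can be sketched as follows, on synthetic data standing in for the encoded feature matrix (so the RMSE values and their ordering here will not match the project's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the encoded feature matrix.
rng = np.random.RandomState(42)
X = rng.rand(300, 5)
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 4.0]) + rng.normal(0, 0.3, 300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}
rmse = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse[name] = np.sqrt(mean_squared_error(y_te, pred))  # version-portable RMSE
```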

The LinearRegression model may be under-fitting, since the bar plot shows a large difference between predicted and actual values, whereas the RandomForest and DecisionTree models give good results in the bar plot. We therefore need cross-validation to see which models are over- or under-fitting and to finally choose the best model.

After cross-validation, the RandomForestRegressor model has the best score.
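For reference, cross-validated RMSE with scikit-learn looks like this (toy data again; `scoring="neg_mean_squared_error"` returns negated MSE, so we flip the sign and take the root per fold):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(120, 4)
y = X.sum(axis=1) + rng.normal(0, 0.1, 120)

# Negated MSE comes back from sklearn; flip the sign and take the root for RMSE per fold.
scores = cross_val_score(RandomForestRegressor(n_estimators=30, random_state=0),
                         X, y, scoring="neg_mean_squared_error", cv=5)
rmse_scores = np.sqrt(-scores)
```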

I then fine-tuned the model and evaluated the test data on the best estimator and features found by each search.

The RMSE of the Random Forest Model using RandomizedSearchCV is: 1.5653147552217777

The RMSE of the Random Forest Model using GridSearchCV is: 1.5650069756179146

Both searches give almost the same RMSE on the test data.
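The two searches can be sketched as follows; the parameter grid here is a made-up example, not the project's actual search space:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

rng = np.random.RandomState(1)
X = rng.rand(100, 4)
y = X.sum(axis=1) + rng.normal(0, 0.1, 100)

# Hypothetical search space; the real project would tune more parameters.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

# Exhaustive search over the grid.
grid = GridSearchCV(RandomForestRegressor(random_state=1), param_grid,
                    scoring="neg_mean_squared_error", cv=3)
grid.fit(X, y)

# Random sampling of the same space, capped at n_iter candidates.
rand = RandomizedSearchCV(RandomForestRegressor(random_state=1), param_grid,
                          n_iter=3, scoring="neg_mean_squared_error",
                          cv=3, random_state=1)
rand.fit(X, y)

best_grid = grid.best_estimator_   # refit on the full data by default
best_rand = rand.best_estimator_
```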

The full code is on GitHub; please point out where I should improve the models.