End to End Project - Product Demand Forecasting

Hi,

I used the following approach. Please share your feedback.

  1. I ruled out multivariate time series models. The main reason is that the time index is not evenly spaced, so the index frequency could not be set, which time series analysis needs as a starting point. This is not impossible to work around: one could fit a univariate model for each territory and product type and impute 0 for the dates on which no sales happened. Still, the usual time series approach suggested additional features to include, for example lagged sales (sales on the previous day and in the previous week) and first- and second-order differences in sales.

  2. After basic exploratory analysis and removing the duplicates, I added holiday, week number, month, and weekend information based on the Order Date. I also added the previous day's sales and the difference in sales as additional features.

  3. Just to try out the approach described next, I intentionally picked the sales of all product types for a single Territory. So my dataset has the product IDs of all products, their total order quantities per date of sale, and the features I had added earlier.

  4. As in time series analysis, I partitioned the dataset into training and testing sets by sequential date range, without shuffling. From these I carved out X_train, y_train, X_test and y_test.

  5. I chose to use ensemble methods for this dataset. I tried a Random Forest regressor and a Gradient Boosting regressor.

  6. The RMSEs of the two models are close, with Gradient Boosting's marginally higher. For both I used n_estimators = 1000. I can tune them further using grid search.

  7. The models returned an RMSE of around 4. I then created a dataframe of y_test and the predictions. The predictions are very close to the true sales.

  8. An RMSE of 4 might sound bad. But the distribution of per-day sales is skewed, so a few high-sales days may account for much of that error.

  9. Next, I want to generalize this approach to the remaining territories.
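To make steps 2 to 4 concrete, here is a minimal sketch of the feature engineering and the date-ordered split. It is not my actual notebook code: the column names (`order_date`, `product_id`, `order_qty`), the helper names, and the tiny synthetic dataset are all assumptions for illustration. Holidays are left out because they need a calendar for the relevant country, and `product_id` would need encoding before it can be used as a feature. Because the dates are irregular, `shift(1)` here means "the previous recorded sale of that product", not necessarily the previous calendar day, and the difference is taken between past values only so that the target does not leak into the features.

```python
import numpy as np
import pandas as pd


def add_features(daily: pd.DataFrame) -> pd.DataFrame:
    """Add calendar, lag and difference features to per-product daily sales.

    Assumed (hypothetical) columns: order_date, product_id, order_qty.
    """
    df = daily.sort_values(["product_id", "order_date"]).copy()

    # Calendar features derived from the order date.
    df["week"] = df["order_date"].dt.isocalendar().week.astype(int)
    df["month"] = df["order_date"].dt.month
    df["is_weekend"] = (df["order_date"].dt.dayofweek >= 5).astype(int)

    # Lags are computed within each product so histories do not mix.
    by_product = df.groupby("product_id")["order_qty"]
    df["prev_sale"] = by_product.shift(1)
    df["prev_sale_2"] = by_product.shift(2)
    # First-order difference of past sales only (no target leakage).
    df["diff_1"] = df["prev_sale"] - df["prev_sale_2"]

    # Rows without enough history have NaN lags and are dropped.
    return df.dropna().sort_values("order_date").reset_index(drop=True)


def chronological_split(df, feature_cols, target_col, test_fraction=0.2):
    """Split on a date cutoff, without shuffling: later dates go to test."""
    dates = df["order_date"].drop_duplicates().sort_values().reset_index(drop=True)
    cutoff = dates.iloc[int(len(dates) * (1 - test_fraction))]
    train = df[df["order_date"] < cutoff]
    test = df[df["order_date"] >= cutoff]
    return (train[feature_cols], train[target_col],
            test[feature_cols], test[target_col])


# Tiny synthetic example with unevenly spaced dates and two products.
rng = np.random.default_rng(0)
dates = pd.to_datetime([
    "2021-01-01", "2021-01-02", "2021-01-05", "2021-01-06", "2021-01-09",
    "2021-01-10", "2021-01-12", "2021-01-15", "2021-01-16", "2021-01-20",
])
n = len(dates)
daily = pd.DataFrame({
    "order_date": dates.tolist() * 2,
    "product_id": ["A"] * n + ["B"] * n,
    "order_qty": rng.poisson(5, size=2 * n),
})

features = add_features(daily)
feature_cols = ["week", "month", "is_weekend", "prev_sale", "prev_sale_2", "diff_1"]
X_train, y_train, X_test, y_test = chronological_split(
    features, feature_cols, "order_qty"
)
print(len(X_train), "training rows,", len(X_test), "test rows")
```

Splitting on a date cutoff rather than on a row position keeps all products sold on the same date on the same side of the split.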

I am eager to hear suggestions and feedback from anyone.

Thank you for your patient reading.

I was able to improve the model further. The RMSE is now ~3.05.

It is lower than the mean, median, and standard deviation of y_test. The model can be improved further using cross-validation with the TimeSeriesSplit folding technique.
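For anyone who wants to try the TimeSeriesSplit idea, here is a rough sketch of how it could be wired up with the two regressors. The data is synthetic and the helper names (`rmse`, `time_series_cv_rmse`) are my own for illustration; I use n_estimators = 100 instead of 1000 only to keep the example fast. The rows must already be in chronological order, because TimeSeriesSplit always trains on earlier rows and tests on later ones.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


def time_series_cv_rmse(model, X, y, n_splits=5):
    """Return the RMSE of each fold of an expanding-window split.

    X and y are NumPy arrays whose rows are in chronological order.
    """
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model.fit(X[train_idx], y[train_idx])
        scores.append(rmse(y[test_idx], model.predict(X[test_idx])))
    return scores


# Synthetic stand-in for the real feature matrix and daily order quantities.
rng = np.random.default_rng(42)
n = 300
X = rng.normal(size=(n, 4))
y = 10 + 3 * X[:, 0] + rng.normal(scale=1.0, size=n)

models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(n_estimators=100, random_state=0),
}
cv_scores = {name: time_series_cv_rmse(m, X, y) for name, m in models.items()}

for name, scores in cv_scores.items():
    print(f"{name}: mean RMSE {np.mean(scores):.2f} over {len(scores)} folds")
print(f"std of y (RMSE of always predicting the mean): {np.std(y):.2f}")
```

The last line gives a reference point: the RMSE of always predicting the mean of a set of values equals their standard deviation, so an RMSE below the standard deviation of y_test means the model beats that constant guess.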