Forecasting retail store sales with deep learning using entity embeddings


Accurate retail sales demand forecasting is critical for optimal resource allocation, budget planning and other related retail tasks during the year. This problem is challenging because sales prediction depends on numerous external factors that can include weather, city population, unemployment, growth or marketing changes.

One challenge of modeling retail data is the need to make decisions based on limited history. Holidays and select major events come once a year, and so does the chance to see how strategic decisions impacted the bottom line. In addition, markdowns are known to affect sales.

State-of-the-art methods for handling these tasks often rely on a combination of univariate forecasting models and machine learning methods. Such models usually require extensive tuning to set seasonality and other parameters. They also require manual feature extraction and frequent retraining, which can become prohibitive when there are millions of time series to analyse.

In this paper we propose a novel end-to-end neural network architecture that outperforms current state-of-the-art sales forecasting methods on a public retail dataset. Our approach does not require complicated model ensembles and needs only minimal domain-specific engineering. This article discusses some key concepts on how we applied neural networks to structured retail data.


The dataset contains historical sales data for 45 stores located in different regions; each store contains a number of departments. The company also runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. The dataset can be found here.
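To make the holiday weighting concrete, here is a minimal sketch of a weighted error metric. It assumes the evaluation is a weighted mean absolute error, which the description above does not state explicitly; the function name `weighted_mae` and the sample numbers are illustrative only.

```python
def weighted_mae(actual, predicted, is_holiday, holiday_weight=5.0):
    """Weighted mean absolute error; holiday weeks count `holiday_weight` times."""
    weights = [holiday_weight if h else 1.0 for h in is_holiday]
    total_error = sum(w * abs(a - p) for w, a, p in zip(weights, actual, predicted))
    return total_error / sum(weights)


# Illustrative values only (not taken from the dataset).
actual = [100.0, 120.0, 300.0]
predicted = [110.0, 120.0, 250.0]
is_holiday = [False, False, True]

score = weighted_mae(actual, predicted, is_holiday)
print(score)
```

With this kind of weighting, a miss in a holiday week costs as much as the same miss repeated over five ordinary weeks, so a model is pushed to get the holiday peaks right.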

The dataset is divided into three tables: Stores, Features and Sales. The stores table contains anonymised information about the 45 stores, indicating the type and size of each store. The features table contains additional data related to the store, department, and regional activity for the given dates; it is described in Table 1. MarkDown data is only available after Nov 2011, and is not available for all stores all the time.

Name          Description
Store         the store number
Date          the week
Temperature   average temperature in the region
FuelPrice     cost of fuel in the region
MarkDown1-5   anonymised data related to promotional markdowns
CPI           the consumer price index
Unemployment  the unemployment rate
IsHoliday     whether the week is a special holiday week

Table 1. Features description

The sales table contains historical sales data from 2010-02 to 2012-11 and is described in Table 2.

Name         Description
Store        the store number
Dept         the department number
Date         the week
WeeklySales  sales for the given department in the given store
IsHoliday    whether the week is a special holiday week

Table 2. Sales data description
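Before modelling, the three tables have to be combined into one row per store, department and week. The sketch below shows one way to do that join in plain Python. The rows are made up, and the `Type` and `Size` column names for the stores table are assumptions based on the description above.

```python
# Tiny made-up rows that mirror the column names in Tables 1 and 2.
stores = [
    {"Store": 1, "Type": "A", "Size": 150000},
    {"Store": 2, "Type": "B", "Size": 90000},
]
features = [
    {"Store": 1, "Date": "2012-02-10", "Temperature": 48.0, "IsHoliday": True},
    {"Store": 2, "Date": "2012-02-10", "Temperature": 37.0, "IsHoliday": True},
]
sales = [
    {"Store": 1, "Dept": 1, "Date": "2012-02-10", "WeeklySales": 36988.5},
    {"Store": 1, "Dept": 2, "Date": "2012-02-10", "WeeklySales": 54000.0},
    {"Store": 2, "Dept": 1, "Date": "2012-02-10", "WeeklySales": 21000.0},
]

# Index the lookup tables by their join keys.
store_by_id = {row["Store"]: row for row in stores}
features_by_key = {(row["Store"], row["Date"]): row for row in features}

# Left join: keep every sales record and attach store and feature columns
# when a matching row exists.
training_rows = []
for sale in sales:
    row = dict(sale)
    row.update(store_by_id.get(sale["Store"], {}))
    row.update(features_by_key.get((sale["Store"], sale["Date"]), {}))
    training_rows.append(row)

print(training_rows[0])
```

Store-level columns are joined on `Store` alone, while the weekly features need the `(Store, Date)` pair as the key.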

Our Model

Our model is based on deep learning, a powerful class of machine learning algorithms that use artificial neural networks to learn and exploit patterns in data. Deep learning algorithms use multiple layers to progressively extract higher-level features from raw data, which reduces the amount of manual feature extraction that other machine learning methods need. The “deep” in “deep learning” refers to the number of layers through which the data is transformed; these successive transformations automatically extract important features from raw data.

This is the opposite of more traditional rule-based methods, where data analysis, feature extraction and rule creation all require manual input, which is usually a tedious process.

Categorical inputs are variables whose values are discrete categories (or types) with no inherent numeric relationship between them. With entity embeddings, each category (entity) is mapped to a vector, its embedding, in a new continuous space (hence the term entity embedding). Think of the dimensions of this space as different characteristics in the dataset. Applying this technique yields a hidden representation suited to our specific problem, learned by the neural network during the standard supervised training process. By mapping similar values close to each other in the embedding space, the model reveals patterns in the categorical variables that would otherwise be difficult to find. This means that we can find useful patterns without performing much feature engineering!
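As a rough illustration of how an embedding is stored and learned, the sketch below keeps one small vector per category in a table and updates it with gradient steps on a squared-error loss. The category names, the embedding size and the fixed output weights are all assumptions chosen to keep the example short; this is not the training code used for the model.

```python
import random

random.seed(0)

# Hypothetical categorical column: store type (values are illustrative).
categories = ["A", "B", "C"]
index_of = {cat: i for i, cat in enumerate(categories)}

embedding_dim = 2  # assumed size; in practice a tunable hyperparameter

# The embedding table: one trainable vector (row) per category.
embedding_table = [
    [random.uniform(-0.1, 0.1) for _ in range(embedding_dim)]
    for _ in categories
]


def embed(category):
    """Look up the vector that currently represents `category`."""
    return embedding_table[index_of[category]]


def sgd_step(category, target, weights, lr=0.1):
    """One squared-error gradient step for prediction = weights . embed(category).

    Only the row of the looked-up category is updated, which is how the
    embedding is learned as part of ordinary supervised training.
    """
    vec = embed(category)
    prediction = sum(w * v for w, v in zip(weights, vec))
    error = prediction - target
    for j in range(embedding_dim):
        vec[j] -= lr * 2.0 * error * weights[j]
    return error ** 2


weights = [1.0, -1.0]  # fixed output weights, to keep the sketch short
before_b = list(embed("B"))
losses = [sgd_step("A", target=1.0, weights=weights) for _ in range(50)]

print(losses[0], losses[-1])
```

Training on examples of category `A` moves only the vector for `A`; the vector for `B` stays where it was until a `B` example is seen.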

The core idea in our model is therefore the use of entity embeddings, that is, representing each categorical variable in a new set of learned dimensions, as shown in Figure 2.

Figure 2. Model architecture using entity embeddings.
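The general pattern behind such an architecture can be sketched as a forward pass: each categorical input is looked up in its own embedding table, the resulting vectors are concatenated with the continuous features, and the combined vector goes through dense layers. The layer sizes, the number of departments and the choice of a single hidden ReLU layer below are assumptions for illustration and are not taken from Figure 2.

```python
import random

random.seed(0)


def random_matrix(rows, cols):
    """Small random weights, standing in for trained parameters."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def dense(weights, bias, inputs):
    """Fully connected layer: weights @ inputs + bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


# 45 stores as in the dataset; every other size here is a placeholder.
n_stores, n_depts = 45, 100
store_dim, dept_dim, n_continuous, hidden = 4, 6, 3, 8

store_emb = random_matrix(n_stores, store_dim)
dept_emb = random_matrix(n_depts, dept_dim)
w1 = random_matrix(hidden, store_dim + dept_dim + n_continuous)
b1 = [0.0] * hidden
w2 = random_matrix(1, hidden)
b2 = [0.0]


def forward(store_idx, dept_idx, continuous):
    """Embed the categorical inputs, concatenate, and apply two dense layers."""
    x = store_emb[store_idx] + dept_emb[dept_idx] + list(continuous)
    h = relu(dense(w1, b1, x))
    return dense(w2, b2, h)[0]


# Store index 0, department index 12, and three scaled continuous features
# (for instance Temperature, FuelPrice and CPI).
prediction = forward(0, 12, [0.3, -1.2, 0.5])
print(prediction)
```

In a real model the embedding tables and dense weights would all be trained together by backpropagation, typically with a deep learning framework rather than hand-written layers.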

Entity embeddings have been shown to work well when fitting neural networks on structured data. For example, the winning solution in a Kaggle competition on predicting the distance of taxi rides used entity embeddings to deal with the categorical metadata of each ride. Similarly, the third-place solution to the task of predicting store sales for Rossmann drug stores used entity embeddings and a much less complicated approach than the first- and second-place solutions. More on entity embeddings in this paper.


Now, let us compare our deep learning model against some of the most popular machine learning algorithms to showcase its predictive accuracy. The metric we chose to evaluate the regression models is the root mean squared error (RMSE).
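For reference, RMSE is the square root of the average squared difference between actual and predicted values. A minimal implementation, with illustrative inputs rather than values from the dataset, could look like this:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("inputs must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only.
print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, and lower values are better.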

Model           RMSE
Linear          1.94
Random Forest   0.548
XGBoost         0.543
Neural Network  0.38

Table 3. RMSE of the different models, obtained using 5-fold cross-validation.
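A cross-validated score of this kind is obtained by splitting the data into five folds, holding each fold out in turn, and averaging the per-fold errors. The sketch below uses contiguous, unshuffled folds and a stand-in "model" that predicts the training mean. Both choices are assumptions for illustration, since the exact splitting procedure is not described here.

```python
import math


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Toy target values; the "model" simply predicts the mean of the training fold.
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

fold_scores = []
for train_idx, valid_idx in k_fold_indices(len(y), k=5):
    train_mean = sum(y[i] for i in train_idx) / len(train_idx)
    preds = [train_mean] * len(valid_idx)
    fold_scores.append(rmse([y[i] for i in valid_idx], preds))

cv_rmse = sum(fold_scores) / len(fold_scores)
print(cv_rmse)
```

Each observation is used for validation exactly once, so the averaged score reflects performance on data the model did not see during fitting.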

The embeddings we have created capture latent features with minimal feature engineering. Our model outperforms both XGBoost and Random Forest, improving performance by 42%, as seen in Table 3.

As shown in Figure 3, by using deep learning and embedding layers we can efficiently capture latent features that are difficult to engineer by hand, and the neural network model predicts the weekly sales accurately.

Figure 3. Real and forecasted weekly sales in number. Data for all the stores is shown.

Entity embeddings address the drawbacks of simple variable encodings such as one-hot encoding. One-hot encoding a variable with many categories produces very sparse vectors, which are computationally inefficient and make optimisation harder. One-hot vectors also treat all categories as equally different from one another, whereas embeddings carry information about the distance between categories. Because embeddings are learned during training, they can represent each category in a way that is tailored to the prediction task.
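The contrast with one-hot encoding can be checked on a toy example. The embedding values below are made up to stand in for learned ones, and the category names are hypothetical. The sketch shows that multiplying a one-hot vector by the embedding table is the same as a row lookup, and that only the embedding distances distinguish similar categories from dissimilar ones.

```python
import math

categories = ["A", "B", "C", "D"]


def one_hot(category):
    return [1.0 if c == category else 0.0 for c in categories]


# Hypothetical embedding table with made-up values, standing in for learned ones.
embedding_table = [
    [0.9, 0.1],   # A
    [0.8, 0.2],   # B
    [-0.5, 0.7],  # C
    [0.0, -1.0],  # D
]


def embed(category):
    return embedding_table[categories.index(category)]


def one_hot_times_table(category):
    """Multiply the one-hot row vector by the table; equals a plain row lookup."""
    vec = one_hot(category)
    dim = len(embedding_table[0])
    return [
        sum(vec[i] * embedding_table[i][j] for i in range(len(categories)))
        for j in range(dim)
    ]


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Every pair of distinct one-hot vectors is the same distance apart ...
d_onehot_ab = distance(one_hot("A"), one_hot("B"))
d_onehot_ad = distance(one_hot("A"), one_hot("D"))
# ... whereas embedding distances can differ between pairs.
d_emb_ab = distance(embed("A"), embed("B"))
d_emb_ad = distance(embed("A"), embed("D"))

print(d_onehot_ab, d_onehot_ad, d_emb_ab, d_emb_ad)
```

The lookup form also avoids ever materialising the sparse one-hot vector, which matters when a variable has thousands of categories.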


    How can LotusLabs help you?

    Building an AI system is clearly a complex undertaking. The right conditions must be in place to ensure that the system also works reliably in day-to-day operations, performing as planned. The factors that determine whether implementation is successful cover all levels of the retail business.

    At LotusLabs we are experts in Machine Learning and AI infrastructure. Our people work with your people, at all levels. Our methods help you find ways to put AI to work.

    You want to see AI drive value in every corner of your business. But how do you get started? And how do you get there before your competition? LotusLabs helps you define an AI Roadmap that contains your vision. With the roadmap ready, you can focus on projects with the highest return and least risk.

    Transform your business into an AI-driven enterprise, implementing machine learning models that solve complex business problems and drive real ROI on the path toward functioning AI-supported retail.