Click on the top ofzhisheng” to follow, star or pin to grow together

Flink goes from beginner to proficient Series of articles

Today’s recommendation system has higher and higher requirements for real-time, and the

process of real-time recommendation can be roughly summarized as follows: The recommendation system makes recommendations for the user’s request, and the user gives feedback on the recommendation results (purchase/click/leave, etc.), and the recommendation system makes new recommendations based on user feedback. There are two interesting points in this process:

  • which can be regarded as a process of continuous interaction and interaction between recommender systems and users.

  • Recommender systems need to respond quickly and promptly to user feedback.

These two points are achieved through reinforcement learning and Flink, respectively, and before that, we will understand some background concepts.

Reinforcement Learning: An Introduction begins:


we think about the nature of learning, the first thing that comes to mind may be learning through constant interaction with the environment. When a baby is playing, waving his arms, or looking around, no teacher teaches him, but it does perceive changes in its surroundings directly.

The main process of reinforcement learning is to build an agent that continuously learns in the process of interacting with the environment in order to obtain the maximum expected reward. It is a very versatile learning paradigm that can be used to model a wide variety of problems, such as games, robotics, autonomous driving, human-computer interaction, recommendations, health care, and more. The main difference from supervised learning is that reinforcement learning learns through trial-and-error based on delayed feedback, while supervised learning is learning with clear feedback at each step.

The following diagram reflects the process by which a recommender agent interacts with the environment. Here, the user is regarded as the environment, and the user’s behavior data is input into the agent as a state, and the agent makes corresponding actions accordingly, that is, recommends the corresponding items to the user, and the user’s response to this recommendation such as click/no-click, purchase/non-purchase, etc. can be regarded as a new round of rewards. It can be seen from here that recommendation can be considered a dynamic sequence decision-making process, the recommended items have an impact on the user, and the user’s feedback in turn affects the decision-making of the recommendation system itself, and this process will continue to form a sequence.

The word “decision” actually represents the characteristics of reinforcement learning itself. Imagine that when a person is making decisions, many times it is necessary to assess the rapidly changing situation and make quick choices accordingly, while on the other hand, the decision needs to consider long-term goals rather than just short-term benefits. These two points are precisely the problems about traditional recommendation methods that are mentioned in almost all papers using reinforcement learning as recommendation systems, that is, the process of treating recommendations as static predictions and focusing only on short-term benefits. Of course, the main reason for saying this in the paper is to highlight your own achievements, but the actual situation should be far from simple.

The ultimate goal of reinforcement learning is to learn a strategy π(a|s) to maximize the desired reward. Policy refers to how the agent decides the next action a according to the environmental state s, corresponding to the recommended scenario is to decide the next recommended item based on the user’s past behavior records. For how to obtain the optimal strategy through training, there are currently two mainstream methods: on-policy and off-policy. Unlike supervised learning, which requires manual data collection and labeling in advance, reinforcement learning data comes from constantly interacting with the environment and then updating the model with the collected data. So there are two parts related to the strategy in this process, one is to use the strategy when interacting with the environment, and the other is to update the policy when training. On-policy means that the policy of environment interaction and the policy updated during training are the same policy, and the corresponding off-policy is that different policies are used when interacting and updating. The figure on the left is on-policy and the lower figure is off-policy (the figure on the right is the offline method, which will be described later).

The idea of on-policy is more intuitive, which is equivalent to an agent learning while trial and error in the environment, but its main problem is low sample utilization and low training efficiency. After using a policy to interact with the environment to obtain data and then update the model, a new policy is generated, then the data obtained from the interaction of the old policy may not obey the conditional distribution of the new policy, so the data can no longer be used and will be discarded.

Off-policy alleviates this problem by storing the data collected by the previous policy through an experience replay buffer, from which the data is sampled for training. So why can the off-policy class method be trained on data generated by the old policy? Since the different data distribution causes the new and old data to be trained together, then adjust the data distribution to make it close, so the off-policy algorithm generally uses the idea of importance sampling to impose different weights on different data, which will be mentioned later when introducing YouTube’s recommendation system, and then it will be said.

So which reinforcement learning method is applicable to this article? This is actually not easy to say. I don’t have an environment to interact with, only a static dataset, so off-policy seems more suitable, but even off-policy methods usually need to interact with the environment to constantly produce new data for training. So the approach in this article belongs to batch reinforcement learning, or offline reinforcement learning, but I tend to use the word batch because offline and off-policy are easily confused. The figure on the right above shows batch (offline) reinforcement learning, which is characterized by collecting a batch of data at once and then using only this batch of data for training, and no longer interacting with the environment before the official deployment.

We know that deep learning has made great strides in the field of images and NLP in recent years, largely due to the explosion of computing power and data. However, for mainstream deep reinforcement learning algorithms, it is necessary to constantly interact with the environment to obtain data, which usually means collecting data while training. However, many domains are not qualified to interact with the environment as frequently as traditional reinforcement learning, there are reasons why the cost is too high or the safety is too low, and even cause ethical problems, such as autonomous driving and medical treatment. So at this time, people will naturally think that when training reinforcement learning, they also collect a bunch of fixed data, and then constantly reuse it, no longer collect new ones, and do miracles on fixed data sets like deep learning, is it feasible? Therefore, batch reinforcement learning has attracted more and more attention from academia and industry in recent years, and is widely regarded as an effective way to achieve large-scale application of reinforcement learning. The recommender system is very suitable for this model, because the direct online exploration interaction is too expensive and affects the user experience, but it is relatively easy to collect user behavior logs and the amount of data is large.



on the other hand, recommender system as a system, algorithms alone will definitely not work. As mentioned above, batch reinforcement learning does not need to interact with the environment, only the dataset can be trained, then after the training model is really online, it needs to interact with the environment, and this process needs to have an intermediate carrier for quickly obtaining information, cleaning the original data and converting it into a format that can be input into the model. In this article, we mainly use Flink for this preliminary process. The self-introduction on Flink’s website is “Stateful Computations over Data Streams”

In other words, as data flows in, it can save and access previous data and intermediate results, and calculate them together when certain conditions are reached. For our reinforcement learning model, we need to accumulate a certain amount of user behavior to be used as model input for recommendation, so we need to save the previous behavior data in Flink in real time, which requires Flink’s powerful state management function.

In addition, the deep learning framework used for offline training is PyTorch, which is not as convenient to deploy as Tensorflow, so the popular FastAPI in recent years is used as an API service, and after obtaining the features that meet the conditions in Flink, the service is directly called for inference, and the recommendation is generated and stored in the database, and the server can directly call the recommendation result from the database the next time the user requests.

The overall architecture is shown in the figure below, and the complete code and flow are shown in FlinkRL ( and DBRL (

The following introduces the three algorithms used, limited by space, here is only a general introduction to the principle, for details can refer to the original paper. Here is the main symbol table:


This method is based on YouTube’s 2018 paper Top-K Off-Policy Correction for a REINFORCE Recommender System. In this video, the authors claim that the method has seen the biggest increase in nearly two years, and I’m honestly a little skeptical. In the experimental part at the end of the paper, it is mentioned that this reinforcement learning model is only used as one of the many recall models, and then all the recalled items are recommended to the user after passing through an independent sorting module, and the article does not say what model this sorting module uses, so the space in it is relatively large.

The paper uses the oldest REINFORCE algorithm in the field of policy gradient and makes some changes to its specific business situation, here we will first look at the basic framework of REINFORCE.

Assuming that a stochastic strategy is executed, a trajectory generated by the interaction of agents in the environment is τ=(s0,a0,s1,a1,r1,...,sT−1,aT−1,sT,rT). In deep reinforcement learning, neural networks are generally used to parameterize the policy π, and multiple trajectories are generally sampled in the environment, so the expected total return of the strategy is:

where θ is the parameter of the neural network and R(τ) is the total return of the trajectory τ, because the trajectory is random, so we want to maximize the expected return to get the optimal strategy π∗:

The idea of REINFORCE, or the whole policy gradient, is the same as many algorithms in supervised learning, that is, to find the parameter θ by the gradient rise (falling) method, and treat the (1.1) formula as the objective function, then once there is a gradient, you can use this familiar formula for optimization:

It is difficult to find the gradient of J(πθ) directly, but with the policy gradient theorem we can get an approximate solution:

where Rt=∑|τ|t′=tγt′−tr(st′,at′), meaning that the final reward of the action taken at t moment is only related to the reward obtained later, not to the previous reward. Proofs of the policy gradient theorem are provided in Appendix A.

The original REINFORCE algorithm was on-policy, that is, the online interaction and the actual optimization of the same strategy, but the paper said that they interacted online with different strategies, or even a mixture of different strategies. THIS RESULTS IN INCONSISTENT DATA DISTRIBUTION, WHICH CAN GENERATE A HUGE BIAS IF YOU USE REINFORCE DIRECTLY. In the paper, the algorithm is changed to off-policy by importance sampling:

For the actual interaction strategy, the derivation of this formula can also be derived directly from the policy gradient, as detailed in Appendix B. After a series of trade-offs, the author believes that the following equation balances bias and variance more reasonably:

the main difference is that the sampling is along β trajectory is carried out, and the importance weights πθ(at|st)β(at|st) are added to each step t, and this weight is relatively easy to calculate.

Another problem considered in the paper is that there is only one action that has been considered before, that is, only one item is recommended, but in reality, k items are often recommended to the user at one time, and if k items are considered at the same time, the combination will explode. The paper assumes that the total reward for displaying k unique items at the same time is equal to the sum of the rewards for each item, which can be converted into optimization of individual items, and the authors say that this hypothesis is true on the premise that the user’s observation of each item in the recommendation list is independent.

YouTube published another paper in 2019 (Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology), which was mainly different from the mid-2018 method The on-policy SARSA algorithm is used in the sorting rather than recall phase. This paper also makes a certain assumption about the recommended k items: assuming that a user in a recommendation list will only consume one of the items, and if the user returns to the recommended list to consume the second item after consumption, this behavior will be treated as a separate event and recorded separately in the log. In fact, the essence of these two assumptions is that the user will only pay attention to one of them and not the others in the face of a list of k items, but in fact, many times users will be interested in more than one, but forget the remaining few after consuming one. The extreme case is that 10 are recommended that all are interested, and after consuming one, some things leave or fall into a continuous click cycle consumption, so that the original other 9 interested are treated as negative samples.

Here we can see that the fundamental reason why some assumptions must be made in both papers is that the algorithms used output discrete actions, that is, individual objects, but the recommended scenarios are not like general reinforcement learning applications that only need to output one action. So I have to make some seemingly awkward assumptions, and the two algorithms introduced later output continuous actions, and there will be another solution.

Next, following the above assumption, k are converted into a single item, and when combined with the importance weight, the equation (1.3) can be converted to:

is mainly used in the equation αθ(at|st) replaces the original πθ(at|st) because k items are independently sampled from the strategy πθ, then αθ(at|st)=1−(1−πθ(at|st))K represents the probability that item a will appear in the final list of k items at t moment, because (1−πθ(at|st))K represents the probability that K times have not been sampled.

The only difference between equation (1.3) and equation (1.4) is that there is an additional multiplier : ∂α(at|st)∂π(at|st)=K(1−πθ(at|st))K−1. So we just sample a trajectory, calculate πθ(a|s), β(a|s) and (these two are actually the probabilities of the softmax output) at each step t, and then combine the calculation of the discount return Rt=∑|τ|t′=tγt′−tr(st′,at′). After that, the algorithm can be implemented accordingly, and the final code is shown in


As far as the recommended scenarios are concerned, discrete actions are more natural ideas, and each action corresponds to each item. However, in reality, the number of items may be at least one million, which means that there is a lot of space for action and a high computational complexity in softmax. The above YouTube paper uses sampled softmax to alleviate this problem, and here we can change the way of thinking, let the strategy output a continuous vector a∈Rd, and then multiply a with the embedding point of each item to get a list of recommendations, and use efficient nearest neighbor search online to get the corresponding item. DDPG is a more general choice for continuous actions, and DDPG is used the most in the recommended papers I have read, such as two articles from [1][2], one from Ali[1], and one from Huawei [1].

DDPG, short for Deep Deterministic Policy Gradient, is an off-policy algorithm for continuous actions. UNLIKE REINFORCE ABOVE, IT USES A DETERMINISTIC STRATEGY, WHICH, AS THE NAME IMPLIES, PRODUCES A UNIQUE ACTION FOR THE SAME STATE S, SO HERE WE USE μ(S). And because it is a deterministic strategy, there is no probability distribution for action a, so there is no need for importance sampling.

DDPG adopts the Actor-Critic framework, with the Actor as a specific strategy, its input is state s, and its output action is a. Critic is used to evaluate the quality of a strategy, and its input is a vector stitched together from (s+a) and output as a scalar. Both Actor and Critic can be parameterized with neural networks, assuming that the Actor network is μ(s|θμ) and the Critic network is Q(s,a|θQ), then the objective functions and gradients of Actor and Critic are:

Then the core of the algorithm is to optimize these two objective functions by gradient ascent (descent) to obtain the final parameters, and then obtain the optimal strategy. DDPG other implementation details such as target network, soft update, etc. will not be repeated here, because we use a fixed data set, so as long as the data is converted into a format that the DDPG algorithm can input, and then divided into batch training like supervised learning, no need to interact with the environment, the final code sees


The full name of BCQ algorithm is Batch-Constrained Deep Q-Learning, from the paper Off-Policy Deep Reinforcement Learning without Exploration. BCQ can be seen as a modified version of DDPG in the batch (offline) scenario, as mentioned earlier, batch (offline) reinforcement learning refers to learning on a fixed data set and no longer interacting with the environment. The authors point out that the current popular off-policy algorithms such as DQN and DDPG may not work well under such conditions, mainly because of extrapolation errors.

Extrapolation error mainly stems from the inconsistency between the distribution of state s and action a combinations in the dataset and the distribution of state-action combinations in the current strategy, that is, the

sampling strategy is very different from the current strategy, which makes Critic’s estimation of the value function inaccurate, and then makes the learning invalid. The objective function of DDPG’s Critic network (changed some symbols) is an example above:

Because μ′ itself is a neural network, if μ′(s′) ends up producing an action that is not in the dataset, it is likely that Q′(s′,μ′(s′)) will not estimate the state-action combination accurately, and then a good strategy will not be learned. The figure below (source) shows that if action A is outside the distribution of behavioral strategy β, it is possible to overestimate the Q value, resulting in subsequent errors accumulating.

I did encounter a similar situation when actually training DDPG, and the loss of critic sometimes reached an exaggerated level of 1e8, no matter how much you adjust the learning rate. Later, it was found that the possible reason was initially to average multiple item vectors that the user had interacted with before as an expression of state s, but the averaged vector may not look like any single item vector, that is, far from the original data distribution. Then the action a output of μ′(s) is naturally far from the action in the dataset, so that the final Q value explodes due to the transmission of one ring after another, and later changed to the item vector is directly spliced.

In addition, the authors also mentioned that extrapolation errors are not considered in common off-policy algorithms such as DQN and DDPG, why are they effective in orthodox reinforcement learning tasks? Because these algorithms essentially use the training method of growing batch learning: after training offline with a batch of data for a period of time, the trained policy will still be used to collect new data in the environment for training, so that the sampled data is actually not much different from the existing policy, so the extrapolation error is negligible. However, in the case of full offline training, the dataset is likely to be collected using a completely different strategy, so the impact of extrapolation error is more significant.

So the crux of the question is how to avoid generating an inexplicable combination of states and actions that lead to extrapolation errors? The approach in the paper is to use a generative model to generate similar state-action combinations in the data set, and then use a perturbation model to add some noise to the generated actions to increase variety. Among them, the generation model uses variational auto-encoder (VAE), the perturbation model is an ordinary fully connected network, the input is the generated action a, the output is the new action in the range [−Φ, Φ], Φ is the maximum value of the action, for continuous actions we generally set an upper limit to avoid outputting too large an action value. Taken together, these two models can be viewed as the strategy used by the BCQ algorithm, i.e., Actors, and the Critic part is not much different from DDPG, see

Appendix A: Policy Gradient for the full code The theorem

follows from the equation (1.1) that the objective function is :

Appendix B: Importance Sampling sets


trajectory distribution of the actual interaction strategy β as Qφ(τ) and applies importance sampling to the (A.2) formula:

and appendix The derivation of A is the same.

Original address:


    > Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction
  • Minmin Chen, et al. Top-K Off-Policy Correction for a REINFORCE Recommender System
  • Xiangyu Zhao, et al. Deep Reinforcement Learning for List-wise Recommendations
  • Scott Fujimoto, et al. Off-Policy Deep Reinforcement Learning without Exploration
  • Sergey Levine, Aviral Kumar, et al. Offline Reinforcement Learning: Tutorial, Review,and Perspectives on Open Problems
  • Eugene Ie , Vihan Jain, et al. Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology

public number ( zhisheng) reply to Face, ClickHouse, ES, Flink, Spring, Keywords such as Java, Kafka, monitoring can view more keywords corresponding to the article.

Buy Me A Coffee