The Deep Q Network (DQN) is a value-based model.
In 2015, DeepMind developed a new type of Reinforcement Learning model that only took raw pixels as input.
The heart of the model was a Convolutional Neural Network (CNN).
The Deep Q Network (DQN) is still a value-based method, as it estimates the action value function in order to derive the policy; hence it cannot deal with continuous action spaces:
The update of the parameters is done using a temporal difference method.
The policy is obtained by taking the best action, i.e. the action that maximises the future returns. This is done in an \(\varepsilon\)-greedy fashion:
Here the best action is the action “jump”. With the \(\varepsilon\)-greedy method, the model chooses the action “jump” with probability \(1-\varepsilon\) and any other action with total probability \(\varepsilon\) (each with probability \(\varepsilon/(nb_{actions}-1)\)).
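For instance, a small helper implementing this rule could look like the following sketch (the function name and the default \(\varepsilon\) are arbitrary):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick the best action with probability 1 - epsilon, and each other
    action with probability epsilon / (nb_actions - 1)."""
    best = max(range(len(q_values)), key=q_values.__getitem__)
    if random.random() < epsilon:
        other_actions = [a for a in range(len(q_values)) if a != best]
        return random.choice(other_actions)
    return best
```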
To play the game of Pong, the DQN model takes a sequence of 4 frames as input (4 frames of 84×84 greyscale pixels) and outputs the expected return for each action; the policy is then to choose (in an \(\varepsilon\)-greedy fashion) the action that leads to the best return:
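As an illustration, a minimal sketch of such a network in PyTorch could look like this (the convolutional layer sizes follow the architecture reported in the 2015 Nature paper and should be treated as an assumption here):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 greyscale 84x84 frames to one expected return per action."""

    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),   # one Q-value per possible action
        )

    def forward(self, frames):           # frames: (batch, 4, 84, 84)
        return self.net(frames)

# q_values = QNetwork(n_actions=6)(torch.zeros(1, 4, 84, 84))  # shape (1, n_actions)
```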
The goal of the DQN is to update its parameters \(w\) to fit the action value function \(q_\pi^*\). The model interacts with the environment and takes advantage of the obtained information to update its network.
The update rule is the same as in temporal difference learning: the total return \(G_t\) is unobserved and is approximated by the observed reward plus the discounted return of the next state (estimated using our current DQN).
Let \(\Delta(w)\) be the update of the parameters:
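With \(\alpha\) denoting the learning rate, this update can be written as:
\[\Delta(w) = \alpha \left( R + \gamma \max_{a}\hat{q}(S',a,w) - \hat{q}(S,A,w) \right) \nabla_w \hat{q}(S,A,w)\]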
Note that DQN uses the best action for state \(s'\), as in Sarsamax.
Also note that this update rule is exactly the result of applying Gradient Descent to the Mean Square Error loss between the output of our DQN and \(q_\pi^*\), using \(R + \gamma \max_{a}\hat{q}(S',a,w)\) as an approximation of \(q_\pi^*\):
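Explicitly, the corresponding one-sample loss is
\[J(w) = \frac{1}{2}\left( R + \gamma \max_{a}\hat{q}(S',a,w) - \hat{q}(S,A,w) \right)^2\]
and taking a gradient descent step on \(J(w)\), treating the target term as a constant, gives back \(\Delta(w)\) above.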
In order to help the convergence of the network training, DQN uses a few tricks.
Experience replay separates the exploration part of the agent from the training part where the model is updated.
An experience is a tuple of 4 values \(\{S_t, A_t, R_{t+1}, S_{t+1}\}\) that can be used to update the model.
Experience replay uses the current model to interact with the environment (from a state, choose an action, observe the next reward and the next state) and stores these interactions in a replay buffer. After a number of time steps, the model can sample experiences from this replay buffer, in whatever order, to update its parameters. Once the model is updated, it goes back to interacting with the environment to store new experiences.
The agent stores the experience and uses it later to update the model.
Here is another visualisation of experience replay. The replay buffer collects different experiences, and each mini-batch is then composed of randomly chosen experiences from the replay buffer:
Experience replay avoids getting stuck in a configuration where one action gave a good reward and the agent therefore always chooses this action.
It also transforms the reinforcement learning problem into a series of supervised learning problems.
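As a rough sketch (the class name and capacity are arbitrary choices), a uniform replay buffer could look like this:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores experience tuples (S_t, A_t, R_{t+1}, S_{t+1}) and samples them at random."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # the oldest experiences are dropped when full

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # A mini-batch of experiences drawn uniformly, in whatever order they come out.
        return random.sample(self.buffer, batch_size)
```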
It is possible to prioritize the experience replay in order to select the experiences with a high error more often, in an AdaBoost-like fashion:
The different steps to use Prioritized Experience replay are:
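For illustration, a minimal proportional-prioritization buffer might look like the sketch below (the exponent `alpha` and the small constant `eps` follow the Prioritized Experience Replay paper; the class and method names are arbitrary, and the importance-sampling correction is omitted):

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Replay buffer where experiences with a high TD error are sampled more often."""

    def __init__(self, capacity=100_000, alpha=0.6, eps=1e-5):
        self.capacity = capacity
        self.alpha = alpha          # how strongly the TD error influences sampling
        self.eps = eps              # keeps every priority strictly positive
        self.buffer, self.priorities = [], []

    def add(self, experience, td_error):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(experience)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size=32):
        probs = np.array(self.priorities)
        probs = probs / probs.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=probs)
        return [self.buffer[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        # After the network update, refresh the priorities with the new TD errors.
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha
```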
The problem with our update rule is that the target value (which is an approximation of the true action value function \(q_\pi\)) depends on the same parameters as the network we want to update. We call this a moving target.
In order to avoid this problem and to fix the target, we use the parameters from the last iteration to estimate the return of the next state/action:
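With \(w^{-}\) denoting these frozen parameters, the update becomes:
\[\Delta(w) = \alpha \left( R + \gamma \max_{a}\hat{q}(S',a,w^{-}) - \hat{q}(S,A,w) \right) \nabla_w \hat{q}(S,A,w)\]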
DQN therefore maintains two networks, the ‘regular’ network and the ‘target’ network, which is a lagged copy of the regular network. It is well represented in this image:
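In code, maintaining and periodically synchronising the two networks could look like this sketch (it reuses the hypothetical `QNetwork` above; the synchronisation period is arbitrary):

```python
import copy

q_network = QNetwork(n_actions=6)            # the 'regular' network, trained at every update
target_network = copy.deepcopy(q_network)    # the 'target' network, a lagged copy

def maybe_sync_target(step, sync_every=10_000):
    # Every `sync_every` steps, copy the regular weights into the target network,
    # so the target used in the update stays fixed in between.
    if step % sync_every == 0:
        target_network.load_state_dict(q_network.state_dict())
```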
The approximation \(R + \gamma \max_{a}\hat{q}(S',a,w)\) of the true action value function \(q_\pi\) overestimates \(q_\pi\). We take the best possible action to build this proxy, but the value of each action is only an estimate; by taking the maximum of noisy estimates we overestimate the true \(q_\pi\).
The idea of Double Q-Learning is to estimate the best action using one network and to compute the associated return using a second network:
For the DQN model we already keep a second set of parameters \(w^{-}\) for the fixed target.
Hence we evaluate the best action using \(w\) and compute its return using \(w^{-}\):
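The target is then:
\[R + \gamma\, \hat{q}\big(S', \arg\max_{a} \hat{q}(S',a,w), w^{-}\big)\]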
The Dueling Network modifies the structure of the classic DQN: one part estimates the state value function, and the action value function is then defined as the state value function plus an adjustment (an advantage value) coming from the action.
Using this, all actions from a state share a common base value, which seems logical:
The advantage function \(A_\pi(s,a)\) is the advantage of an action \(a\) with respect to a state \(s\). It is the difference between the action value function \(Q_\pi(s,a)\) and the state value function \(V_\pi(s)\):
\[A_\pi(s,a) = Q_\pi(s,a) - V_\pi(s)\]
Equivalently:
\[Q_\pi(s,a) = V_\pi(s) + A_\pi(s,a)\]
which is used in the Dueling Network.
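As a rough sketch, a dueling head in PyTorch might look as follows (the layer sizes are assumptions; the advantage stream is centred by subtracting its mean, as done in the original Dueling Network paper to keep \(V\) and \(A\) identifiable):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Splits the network into a state-value stream V(s) and an advantage stream A(s, a)."""

    def __init__(self, in_features=512, n_actions=6):
        super().__init__()
        self.value = nn.Linear(in_features, 1)               # V(s)
        self.advantage = nn.Linear(in_features, n_actions)   # A(s, a)

    def forward(self, x):
        v = self.value(x)                   # shape: (batch, 1)
        a = self.advantage(x)               # shape: (batch, n_actions)
        # Q(s, a) = V(s) + A(s, a); the mean of A is subtracted for identifiability
        # (this centring comes from the original Dueling Network paper).
        return v + a - a.mean(dim=1, keepdim=True)
```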
See: