

Policy based methods

Policy based methods get rid of the estimation of the action value function and instead directly estimate the optimal policy.


Comparison with value based methods

Here is a comparison of value based methods and policy based methods.

Value based methods

Value based methods estimate the optimal action value function \(q_*(s, a)\).

They then derive the policy from it using the \(\varepsilon\)-greedy method for stochastic policies (or the greedy method for deterministic policies).
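For illustration, here is a minimal sketch of how a value based method turns its estimates into an \(\varepsilon\)-greedy policy, assuming a tabular action value function stored as a NumPy array (the example values are made up):

```python
import numpy as np

def epsilon_greedy_action(q_values, state, epsilon, rng):
    """Pick an action from Q estimates: a random action with probability epsilon,
    otherwise the action with the highest estimated return (greedy)."""
    n_actions = q_values.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(q_values[state]))    # exploit (greedy)

# Hypothetical example: 5 states, 3 actions.
rng = np.random.default_rng(0)
q_values = rng.normal(size=(5, 3))
action = epsilon_greedy_action(q_values, state=2, epsilon=0.1, rng=rng)
```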

Policy based methods

Policy based methods directly estimate the policy \(\pi(a \mid s)\).


Advantages

Continuous action spaces

As policy based methods do not compute an estimated return for each action, they generalize to continuous action spaces.

Value based methods rely on a discretisation of the action space to deal with continuous action spaces, which is very inefficient (in particular, choosing the best action among a large number of possible actions is costly).
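As a rough illustration (not from the original notes), a policy can be parameterised directly over a continuous action space, for example as a Gaussian distribution whose mean and standard deviation depend on learned parameters; the parameter names below are hypothetical:

```python
import numpy as np

def gaussian_policy_sample(state, theta, rng):
    """Sample a continuous action from a Gaussian policy.
    The mean is a linear function of the state, the standard deviation is a
    learned constant; no discretisation of the action space is needed."""
    mean = theta["weights"] @ state + theta["bias"]
    std = np.exp(theta["log_std"])  # the exponential keeps the standard deviation positive
    return rng.normal(mean, std)

rng = np.random.default_rng(0)
theta = {"weights": np.array([0.5, -0.2]), "bias": 0.1, "log_std": -1.0}
action = gaussian_policy_sample(np.array([1.0, 0.3]), theta, rng)  # a real-valued action
```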


Simplicity

It is more natural to directly estimate the best policy instead of deriving it from an estimated action value function.


Stochasticity

Policy based methods can produce a truly stochastic policy, instead of adding randomness on top of a deterministic policy with the \(\varepsilon\)-greedy method.

Compared to a DQN, which outputs an estimated return for each action, a policy based method directly outputs a probability for each action.
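To make the contrast concrete, here is a minimal sketch of the two kinds of output, assuming PyTorch (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

n_states, n_actions = 4, 2

# DQN-style network: one estimated return per action (no probabilities).
q_network = nn.Sequential(
    nn.Linear(n_states, 16), nn.ReLU(), nn.Linear(16, n_actions)
)

# Policy network: a probability for each action (a true stochastic policy).
policy_network = nn.Sequential(
    nn.Linear(n_states, 16), nn.ReLU(), nn.Linear(16, n_actions), nn.Softmax(dim=-1)
)

state = torch.rand(1, n_states)
print(q_network(state))       # estimated returns, e.g. tensor([[ 0.12, -0.05]])
print(policy_network(state))  # action probabilities that sum to 1, e.g. tensor([[0.55, 0.45]])
```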



REINFORCE: Monte-Carlo policy gradient

REINFORCE (Monte-Carlo policy gradient) is a policy gradient method.

Policy gradient methods are a subclass of policy based methods that search for the best policy using stochastic gradient ascent (similar to stochastic gradient descent, but used to find a maximum instead of a minimum).

The gradient ascent update rule is:

\[\theta = \theta + \alpha \nabla_\theta U(\theta)\]

Where \(\theta\) is the set of parameters of the policy, \(\alpha\) is the learning rate (step size) and \(U(\theta)\) is the expected return obtained by following the policy parameterised by \(\theta\).
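As a tiny illustration of the update rule itself, here is gradient ascent on a toy one-dimensional objective (not an actual expected return):

```python
def U(theta):
    """Toy concave objective with its maximum at theta = 3."""
    return -(theta - 3.0) ** 2

def grad_U(theta):
    """Analytical gradient of the toy objective."""
    return -2.0 * (theta - 3.0)

theta, alpha = 0.0, 0.1
for _ in range(100):
    theta = theta + alpha * grad_U(theta)  # gradient ASCENT: move along the gradient
print(theta)  # converges towards 3.0
```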


Notation

Policy gradient methods introduce \(\pi_\theta(a \mid s)\), a policy parameterised by a set of parameters \(\theta\) (for example the weights of a neural network).

Let's also introduce a trajectory \(\tau = (s_0, a_0, s_1, a_1, \dots)\) and its return \(R(\tau)\), the sum of the rewards collected along the trajectory.


Problem

The problem is hence to find the set of parameters \(\theta\) that maximises the expected return \(U(\theta)\):

\[\max_\theta U(\theta) \quad \text{with} \quad U(\theta) = \sum_\tau P(\tau; \theta) R(\tau)\]

where \(P(\tau; \theta)\) is the probability of the trajectory \(\tau\) when following the policy \(\pi_\theta\).




Algorithm

Here is a simplified description of the REINFORCE algorithm (see also the code sketch below):

1. Collect a trajectory \(\tau\) by running the current policy \(\pi_\theta\), and compute its return \(R(\tau)\).
2. Estimate the gradient of the expected return: \(\nabla_\theta U(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) R(\tau)\).
3. Update the parameters with gradient ascent: \(\theta \leftarrow \theta + \alpha \nabla_\theta U(\theta)\).
4. Repeat.
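A minimal sketch of this loop, assuming PyTorch and Gymnasium are available; CartPole-v1, the layer sizes and the learning rate are arbitrary illustrative choices:

```python
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
n_states = env.observation_space.shape[0]
n_actions = env.action_space.n

# Parameterised policy pi_theta(a | s): outputs a probability for each action.
policy = nn.Sequential(
    nn.Linear(n_states, 32), nn.ReLU(), nn.Linear(32, n_actions), nn.Softmax(dim=-1)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(500):
    # 1. Collect one trajectory with the current policy.
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # 2. Compute the return of the trajectory.
    R = sum(rewards)

    # 3. Gradient estimate: sum_t grad log pi_theta(a_t | s_t) * R(tau).
    #    Minimising the negative of this quantity is gradient ascent on U(theta).
    loss = -torch.stack(log_probs).sum() * R

    # 4. Update theta.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```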



Proof of the derivative

The proof is in the appendix: proof of REINFORCE gradient. It comes from this Medium blog post by Chris Yoon.

Alternative proofs can be found in this blog post by Lilian Weng or in Reinforcement Learning: An Introduction by Sutton and Barto.


Apart from the derivative, which is not obvious to obtain, this is a classic application of gradient ascent.


Limitations

Being a Monte-Carlo method, REINFORCE needs complete episodes and uses the full return \(R(\tau)\), so its gradient estimates have high variance and it is sample inefficient; it is also on-policy, so trajectories cannot be reused after a parameter update.


Resources

See: