The temporal difference (TD) method is a model-free learning method used in finite and stochastic (or deterministic) environments with one or more terminal states. The underlying model of the environment is unknown, i.e. the rewards \(r\) and the transition probabilities \(p(s',r \; \vert \; s,a)\) are unknown.
The main advantage compared to Monte-Carlo methods is that temporal difference methods do not require a full episode to update the state and action value functions.
In temporal difference methods the agent also interacts with the environment. For each state and action, the TD algorithm compares the received reward plus the expected return from the next state/action with the current estimate of the return from the current state/action, and it leverages this comparison to update the state or action value function.
Here is an illustration of temporal differences (applied to the action value function, i.e. the \(Q\)-table):
Here, the agent is in state \(s_0\) (rock) and takes action \(a_0\) (right). The current expected return of this state/action pair is +6. From state \(s_0\) and action \(a_0\), the agent lands in state \(s_1\) (brick; the outcomes of the actions are stochastic, so choosing the action ‘go right’ does not guarantee actually going right) and takes action \(a_1\) (with the SARSA algorithm we need to know the next action of the agent to leverage the \(Q\)-table). The reward from action \(a_0\) is -1 and the expected return of state \(s_1\) and action \(a_1\) is, from the \(Q\)-table, +8. The alternative estimate is thus \(-1+8=+7\), which is compared with the current estimate +6. Hence the new estimate will increase a little to take this latest information into account.
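To make the update concrete, here is the corresponding computation with the numbers of the illustration, assuming for instance \(\alpha = 0.1\) and \(\gamma = 1\) (both values are arbitrary choices for this example):
\[Q(s_0,a_0) \leftarrow Q(s_0,a_0) + \alpha \, [\, r + \gamma \, Q(s_1,a_1) - Q(s_0,a_0) \,] = 6 + 0.1 \times (-1 + 8 - 6) = 6.1\]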
Policy evaluation in the TD setting is done by leveraging the new estimate of the state value function. Here, only the state value function is required:
The TD(0) algorithm updates the state value function by comparing the current estimate for a state to an alternative estimate, constructed as the reward obtained by the agent plus the discounted value estimate of the next state. The hyperparameter \(\alpha\) controls the update speed.
\[\underbrace{V(s_t)}_{\text{New V}} = \underbrace{V(s_t)}_{\text{Current V}} + \alpha \; [ \overbrace{\underbrace{r_{t+1}}_{\text{Reward}} + \underbrace{\gamma V(s_{t+1})}_{\text{Discounted return}}}^{\text{Alternative estimate}} - \underbrace{V(s_t)}_{\text{Current V}} ]\]

Here a terminal state \(S_t\) and a number of episodes are used, but TD(0) can also be applied in environments without a terminal state since the update of the state value function is done after each action. In that case, a maximum number of actions would be set.
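As a rough sketch, TD(0) policy evaluation could be written as follows, assuming a simplified Gym-like environment exposing `reset()` and `step()` and a fixed policy given as a function from state to action (all names, signatures and hyperparameter values below are illustrative assumptions, not a reference implementation):

```python
import numpy as np

def td0_evaluation(env, policy, n_states, n_episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate the state value function of `policy` with TD(0).
    `env` is assumed to return (next_state, reward, done) from step()."""
    V = np.zeros(n_states)                          # current estimate of V
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                           # act with the policy being evaluated
            s_next, r, done = env.step(a)
            target = r + gamma * V[s_next] * (not done)   # alternative estimate (terminal value is 0)
            V[s] += alpha * (target - V[s])         # move the current estimate toward the target
            s = s_next
    return V
```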
Three different algorithms exist to estimate the action value function (\(Q\)-table):
As with the TD(0) algorithm, the next three algorithms use a terminal state \(S_t\) and a number of episodes, but it is possible to apply them in environments without terminal states since the update of the action value function is done after each action. In that case, a maximum number of actions would be set.
In Sarsa, the agent needs to choose the next action using the current policy.
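A minimal sketch of the Sarsa update rule, assuming a tabular \(Q\)-table stored as a NumPy array (all names, sizes and values are illustrative):

```python
import numpy as np

n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.99              # learning rate and discount factor (arbitrary values)
Q = np.zeros((n_states, n_actions))   # current Q-table

def sarsa_update(Q, s, a, r, s_next, a_next):
    """Move Q(s, a) toward r + gamma * Q(s', a'), where a' is the action
    actually chosen by the current policy in s'."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

# The situation of the illustration above (state/action indices are arbitrary).
Q[0, 1] = 6.0                         # current estimate of Q(s0, a0)
Q[2, 3] = 8.0                         # current estimate of Q(s1, a1)
sarsa_update(Q, s=0, a=1, r=-1.0, s_next=2, a_next=3)
print(Q[0, 1])                        # slightly above 6, as in the example
```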
In Sarsamax (also known as \(Q\)-learning), the algorithm uses the \(Q\)-table to choose the best possible return from the next state.
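A minimal sketch of the Sarsamax update rule under the same assumptions as above (names and values are illustrative):

```python
import numpy as np

n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def sarsamax_update(Q, s, a, r, s_next):
    """Move Q(s, a) toward r + gamma * max_a' Q(s', a'): the target uses the best
    action of the next state according to the current Q-table, regardless of the
    action the policy actually takes next."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```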
In Expected Sarsa, the agent uses the \(Q\)-table to compute the expected return of the next state, averaging over the next actions weighted by their probabilities under the policy.
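A minimal sketch of the Expected Sarsa update rule under the same assumptions (names and values are illustrative, `action_probs` stands for the policy's action probabilities in the next state):

```python
import numpy as np

n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def expected_sarsa_update(Q, s, a, r, s_next, action_probs):
    """Move Q(s, a) toward r + gamma * E[Q(s', .)], where the expectation averages
    the next action values weighted by the policy's probabilities in s'
    (for instance those of an epsilon-greedy policy)."""
    expected_q = np.dot(action_probs, Q[s_next])
    target = r + gamma * expected_q
    Q[s, a] += alpha * (target - Q[s, a])
```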
On-policy learning algorithms are algorithms whose training method replicates exactly the interactions of the policy with the environment.
Sarsa is an on-policy learning algorithm as it replicates the interactions with the environment in its training process: the probability of choosing any action in the training process is the same as under the distribution of the policy.
Off-policy learning algorithms are algorithms whose training method does not replicate exactly the interactions of the policy with the environment.
Sarsamax is an off-policy learning algorithm as it does not replicate the interactions with the environment in its training process. Sarsamax chooses the best action to build its alternative estimate, but the distribution of the policy is different and does not guarantee always choosing the best action (with the \(\varepsilon\)-greedy method).
The policy improvement is the same as in the Monte-Carlo method: it takes the \(\varepsilon\)-greedy policy given the current \(Q\)-table.
Once we know how to estimate the \(Q\)-table of a policy \(\pi\) and how to define an \(\varepsilon\)-greedy policy from \(\pi\), we can alternate these two steps to find an optimal policy \(\pi^{*}\).
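A minimal sketch of the \(\varepsilon\)-greedy action selection used in the improvement step, assuming the \(Q\)-table is stored as a NumPy array of shape (number of states, number of actions); the function name and parameters are illustrative:

```python
import numpy as np

def epsilon_greedy_action(Q, s, epsilon=0.1, rng=None):
    """With probability epsilon pick any action uniformly at random (exploration),
    otherwise pick the best action according to the current Q-table (exploitation)."""
    if rng is None:
        rng = np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))    # explore
    return int(np.argmax(Q[s]))                # exploit
```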
See: