The temporal difference (TD) method is a model-free learning method used in finite and stochastic (or deterministic) environments with one or more terminal states. The underlying model of the environment is unknown, i.e. the rewards \(r\) and the transition probabilities \(p(s',r \; \vert \; s,a)\) are unknown.
The main advantage compared to Monte-Carlo methods is that temporal difference methods do not require a full episode to update the state and action value functions.
In temporal difference methods the agent also interacts with the environment. For each state and action, the TD algorithm compares the received reward plus the expected return from the next state/action with the current estimate of the return from the current state/action, and it leverages this comparison to update the state or action value function.
Here is an illustration of temporal differences (applied to the action value function, i.e. the \(Q\)-table):
Here, the agent is in state \(s_0\) (rock) and takes action \(a_0\) (right). The current expected return of this state/action pair is +6. From state \(s_0\) and action \(a_0\), the agent lands in state \(s_1\) (brick; the outcomes of the actions are stochastic, so choosing the action ‘go right’ does not guarantee actually going right) and takes action \(a_1\) (with the SARSA algorithm we need to know the next action of the agent to leverage the \(Q\)-table). The reward from action \(a_0\) is -1 and the expected return of state \(s_1\) and action \(a_1\) is, from the \(Q\)-table, +8. The alternative estimate is thus \(-1+8=+7\), which is compared with the current estimate +6. Hence the new estimate will increase a little to take this latest information into account.
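To make the update concrete, here is the corresponding computation with the numbers of the illustration, assuming for instance \(\alpha = 0.1\) and \(\gamma = 1\) (both values are arbitrary choices for this example):
\[Q(s_0,a_0) \leftarrow Q(s_0,a_0) + \alpha \, [\, r + \gamma \, Q(s_1,a_1) - Q(s_0,a_0) \,] = 6 + 0.1 \times (-1 + 8 - 6) = 6.1\]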
Policy evaluation in the TD setting is done by leveraging the new estimate of the state value function. Here, only the state value function is required:
The TD(0) algorithm updates the state value function by comparing the current estimate for a state to an alternative estimate, constructed as the reward obtained by the agent plus the discounted value estimate of the next state. The hyperparameter \(\alpha\) controls the update speed.
\[\underbrace{V(s_t)}_{\text{New V}} = \underbrace{V(s_t)}_{\text{Current V}} + \alpha \; [ \overbrace{\underbrace{r_{t+1}}_{\text{Reward}} + \underbrace{\gamma V(s_{t+1})}_{\text{Discounted return}}}^{\text{Alternative estimate}} - \underbrace{V(s_t)}_{\text{Current V}} ]\]

Here a terminal state \(S_t\) and a number of episodes are used, but TD(0) can also be applied in environments without a terminal state since the update of the state value function is done after each action. In that case, a maximum number of actions would be set.
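As a rough sketch, TD(0) policy evaluation could be written as follows, assuming a simplified Gym-like environment exposing `reset()` and `step()` and a fixed policy given as a function from state to action (all names, signatures and hyperparameter values below are illustrative assumptions, not a reference implementation):

```python
import numpy as np

def td0_evaluation(env, policy, n_states, n_episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate the state value function of `policy` with TD(0).
    `env` is assumed to return (next_state, reward, done) from step()."""
    V = np.zeros(n_states)                          # current estimate of V
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                           # act with the policy being evaluated
            s_next, r, done = env.step(a)
            target = r + gamma * V[s_next] * (not done)   # alternative estimate (terminal value is 0)
            V[s] += alpha * (target - V[s])         # move the current estimate toward the target
            s = s_next
    return V
```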
Three different algorithms exist to estimate the action value function (\(Q\)-table):
As with the TD(0) algorithm, the next three algorithms use a terminal state \(S_t\) and a number of episodes, but it is possible to apply them in environments without terminal states since the update of the action value function is done after each action. In that case, a maximum number of actions would be set.
In Sarsa, the agent needs to choose the next action using the current policy.
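A minimal sketch of the Sarsa update rule, assuming a tabular \(Q\)-table stored as a NumPy array (all names, sizes and values are illustrative):

```python
import numpy as np

n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.99              # learning rate and discount factor (arbitrary values)
Q = np.zeros((n_states, n_actions))   # current Q-table

def sarsa_update(Q, s, a, r, s_next, a_next):
    """Move Q(s, a) toward r + gamma * Q(s', a'), where a' is the action
    actually chosen by the current policy in s'."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

# The situation of the illustration above (state/action indices are arbitrary).
Q[0, 1] = 6.0                         # current estimate of Q(s0, a0)
Q[2, 3] = 8.0                         # current estimate of Q(s1, a1)
sarsa_update(Q, s=0, a=1, r=-1.0, s_next=2, a_next=3)
print(Q[0, 1])                        # slightly above 6, as in the example
```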
In Sarsamax (also known as \(Q\)-learning), the algorithm uses the \(Q\)-table to choose the best possible return from the next state.
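A minimal sketch of the Sarsamax update rule under the same assumptions as above (names and values are illustrative):

```python
import numpy as np

n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def sarsamax_update(Q, s, a, r, s_next):
    """Move Q(s, a) toward r + gamma * max_a' Q(s', a'): the target uses the best
    action of the next state according to the current Q-table, regardless of the
    action the policy actually takes next."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```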
In Expected Sarsa, the agent uses the \(Q\)-table to compute the expected return of the next state, averaging over the next actions weighted by their probabilities under the policy.
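A minimal sketch of the Expected Sarsa update rule under the same assumptions (names and values are illustrative, `action_probs` stands for the policy's action probabilities in the next state):

```python
import numpy as np

n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def expected_sarsa_update(Q, s, a, r, s_next, action_probs):
    """Move Q(s, a) toward r + gamma * E[Q(s', .)], where the expectation averages
    the next action values weighted by the policy's probabilities in s'
    (for instance those of an epsilon-greedy policy)."""
    expected_q = np.dot(action_probs, Q[s_next])
    target = r + gamma * expected_q
    Q[s, a] += alpha * (target - Q[s, a])
```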
On-policy learning algorithms are algorithms whose training method replicates exactly the interactions of the policy with the environment.
Sarsa is an on-policy learning algorithm as it replicates the interactions with the environment in its training process: the probability of choosing any action in the training process is the same as under the distribution of the policy.
Off-policy learning algorithms are algorithms whose training method does not replicate exactly the interactions of the policy with the environment.
Sarsamax is an off-policy learning algorithm as it does not replicate the interactions with the environment in its training process. Sarsamax chooses the best action to build its alternative estimate, but the distribution of the policy is different and does not guarantee always choosing the best action (with the \(\varepsilon\)-greedy method).
The policy improvement is the same as in the Monte-Carlo method: it takes the \(\varepsilon\)-greedy policy given the current \(Q\)-table.
Once we know how to estimate the \(Q\)-table of a policy \(\pi\) and how to define an \(\varepsilon\)-greedy policy from \(\pi\), we can alternate these two steps to find an optimal policy \(\pi^{*}\).
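A minimal sketch of the \(\varepsilon\)-greedy action selection used in the improvement step, assuming the \(Q\)-table is stored as a NumPy array of shape (number of states, number of actions); the function name and parameters are illustrative:

```python
import numpy as np

def epsilon_greedy_action(Q, s, epsilon=0.1, rng=None):
    """With probability epsilon pick any action uniformly at random (exploration),
    otherwise pick the best action according to the current Q-table (exploitation)."""
    if rng is None:
        rng = np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))    # explore
    return int(np.argmax(Q[s]))                # exploit
```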
See: