The Monte-Carlo method is a model-free learning method used in finite, stochastic (or deterministic) environments with one or more final states. In this setting the underlying model of the environment is unknown, i.e. the rewards \(r\) and the transition probabilities \(p(s',r \; \vert \; s,a)\) are unknown.
Let’s introduce the notion of episode: a sequence of steps that leads to a final state (in Super Mario an episode would be an attempt at a level, with the victory or the defeat as the final state).
The goal of the agent is then to find the policy \(\pi\) that maximises:
\[\mathbb{E}\left[\sum_{t=1}^{T} r_t\right]\]

The idea of the Monte-Carlo method is simply to generate a large number of episodes in order to estimate the state and/or action value function using the law of large numbers. An important point is that the update of the state or action value function can be done either once all the episodes have been generated or at the end of every single episode.
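As a minimal sketch of this idea (assuming a gym-style environment object `env` exposing `reset()` and `step(action)` that returns the next state, the reward and a done flag, and a policy stored as a dictionary mapping states to actions; all names here are hypothetical), estimating the expected return of a policy simply amounts to averaging the returns of many generated episodes:

```python
def run_episode(env, policy):
    """Play one episode with the given policy and collect the rewards."""
    state, done, rewards = env.reset(), False, []
    while not done:
        action = policy[state]
        state, reward, done = env.step(action)
        rewards.append(reward)
    return rewards


def estimate_expected_return(env, policy, n_episodes=10_000):
    """Monte-Carlo estimate of E[r_1 + ... + r_T]: the average return over many episodes."""
    total = sum(sum(run_episode(env, policy)) for _ in range(n_episodes))
    return total / n_episodes
```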
Once the state and action value functions are obtained, we can improve the policy \(\pi\) in order to find the optimal policy \(\pi^{*}\).
Policy evaluation in the Monte-Carlo setup generates full episodes and takes advantage of knowing the whole episode to estimate the state-value function.
Here the return \(G_t\) can be computed because we know the whole episode and therefore have all the information about future rewards.
This method is called ‘First-visit MC’, as opposed to ‘Every-visit MC’:
For example, if during an episode the state 3 is visited twice with a first return of 2 and a second return of 6, ‘First-visit MC’ would give \(V(3)=2\) and ‘Every-visit MC’ would give \(V(3)=(2+6)/2=4\).
Switching from ‘First-visit MC’ to ‘Every-visit MC’ is easy: it is sufficient to record the return every time the agent visits the state \(s\) instead of only the first time.
The output state-value function is a Monte-Carlo estimate of the true state-value function.
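A minimal sketch of first-visit MC prediction, assuming no discounting (as in the formulas above) and a hypothetical `generate_episode()` function that plays one episode with the fixed policy and returns a list of `(state, reward)` pairs:

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, n_episodes=10_000):
    """Estimate the state-value function V(s) of a fixed policy with first-visit MC."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_episodes):
        episode = generate_episode()                # list of (state, reward) pairs
        states = [s for s, _ in episode]
        rewards = [r for _, r in episode]
        G = 0.0
        for t in reversed(range(len(episode))):     # walk backwards to accumulate G_t
            G += rewards[t]
            if states[t] not in states[:t]:         # first visit of this state only
                returns_sum[states[t]] += G
                returns_count[states[t]] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

Switching to ‘Every-visit MC’ would simply drop the `if states[t] not in states[:t]` test.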
Once the state-value function is known, new episodes are generated; using these episodes together with the state-value function, it is possible to estimate the action-value function:
As for MC policy evaluation, we use here ‘First-visit MC’ as opposed to ‘Every-visit MC’. Once again, switching from ‘First-visit MC’ to ‘Every-visit MC’ is easy: it is sufficient to record the return every time the agent takes action \(a\) from state \(s\) instead of only the first time.
The output action-value function is a Monte-Carlo estimate of the true action-value function. It is called the \(Q\)-table (a table whose rows are the states and whose columns are the actions).
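The \(Q\)-table can be estimated in almost the same way, here directly from the returns observed in the episodes; a minimal sketch, assuming a hypothetical `generate_episode()` that now returns `(state, action, reward)` triples:

```python
from collections import defaultdict

def first_visit_mc_q_estimation(generate_episode, n_episodes=10_000):
    """Estimate the action-value function Q(s, a) with first-visit MC."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_episodes):
        episode = generate_episode()                # list of (state, action, reward) triples
        pairs = [(s, a) for s, a, _ in episode]
        rewards = [r for _, _, r in episode]
        G = 0.0
        for t in reversed(range(len(episode))):
            G += rewards[t]
            if pairs[t] not in pairs[:t]:           # first visit of this (state, action) pair
                returns_sum[pairs[t]] += G
                returns_count[pairs[t]] += 1
    return {sa: returns_sum[sa] / returns_count[sa] for sa in returns_sum}
```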
Let’s introduce the \(\varepsilon\)-Greedy policy (greedy in this context means best) to define a better policy \(\pi'\) from a policy \(\pi\) and its action-value function (also called the \(Q\)-table).
We saw in the deterministic framework (part on Policy improvement) that, given a policy \(\pi\), it is possible to find a new policy \(\pi'\) such that \(\pi' \geq \pi\) by associating to each state the action that maximises the reward.
However, in the stochastic case the \(Q\)-table (action-value function) used to pick the best possible action is only an estimate of a stochastic quantity. Due to this randomness, an error in the estimation of the action-value function can lead us to always choose an action that is not optimal.
For example, imagine a state where action 1 gives a reward of 1 every time and action 2 gives a reward of 0 80% of the time and a reward of 1000 20% of the time. With a classic greedy policy we have a high probability of trying action 2 once, receiving a reward of 0, and thus deciding to always choose action 1 in the future, even though action 2 is better on average.
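Comparing the expected rewards of the two actions (writing \(a_1\) and \(a_2\) for actions 1 and 2) makes the problem explicit:

\[\mathbb{E}[r \mid a_1] = 1 \qquad\qquad \mathbb{E}[r \mid a_2] = 0.8 \times 0 + 0.2 \times 1000 = 200\]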
The \(\varepsilon\)-Greedy policy is a stochastic solution to this problem: it selects the current best action with probability \(1-\varepsilon\) and another action with probability \(\varepsilon\). This way there is always a chance to test the other actions again.
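A minimal sketch of this action selection, assuming the \(Q\)-table is stored as a nested dictionary `Q[state][action]` (a hypothetical layout):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Select the current best action with probability 1 - epsilon, another one otherwise."""
    best = max(actions, key=lambda a: Q[state][a])
    others = [a for a in actions if a != best]
    if others and random.random() < epsilon:
        return random.choice(others)    # exploration: try one of the other actions
    return best                         # exploitation: stick to the current best action
```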
The rest of the algorithm is similar to the deterministic Policy improvement.
Once we know how to estimate the \(Q\)-table of a policy \(\pi\) and how to define an \(\varepsilon\)-Greedy policy from \(\pi\), we can alternate these two steps to find an optimal policy \(\pi^{*}\).
The algorithm is still based on the Monte-Carlo principle of generating a large number of episodes.
Here a hyperparameter \(\alpha\) is set to determine the update speed of the \(Q\)-table. A high \(\alpha\) gives more weight to the most recently seen values. Another solution would be to set each state/action entry of the \(Q\)-table to the average of the past values of this state/action pair (i.e. replace \(\alpha\) by \(\frac{1}{N(S_t, A_t)}\)). However, giving more weight to recent values leads to better convergence, as more recent policies are closer to the optimal policy.
Here is an illustration of the impact of \(\alpha\): if \(\alpha=0\) there is no update, and if \(\alpha=1\) the new value relies only on the latest return:
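Assuming the update takes the usual incremental form \(Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \, (G_t - Q(S_t, A_t))\), the two extremes read:

\[\alpha = 0:\; Q(S_t, A_t) \leftarrow Q(S_t, A_t) \qquad\qquad \alpha = 1:\; Q(S_t, A_t) \leftarrow G_t\]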
\(\alpha\) is generally a small value in order to rely on the existing \(Q\)-table. It still gives more weight to the most recently seen values (\(\alpha \gt \frac{1}{N(S_t, A_t)}\) for a large number of samples).
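Putting the pieces together, a compact sketch of the control loop (same hypothetical gym-style `env` as above, no discounting; for simplicity this sketch updates the \(Q\)-table at every visit of a state/action pair):

```python
import random
from collections import defaultdict

def mc_control(env, actions, n_episodes=50_000, epsilon=0.1, alpha=0.05):
    """Monte-Carlo control with an epsilon-Greedy policy and a constant-alpha Q update."""
    Q = defaultdict(lambda: {a: 0.0 for a in actions})

    for _ in range(n_episodes):
        # 1. Generate one episode with the current epsilon-Greedy policy.
        episode = []
        state, done = env.reset(), False
        while not done:
            best = max(actions, key=lambda a: Q[state][a])
            others = [a for a in actions if a != best]
            action = random.choice(others) if others and random.random() < epsilon else best
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # 2. Update the Q-table backwards along the episode with the constant-alpha rule.
        G = 0.0
        for s, a, r in reversed(episode):
            G += r
            Q[s][a] += alpha * (G - Q[s][a])

    # The greedy policy derived from the final Q-table approximates the optimal policy.
    policy = {s: max(actions, key=lambda a: Q[s][a]) for s in Q}
    return Q, policy
```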
See: