Value-based methods are algorithms that estimate state and action value functions (\(V_\pi\) and \(Q_\pi\)) and derive the best possible policy from these value functions.
Since these functions are only approximations of the true (unobservable) state and action value functions, the policy derived from them will not be optimal in the environment unless the approximations match the true value functions exactly.
The process to obtain an optimal policy alternates between estimating the state and action value functions \(V_\pi\) and \(Q_\pi\) of the current policy \(\pi\) (at initialisation, \(\pi\) is generally a random or equiprobable policy) and updating the policy based on these new estimates of the value functions, as in the sketch below.
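To make this alternation concrete, here is a minimal policy-iteration sketch in Python. It assumes a small tabular environment whose dynamics are given as `P[s][a] = [(prob, next_state, reward, done), ...]` (a hypothetical layout, mirroring the convention of Gym-style toy environments); the function name and parameters are illustrative and not part of the original text.

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Alternate policy evaluation and policy improvement until the policy is stable.

    P[s][a] is assumed to be a list of (prob, next_state, reward, done) tuples
    (a hypothetical tabular-MDP layout; adapt it to your own environment).
    """
    # Start from an equiprobable (uniform random) policy, as described above.
    policy = np.ones((n_states, n_actions)) / n_actions
    V = np.zeros(n_states)

    while True:
        # --- Policy evaluation: estimate V_pi for the current policy ---
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = 0.0
                for a in range(n_actions):
                    for prob, s_next, reward, done in P[s][a]:
                        v_new += policy[s, a] * prob * (reward + gamma * V[s_next] * (not done))
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break

        # --- Policy improvement: act greedily w.r.t. Q_pi derived from V ---
        policy_stable = True
        for s in range(n_states):
            q = np.zeros(n_actions)
            for a in range(n_actions):
                for prob, s_next, reward, done in P[s][a]:
                    q[a] += prob * (reward + gamma * V[s_next] * (not done))
            new_row = np.eye(n_actions)[np.argmax(q)]
            if not np.array_equal(new_row, policy[s]):
                policy_stable = False
            policy[s] = new_row

        if policy_stable:
            return policy, V
```

Note that this particular sketch is model-based, since it reads the transition probabilities `P` directly; the model-free methods mentioned below would instead estimate \(Q_\pi\) from sampled interactions.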
Given a policy \(\pi\), we can write down its associated state and action value functions.
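In the standard discounted-return formulation (an assumption here, since the exact convention is not stated in this section), they are:

\[
V_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s\right],
\qquad
Q_\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s,\, A_t = a\right].
\]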
There exist two types of methods:
Model-based learning methods assume knowledge of the underlying model of the environment.
Model-free learning methods do not assume knowledge of the underlying model of the environment.