Revision



Definition

The EM (Expectation-Maximization) algorithm is an iterative optimisation method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables.

The EM iteration alternates between an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate of the parameters, and a maximisation (M) step, which computes parameters maximising the expected log-likelihood found in the E step.


Description of the optimization problem

The EM algorithm is used to maximise the likelihood (equivalently, the log-likelihood) of unknown parameters \(\theta\) in a statistical model where the observed values \(x=(x^{(1)}, \ldots, x^{(n)})\) depend on latent variables \(z=(z_1, \ldots, z_k)\):

\[\begin{eqnarray} \max_\theta l(x, \theta) &&= \max_\theta p(x \vert \theta) \\ &&= \max_\theta \sum_{j=1}^k p(x, z_j \vert \theta)\\ &&= \max_\theta \sum_{j=1}^k p(x \vert z_j; \theta) p(z_j \vert \theta) \end{eqnarray}\]

In the continuous case it is:

\[\begin{eqnarray} \max_\theta l(x, \theta) &&= \max_\theta p(x \vert \theta) \\ &&= \max_\theta \int p(x, z \vert \theta) dz \\ &&= \max_\theta \int p(x \vert z; \theta) p(z \vert \theta) dz \end{eqnarray}\]

If \(p(z \vert \theta)\) were known, the optimisation would be easy. However, in this setup \(p(z \vert \theta)\) is unknown, and the likelihood can't be maximised using classical optimisation algorithms.
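To make the discrete case concrete, here is a minimal numpy sketch (all names and values are illustrative) that evaluates this objective for a 1-D mixture of Gaussians. The sum over components sits inside the \(\log\), which is why there is no closed-form maximisation over \(\theta\):

```python
import numpy as np
from scipy.stats import norm

def gmm_log_likelihood(x, phi, mu, sigma):
    """Log-likelihood of the data x under a 1-D Gaussian mixture.

    phi[j] plays the role of p(z_j | theta); mu[j], sigma[j]
    parametrise p(x | z_j; theta). All names are illustrative.
    """
    # densities[j, i] = p(x_i | z_j; theta) * p(z_j | theta)
    densities = np.array([p * norm.pdf(x, m, s)
                          for p, m, s in zip(phi, mu, sigma)])
    # The sum over components sits inside the log: no closed-form argmax.
    return np.sum(np.log(densities.sum(axis=0)))

# Toy usage with two well-separated components
x = np.array([-2.1, -1.9, 2.0, 2.2])
print(gmm_log_likelihood(x, phi=[0.5, 0.5], mu=[-2.0, 2.0], sigma=[1.0, 1.0]))
```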


Algorithm

The idea is then to freeze the distribution of the latent variables (E-step) and then optimise the parameters given that fixed distribution (M-step). These two steps alternate until convergence.

The E-step gives a lower bound to the optimisation problem, which we maximise in the M-step. The new maximum is then used to build the next lower bound, and so on.


Expectation step

The expectation step defines \(Q(z)\), the current (and fixed) distribution of \(z\), as:

\[Q(z) = p(z \vert x; \theta)\]

Where \(p(z \vert x; \theta)\) is computed using the current parameters. For example, in a mixture of Gaussians, the probability \(p(z \vert x; \theta)\) depends on the parameters \(\phi_j\), \(\mu_j\) and \(\Sigma_j\) associated with each latent variable \(z_j\).

The probability is computed similarly to what is done in Gaussian Discriminant Analysis.
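As an illustration, here is a minimal numpy sketch of this E-step for the 1-D Gaussian-mixture example (function name and parametrisation are assumptions, matching the sketch above): each responsibility is the joint \(p(x^{(i)}, z_j; \theta)\) normalised over the components, exactly the Bayes-rule computation derived in the proof below.

```python
import numpy as np
from scipy.stats import norm

def e_step(x, phi, mu, sigma):
    """E-step: responsibilities Q_i(z_j) = p(z_j | x^{(i)}; theta).

    joint[i, j] = p(x^{(i)}, z_j; theta)
                = p(x^{(i)} | z_j; theta) * p(z_j | theta);
    dividing each row by its sum gives the posterior.
    """
    joint = np.column_stack([p * norm.pdf(x, m, s)
                             for p, m, s in zip(phi, mu, sigma)])
    return joint / joint.sum(axis=1, keepdims=True)
```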


Proof

For one \(x^{(i)}\) we want to maximise its log probability given the parameters \(\theta\).

We can express \(\log p(x^{(i)}; \theta)\) as:

\[\begin{eqnarray} \log p(x^{(i)}; \theta) &&= \log \sum_z p(x^{(i)}, z; \theta) \\ &&= \log \sum_z Q(z) \frac{p(x^{(i)}, z; \theta)}{Q(z)} \\ &&\geq \sum_z Q(z) \log \frac{p(x^{(i)}, z; \theta)}{Q(z)} \end{eqnarray}\]

The third line is obtained using Jensen's inequality applied to \(\log\) (a concave function), together with the probabilistic view of Jensen's inequality: since \(Q(z)\) is a probability distribution over \(z\), \(\sum_z Q(z) \frac{p(x^{(i)}, z; \theta)}{Q(z)}\) is the expectation \(\mathbb{E}_{z \sim Q}\left[\frac{p(x^{(i)}, z; \theta)}{Q(z)}\right]\).
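As a quick sanity check of this step, here is a small numpy experiment (the distribution and the ratios are arbitrary stand-ins) verifying that \(\log \mathbb{E}[X] \geq \mathbb{E}[\log X]\):

```python
import numpy as np

# Numeric check of Jensen's inequality for the concave log:
# log(E[X]) >= E[log X] for any positive random variable X.
rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(5))      # a probability distribution Q(z)
ratio = rng.uniform(0.1, 2.0, 5)   # stands in for p(x, z; theta) / Q(z)

log_of_expectation = np.log(np.sum(q * ratio))
expectation_of_log = np.sum(q * np.log(ratio))
assert log_of_expectation >= expectation_of_log
```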


Once this lower bound is obtained we need to choose a definition for \(Q(z)\). The idea is to make \(Q(z)\) match the current probability distribution of \(z\) given the current parameters \(\theta\).

In order for the inequality to hold with equality, we need \(\frac{p(x^{(i)}, z; \theta)}{Q(z)}\) to be a constant (if \(c\) is a constant then \(\mathbb{E}[f(c)]=f(\mathbb{E}[c])\)). So we need:

\[\frac{p(x^{(i)}, z; \theta)}{Q(z)} = c\]

This can easily be achieved by defining \(Q(z)\) such that:

\[Q(z) \propto p(x^{(i)}, z; \theta)\]

If \(Q(z) = k \cdot p(x^{(i)}, z; \theta)\) then:

\[\frac{p(x^{(i)}, z; \theta)}{Q(z)} = \frac{p(x^{(i)}, z; \theta)}{k \cdot p(x^{(i)}, z; \theta)} = \frac{1}{k}\]

If \(k = 1/c\) we obtain \(\frac{p(x^{(i)}, z; \theta)}{Q(z)} = c\).


Now, as we also need \(\sum_z Q(z)=1\), we normalise \(k \cdot p(x^{(i)}, z; \theta)\) by its sum over \(z\):

\[\begin{eqnarray} Q(z) &&= \frac{k \cdot p(x^{(i)}, z; \theta)}{\sum_w k \cdot p(x^{(i)}, w; \theta)}\\ &&= \frac{p(x^{(i)}, z; \theta)}{\sum_w p(x^{(i)}, w; \theta)}\\ &&= \frac{p(x^{(i)}, z; \theta)}{p(x^{(i)}; \theta)}\\ &&= p(z \vert x^{(i)}; \theta) \end{eqnarray}\]


We can show that with this definition of \(Q(z)\) the lower bound is tight: it equals the current value of our log-likelihood:

\[\begin{eqnarray} \sum_z Q(z) \log \frac{p(x^{(i)}, z; \theta)}{Q(z)} &&= \sum_z p(z \vert x^{(i)}; \theta) \log \frac{p(x^{(i)}, z; \theta)}{p(z \vert x^{(i)}; \theta)} \\ &&= \sum_z p(z \vert x^{(i)}; \theta) \log \frac{p(z \vert x^{(i)}; \theta) p(x^{(i)}; \theta)}{p(z \vert x^{(i)}; \theta)} \\ &&= \sum_z p(z \vert x^{(i)}; \theta) \log p(x^{(i)}; \theta) \\ &&= \log p(x^{(i)}; \theta) \sum_z p(z \vert x^{(i)}; \theta) \\ &&= \log p(x^{(i)}; \theta) \end{eqnarray}\]

Using \(P(A,B)=P(A \vert B)P(B)\) to pass from line 1 to 2, and the fact that \(\sum_z p(z \vert x^{(i)}; \theta)=1\) to pass from line 4 to 5.
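This tightness can also be checked numerically. A small sketch for a two-component 1-D mixture evaluated at a single point (all parameter values are illustrative):

```python
import numpy as np
from scipy.stats import norm

phi = np.array([0.3, 0.7])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])
x = 0.5

joint = phi * norm.pdf(x, mu, sigma)   # p(x, z_j; theta) for each component j
q = joint / joint.sum()                # Q(z) = p(z | x; theta)

elbo = np.sum(q * np.log(joint / q))   # sum_z Q(z) log p(x, z; theta) / Q(z)
log_lik = np.log(joint.sum())          # log p(x; theta)
print(elbo, log_lik)                   # the two values coincide
```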


We call this quantity the evidence lower bound (ELBO):

\[\text{ELBO}(x; Q, \theta) = \sum_z Q(z) \log \frac{p(x,z ; \theta)}{Q(z)}\]

The generalisation to every \(x^{(i)}\) follows, with equality since each \(Q_i^{(t)}(z) = p(z \vert x^{(i)}; \theta^{(t)})\):

\[l(\theta^{(t)})=\sum_{i=1}^n \text{ELBO}(x^{(i)}; Q_i^{(t)}, \theta^{(t)})\]


Maximisation step

The maximisation step is a classical maximisation of a log-likelihood function once the distribution \(Q(z)\) of the latent variables has been fixed:

\[\begin{eqnarray} l(\theta^{(t+1)}) && \geq \sum_{i=1}^n \text{ELBO}(x^{(i)}; Q_i^{(t)}, \theta^{(t+1)}) && \;\;\;\;\; \text{as Jensen's inequality can be applied to any } Q \text{ and } \theta \\ && \geq \sum_{i=1}^n \text{ELBO}(x^{(i)}; Q_i^{(t)}, \theta^{(t)}) && \;\;\;\;\; \text{as } \theta^{(t+1)} = \arg\max_{\theta} \sum_{i=1}^n \text{ELBO}(x^{(i)}; Q_i^{(t)}, \theta) \\ && = l(\theta^{(t)}) \end{eqnarray}\]

Hence the likelihood increases (or remains equal) at each step.

The optimisation of the parameters for a particular \(Q(z)\) is done similarly to what is done in Gaussian Discriminant Analysis.
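For the 1-D Gaussian-mixture example used above, this argmax has a closed form: weighted analogues of the usual maximum-likelihood estimates, with weights given by the responsibilities. A minimal numpy sketch, with shapes assumed to match the `e_step` sketch above:

```python
import numpy as np

def m_step(x, q):
    """M-step for a 1-D Gaussian mixture given responsibilities q of shape (n, k).

    Closed-form argmax of the ELBO with Q fixed: weighted versions of the
    usual maximum-likelihood estimates, with weights Q_i(z_j).
    """
    weights = q.sum(axis=0)                        # effective count per component
    phi = weights / len(x)                         # p(z_j | theta)
    mu = (q * x[:, None]).sum(axis=0) / weights    # weighted means
    var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / weights
    return phi, mu, np.sqrt(var)                   # sigma = sqrt of weighted variance
```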


Pseudo-code

Repeat until convergence (when the increase of the log-likelihood is less than a value \(\varepsilon\)) {

(E-step) For each \(i\), set \(Q_i(z) = p(z \vert x^{(i)}; \theta)\)

(M-step) Set \(\theta = \arg\max_{\theta} \sum_{i=1}^n \sum_z Q_i(z) \log \frac{p(x^{(i)}, z; \theta)}{Q_i(z)}\)

}
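Putting the two steps together for the 1-D Gaussian-mixture example, here is a minimal sketch of this loop (reusing the illustrative `e_step`, `m_step` and `gmm_log_likelihood` helpers sketched above; the stopping rule and iteration cap are assumptions):

```python
def em(x, phi, mu, sigma, eps=1e-6, max_iter=100):
    """Alternate E and M steps until the log-likelihood gain drops below eps."""
    prev = gmm_log_likelihood(x, phi, mu, sigma)
    for _ in range(max_iter):
        q = e_step(x, phi, mu, sigma)      # E-step: fix Q_i(z) = p(z | x^{(i)}; theta)
        phi, mu, sigma = m_step(x, q)      # M-step: maximise the ELBO over theta
        cur = gmm_log_likelihood(x, phi, mu, sigma)
        if cur - prev < eps:               # monotone increase, so this terminates
            break
        prev = cur
    return phi, mu, sigma
```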

