Sometimes the Gaussian distribution is not complex enough to accurately represent the input distribution. Figure 4(a) gives an example where the maximum likelihood estimate for a single Gaussian in 2 dimensions does not accurately capture the structure of the dataset.
Figure 4: Because the Gaussian distribution is unimodal, a single Gaussian is unable to capture the structure of this dataset; a mixture of Gaussians allows for a better fit. The red lines mark the points at $1$ and $2$ standard deviations from the mean of each Gaussian.
However, if we combine $K$ Gaussians such that each point from the dataset is sampled from one such Gaussian, the model becomes much more expressive as in Figure 4(b). In fact, given enough Gaussians, it becomes possible to approximate any distribution with arbitrary accuracy. This can be done with the introduction of a hidden variable $\mathbf{h}$ to represent the choice of a Gaussian. The model is then a latent variable model, i.e.
$$p(\mathbf{x}|\theta)=\sum_{\mathbf{h}}p(\mathbf{x}|\mathbf{h},\theta)p(\mathbf{h}|\theta)$$
where $\mathbf{h}$ is the random variable associated with the random choice of a Gaussian among the $K$ possible choices. In this example, we choose the one-hot representation for the variable $\mathbf{h}$, i.e. all possible values of the vector $\mathbf{h}$ are such that exactly one component is $1$ and all the others are $0$. We denote by $\mathbf{h}_{i}$ the value of $\mathbf{h}$ corresponding to the choice of the $i^{\text{th}}$ Gaussian. For instance, for $K=3$, the possible values are $\mathbf{h}_{1}=[1,0,0]$, $\mathbf{h}_{2}=[0,1,0]$, $\mathbf{h}_{3}=[0,0,1]$.
The probability of choosing the $i^{\text{th}}$ Gaussian is given by $p(\mathbf{h}=\mathbf{h}_{i}|\theta)=\pi_{i}$, where $\boldsymbol{\pi}$ is called the vector of mixing parameters. The parameters $\pi_{i}$ are non-negative and sum to $1$ to ensure that the probability distribution stays normalized.
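As a small illustration of this latent choice (a minimal NumPy sketch, with an arbitrary illustrative value for $\boldsymbol{\pi}$), a one-hot value of $\mathbf{h}$ can be drawn from the categorical distribution defined by $\boldsymbol{\pi}$ as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixing parameters for K = 3 Gaussians: non-negative and summing to 1.
pi = np.array([0.5, 0.3, 0.2])

# Choose the index i of a Gaussian with probability pi[i] ...
i = rng.choice(len(pi), p=pi)

# ... and encode the choice as a one-hot vector.
h = np.zeros(len(pi))
h[i] = 1.0
print(h)  # e.g. [1. 0. 0.]
```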
Once the Gaussian $i$ is chosen, the probability of sampling a vector $\mathbf{x}$ is simply given by
$$p(\mathbf{x}|\mathbf{h}=\mathbf{h}_{i},\theta)=\mathcal{N}(\boldsymbol{\mu}_{i},\boldsymbol{\Sigma}_{i})$$
where $\boldsymbol{\mu}_{i},\boldsymbol{\Sigma}_{i}$ are the mean and covariance matrix of the $i^{\text{th}}$ Gaussian distribution.
Recomposing the full probability distribution, we have
$$p(\mathbf{x}|\theta)=\sum_{i=1}^{K}\mathcal{N}(\boldsymbol{\mu}_{i},\boldsymbol{\Sigma}_{i})\pi_{i}$$
where the sum over the index $i$ covers all possible Gaussian assignments. The latent variable $\mathbf{h}$ is no longer visible in the above equation but is implicitly marginalized over with the index $i$. The parameter $\theta$ collects all the parameters of the model, i.e. $\theta=\{\boldsymbol{\mu}_{1},\dots,\boldsymbol{\mu}_{K},\boldsymbol{\Sigma}_{1},\dots,\boldsymbol{\Sigma}_{K},\pi_{1},\dots,\pi_{K}\}$.
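As a concrete sketch of this marginalization (using NumPy and SciPy's `multivariate_normal`; the parameter values below are arbitrary illustrations), the density $p(\mathbf{x}|\theta)$ can be evaluated by summing the $K$ weighted Gaussian densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters theta for a 2-dimensional mixture of K = 2 Gaussians.
pi = np.array([0.6, 0.4])                            # mixing parameters
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # means mu_i
sigmas = [np.eye(2), 0.5 * np.eye(2)]                # covariance matrices Sigma_i

def mixture_density(x):
    """p(x | theta): sum over the K Gaussians, weighted by the mixing parameters."""
    return sum(pi_i * multivariate_normal.pdf(x, mean=mu_i, cov=sigma_i)
               for pi_i, mu_i, sigma_i in zip(pi, mus, sigmas))

print(mixture_density(np.array([1.0, 1.0])))
```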
In the case of Gaussian mixtures, we cannot find a closed-form solution to the maximum likelihood problem. However, we can use the EM algorithm introduced above to find a local maximum of the likelihood. For the first step of EM, we must find $p(\mathbf{h}|\mathbf{x})$ to infer the hidden variables from a data instance $\mathbf{x}$. This conditional probability is given by Bayes’ rule, namely
$$p(\mathbf{h}=\mathbf{h}_{i}|\mathbf{x},\theta) = \frac{p(\mathbf{x}|\mathbf{h}=\mathbf{h}_{i},\theta)\,p(\mathbf{h}=\mathbf{h}_{i}|\theta)}{p(\mathbf{x}|\theta)} = \frac{\mathcal{N}(\boldsymbol{\mu}_{i},\boldsymbol{\Sigma}_{i})\,\pi_{i}}{\sum_{j=1}^{K}\mathcal{N}(\boldsymbol{\mu}_{j},\boldsymbol{\Sigma}_{j})\,\pi_{j}}.$$
This requires a sum over all $K$ Gaussians in the denominator, which can be costly when a large number of Gaussians are involved. Although the EM algorithm can be run with arbitrary mixing parameters $\pi_{i}$ and covariance matrices $\boldsymbol{\Sigma}_{i}$, the case where all $\pi_{i}$ are equal and where the covariance matrices are a multiple of the identity matrix, i.e. of the form $\boldsymbol{\Sigma}_{i}=\sigma I$ with $\sigma$ a constant common to all Gaussians, has an interesting interpretation. In this case, the most probable value of $\mathbf{h}$ for an observation $\mathbf{x}$ is $\mathbf{h}_{i}$ where $\boldsymbol{\mu}_{i}$ is the mean closest to $\mathbf{x}$.
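The following sketch (again with NumPy and SciPy, and illustrative parameter values) computes these posterior probabilities, often called responsibilities, and checks that with equal $\pi_{i}$ and $\boldsymbol{\Sigma}_{i}=\sigma I$ the most probable Gaussian is indeed the one with the closest mean:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(x, pi, mus, sigmas):
    """E-step: p(h = h_i | x, theta) for each i, computed by Bayes' rule."""
    weighted = np.array([pi_i * multivariate_normal.pdf(x, mean=mu_i, cov=sigma_i)
                         for pi_i, mu_i, sigma_i in zip(pi, mus, sigmas)])
    return weighted / weighted.sum()  # denominator: sum over all K Gaussians

# Equal mixing parameters and isotropic covariances sigma * I (illustrative values).
K, sigma = 3, 0.8
pi = np.full(K, 1.0 / K)
mus = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
sigmas = [sigma * np.eye(2)] * K

x = np.array([3.0, 1.0])
r = responsibilities(x, pi, mus, sigmas)
# The most probable Gaussian is the one whose mean is closest to x.
assert np.argmax(r) == np.argmin([np.linalg.norm(x - mu) for mu in mus])
```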
The second step of EM, i.e. the maximization of the likelihood of $p(\mathbf{x},\mathbf{h}=\mathbf{h}_{i})$ w.r.t. the mean parameters $\boldsymbol{\mu}_{i}$, simply consists in maximizing the likelihood of $\mathbf{x}$ under the $i^{\text{th}}$ Gaussian. The maximum likelihood estimate of the mean of a Gaussian is derived in the section Maximum Likelihood for the Gaussian and, restricted to the points assigned to the $i^{\text{th}}$ Gaussian, is given by
$$\boldsymbol{\mu}_{i}^{*}=\frac{1}{\left|\mathcal{D}_{i}\right|}\sum_{\mathbf{x}\in\mathcal{D}_{i}}\mathbf{x},$$
where $\mathcal{D}_{i}\subseteq\mathcal{D}$ denotes the set of points assigned to the $i^{\text{th}}$ Gaussian.
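A minimal NumPy sketch of this update, with a toy dataset and hard assignments of our own choosing:

```python
import numpy as np

def update_means(X, assignments, K):
    """M-step with hard assignments: mu_i* is the mean of the points assigned to Gaussian i."""
    return np.array([X[assignments == i].mean(axis=0) for i in range(K)])

# Toy data: four 2-dimensional points, already assigned to K = 2 Gaussians.
X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9]])
assignments = np.array([0, 0, 1, 1])
print(update_means(X, assignments, K=2))  # approximately [[0.1, 0.05], [4.05, 3.95]]
```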
In the special case of equal mixing parameters and isotropic Gaussians, the two steps above correspond exactly to the $K$-means algorithm, where a centroid $\mathbf{c}_{i}$ corresponds to the mean $\boldsymbol{\mu}_{i}$ of the $i^{\text{th}}$ Gaussian. The $K$-means algorithm therefore has a probabilistic interpretation as the application of the EM algorithm to a mixture of equiprobable Gaussians of equal isotropic variance.
Although EM can recover the $K$-means algorithm under specific conditions, EM is much more general and can be used with arbitrary mixing parameters and covariance matrices.
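Putting the two steps together for the special case above (equal mixing parameters, common isotropic covariance, hard assignments) yields the familiar $K$-means loop; the sketch below (NumPy, with illustrative data and our own function name) alternates the nearest-mean E-step with the averaging M-step:

```python
import numpy as np

def hard_em_isotropic(X, K, n_iters=20, seed=0):
    """Hard EM for equiprobable isotropic Gaussians, i.e. the K-means algorithm."""
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), size=K, replace=False)]  # initialize means from the data
    for _ in range(n_iters):
        # E-step: with equal pi_i and Sigma_i = sigma * I, the most probable
        # Gaussian for each point is the one with the closest mean.
        dists = np.linalg.norm(X[:, None, :] - mus[None, :, :], axis=-1)
        assignments = dists.argmin(axis=1)
        # M-step: each mean becomes the average of the points assigned to it
        # (unchanged if no point is assigned to it).
        mus = np.array([X[assignments == i].mean(axis=0) if np.any(assignments == i)
                        else mus[i] for i in range(K)])
    return mus, assignments

# Toy usage on two well-separated blobs of points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)), rng.normal(5.0, 1.0, size=(50, 2))])
mus, assignments = hard_em_isotropic(X, K=2)
print(mus)
```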
Next: Optimization revisited in the context of maximum likelihood