Example: Maximum likelihood for the Gaussian

Course completion
54%
$
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\dom}{dom}
\DeclareMathOperator{\sigm}{sigm}
\DeclareMathOperator{\softmax}{softmax}
\DeclareMathOperator{\sign}{sign}
$

Let us now, as an example, derive the maximum likelihood estimates for a $D$-dimensional Gaussian probability distribution. The goal is then to maximize the log-likelihood, i.e. to find
$$\boldsymbol{\mu}^{*}=\argmax_{\mu}\log p(\mathcal{D}|\boldsymbol{\mu},\boldsymbol{\Sigma})$$
and replacing the model by its definition:
$$\boldsymbol{\mu}^{*}=\argmax_{\mu}\log\prod_{\mathbf{x}\in\mathcal{D}}\frac{1}{(2\pi)^{D/2}\left|\boldsymbol{\Sigma}\right|^{1/2}}\exp\left\{ -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\} $$
The problem simplifies into
$$\boldsymbol{\mu}^{*}=\argmin_{\mu}\sum_{x\in D}(\mathbf{x}-\boldsymbol{\mu})^{2}$$
As seen in the optimization course, if the problem has a solution, it must satisfy the first order necessary condition, i.e. the first derivative of $\sum_{\mathbf{x}\in D}(\mathbf{x}-\boldsymbol{\mu})^{2}$ with respect to $\boldsymbol{\mu}$ must be $0$ at the optimum $\boldsymbol{\mu}^{*}$:
$$\frac{\partial\log p(\mathcal{D}|\boldsymbol{\mu},\boldsymbol{\Sigma})}{\partial\boldsymbol{\mu}}=\sum_{\mathbf{x}\in D}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})$$
and therefore

$$\sum_{\mathbf{x}\in D}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}^{*})=0$$
which is equivalent to
$$\boldsymbol{\mu}^{*}=\frac{1}{\left|\mathcal{D}\right|}\sum_{\mathbf{x}\in\mathcal{D}}\mathbf{x}$$

Therefore the maximum likelihood estimate for the mean parameter of a $D$-dimensional Gaussian is the arithmetic mean of the training samples.

A similar computation yields the maximum likelihood estimate for the covariance matrix:
$$\boldsymbol{\Sigma}^{*}=\frac{1}{\left|\mathcal{D}\right|}\sum_{\mathbf{x}}(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu}^{T})$$

Next: Example – Probabilistic polynomial regression