Maximum a-posteriori and maximum likelihood

$
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\dom}{dom}
\DeclareMathOperator{\sigm}{sigm}
\DeclareMathOperator{\softmax}{softmax}
\DeclareMathOperator{\sign}{sign}
$

When learning a distribution from a dataset $\mathcal{D}$, the goal is often to find the best possible value of a parameter $\theta$. One way to define "best" is to maximize the posterior distribution with respect to $\theta$, i.e., we are looking for $\theta^{*}$ such that
$$\theta^{*}=\argmax_{\theta}p(\theta|\mathcal{D}).$$
$\theta^{*}$ is then called the maximum a-posteriori estimate because it maximizes the posterior distribution.

By Bayes’ rule, the posterior is $p(\theta|\mathcal{D})=p(\mathcal{D}|\theta)p(\theta)/p(\mathcal{D})$. The evidence $p(\mathcal{D})=\sum_{\theta}p(\mathcal{D}|\theta)p(\theta)$ (marginalization rule; the sum becomes an integral for a continuous $\theta$) does not depend on the parameter $\theta$. The maximization problem is therefore equivalent to
$$\theta^{*}=\argmax_{\theta}p(\mathcal{D}|\theta)p(\theta)$$
where we take the likelihood and the prior into account as expected.
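
As a minimal sketch of this equivalence, the Python snippet below computes the maximum a-posteriori estimate on a discrete grid of candidate values for $\theta$. The coin-flip dataset, the grid `thetas`, and the prior shape $\theta(1-\theta)$ are illustrative assumptions, not part of the course material. It suggests that dividing by the evidence rescales every score by the same constant and so leaves the maximizer unchanged.

```python
# Hypothetical setup: a coin with unknown heads probability theta.
data = [1, 1, 0, 1, 1, 0, 1, 1]  # made-up observations, 1 = heads, 0 = tails

# Discrete grid of candidate parameter values (an assumption of this sketch).
thetas = [i / 100 for i in range(1, 100)]


def likelihood(theta, data):
    """p(D | theta) for iid Bernoulli observations."""
    p = 1.0
    for x in data:
        p *= theta if x == 1 else 1.0 - theta
    return p


# Assumed prior on the grid, proportional to theta * (1 - theta),
# normalized so that it sums to 1 over the candidate values.
weights = [t * (1.0 - t) for t in thetas]
prior = [w / sum(weights) for w in weights]

# Likelihood times prior for each candidate value.
joint = [likelihood(t, data) * p for t, p in zip(thetas, prior)]

# Evidence p(D): sum over theta of p(D | theta) p(theta).
evidence = sum(joint)
posterior = [j / evidence for j in joint]


def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=lambda i: values[i])


theta_map_from_joint = thetas[argmax(joint)]
theta_map_from_posterior = thetas[argmax(posterior)]

print(theta_map_from_joint, theta_map_from_posterior)
```

Both estimates coincide because `posterior` is just `joint` divided by a positive constant.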

In cases where there is no useful prior, the prior can be chosen to be uniform, so that it does not depend on $\theta$, i.e. $p(\theta)=\text{const}$. The optimization problem then simplifies further to:
$$\theta^{*}=\argmax_{\theta}p(\mathcal{D}|\theta).$$
$\theta^{*}$ is then called the maximum likelihood estimate and is the value of $\theta$ which maximizes the likelihood of the data under the model.
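
The following sketch illustrates this on a coin-flip example: with a uniform prior over a grid of candidate values, maximizing likelihood times prior gives the same answer as maximizing the likelihood alone. The dataset and the grid `thetas` are made up for illustration. For a Bernoulli model the maximum likelihood estimate is also known in closed form (the fraction of heads), which serves as a sanity check.

```python
# Hypothetical setup: a coin with unknown heads probability theta.
data = [1, 1, 0, 1, 1, 0, 1, 1]  # made-up observations, 1 = heads, 0 = tails

# Discrete grid of candidate parameter values (an assumption of this sketch).
thetas = [i / 100 for i in range(1, 100)]


def likelihood(theta, data):
    """p(D | theta) for iid Bernoulli observations."""
    p = 1.0
    for x in data:
        p *= theta if x == 1 else 1.0 - theta
    return p


def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=lambda i: values[i])


# Uniform prior on the grid: the same constant for every candidate value.
uniform_prior = 1.0 / len(thetas)

likelihood_scores = [likelihood(t, data) for t in thetas]
map_scores = [s * uniform_prior for s in likelihood_scores]

theta_mle = thetas[argmax(likelihood_scores)]
theta_map_uniform = thetas[argmax(map_scores)]

# Closed-form Bernoulli maximum likelihood estimate: fraction of heads.
theta_closed_form = sum(data) / len(data)

print(theta_mle, theta_map_uniform, theta_closed_form)
```

The grid search only recovers the closed-form value exactly because that value happens to lie on the grid; in general it returns the nearest candidate with the highest score.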

In practice it is often useful to maximize the log-likelihood $\log p(\mathcal{D}|\theta)$ instead of the likelihood itself (maximizing the log-likelihood was already proposed in the previous section as a way to minimize the KL-divergence with the empirical data distribution $p_{\mathcal{D}}$). The two optimization problems are equivalent because the logarithm is a monotonically increasing function; this equivalence holds whether or not the samples are iid. Additionally, when the dataset $\mathcal{D}$ is composed of iid samples, the likelihood decomposes as a product of point-wise probabilities, as in $p(\mathcal{D}|\theta)=\prod_{\mathbf{x}\in\mathcal{D}}p(\mathbf{x}|\theta)$. The logarithm then turns this product into a sum over the dataset, i.e. the log-likelihood of the dataset is equal to the sum of the point-wise log-likelihoods: $\log p(\mathcal{D}|\theta)=\sum_{\mathbf{x}\in\mathcal{D}}\log p(\mathbf{x}|\theta)$. Dividing this sum by the number of samples $|\mathcal{D}|$ gives the average log-likelihood, which has the same maximizer.
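
One practical reason to prefer the sum of logs is numerical: a product of many probabilities quickly underflows in floating-point arithmetic. The sketch below, using a made-up dataset of 2000 coin flips and an assumed grid `thetas`, compares the raw likelihood with the log-likelihood and checks that the sum and the average of the point-wise log-likelihoods differ only by the factor $|\mathcal{D}|$.

```python
import math

# Made-up dataset: 2000 iid coin flips, 75% heads (1 = heads, 0 = tails).
data = [1, 1, 0, 1] * 500


def likelihood(theta, data):
    """p(D | theta) as a plain product of point-wise probabilities."""
    p = 1.0
    for x in data:
        p *= theta if x == 1 else 1.0 - theta
    return p


def log_likelihood(theta, data):
    """log p(D | theta) as a sum of point-wise log-probabilities."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)


theta = 0.75
raw = likelihood(theta, data)      # too small for a double, underflows
ll = log_likelihood(theta, data)   # stays a finite negative number
mean_ll = ll / len(data)           # average log-likelihood per sample

# Grid search on the log-likelihood (the grid is an assumption of this sketch).
thetas = [i / 100 for i in range(1, 100)]
scores = [log_likelihood(t, data) for t in thetas]
theta_mle = thetas[max(range(len(thetas)), key=lambda i: scores[i])]

print(raw, ll, mean_ll, theta_mle)
```

With this dataset the raw product is no longer usable for comparing parameter values, while the log-likelihood still ranks them.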
