Optimization revisited in the context of maximum likelihood

We have now reviewed how solving a learning problem amounts to solving an optimization problem, sometimes with a probabilistic interpretation. We now return to optimization, with a special focus on the importance of choosing a metric.

An efficient optimization method must concentrate the search in regions of the search space that have a higher chance of containing an optimum. This is often done with an iterative approach in which the goal is, at each step, to make a small movement from the current position toward better values.

This kind of method depends on a metric to define neighborhoods of the search space in which the objective function is assumed to have small variations. Unfortunately, practitioners often choose the canonical metric, i.e. the Euclidean distance on $\mathbb{R}^{d}$ or the Hamming distance on $\{0,1\}^{d}$, regardless of its suitability to the problem under consideration.

We start by showing how the use of a canonical metric for log-likelihood maximization with gradient descent leads to an undesirable dependence on parametrization. We then present the natural gradient, which is simply the ordinary gradient in the Fisher metric and is specially adapted to moving in the space of probability distributions.

Gradient dependence on metrics and parametrization

In the previous chapters, we have often proposed the gradient descent algorithm to maximize the log-likelihood w.r.t. the parameters of a model. However, gradient descent can follow different trajectories depending on the parametrization of the model. More precisely, the gradient descent procedure is usually considered in parameter space, implicitly equipped with the Euclidean metric.

Consider the Gaussian family in one dimension with the usual parametrization, i.e.
$$p_{\mu,\sigma}(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left\{ -\frac{(x-\mu)^{2}}{2\sigma^{2}}\right\} $$
where the parameters are the mean $\mu$ and the variance $\sigma^{2}$. The matter is especially confusing with the Gaussian family because the variance is written as the standard deviation squared. Should we then take the derivative w.r.t. $\sigma^{2}$ or w.r.t. $\sigma$, and, more importantly, are the two alternatives equivalent? They are not:
$$\frac{\partial\log p_{\mu,\sigma}(x)}{\partial(\sigma^{2})} = \frac{\mu^{2}-2\mu x-\sigma^{2}+x^{2}}{2\sigma^{4}}$$
$$\frac{\partial\log p_{\mu,\sigma}(x)}{\partial\sigma} = \frac{\mu^{2}-2\mu x-\sigma^{2}+x^{2}}{\sigma^{3}}$$
with the partial derivative w.r.t. the parameter $\mu$ remaining unchanged.
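
To make the difference concrete, here is a minimal numerical sketch (assuming NumPy; the function names are purely illustrative) that evaluates both gradients of the log-likelihood at the same point. The $\mu$ components agree, while the components along the variance parameter differ by a factor $2\sigma$, as dictated by the chain rule.

```python
# Minimal sketch (NumPy assumed, names illustrative): the Euclidean gradient of the
# Gaussian log-likelihood under the (mu, sigma^2) and (mu, sigma) parametrizations.
import numpy as np

def grad_mu_var(x, mu, sigma):
    """Partial derivatives of log p_{mu,sigma}(x) w.r.t. (mu, sigma^2)."""
    d_mu = (x - mu) / sigma**2
    d_var = (mu**2 - 2 * mu * x - sigma**2 + x**2) / (2 * sigma**4)
    return np.array([d_mu, d_var])

def grad_mu_sigma(x, mu, sigma):
    """Partial derivatives of log p_{mu,sigma}(x) w.r.t. (mu, sigma)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = (mu**2 - 2 * mu * x - sigma**2 + x**2) / sigma**3
    return np.array([d_mu, d_sigma])

x, mu, sigma = 1.5, 0.0, 2.0
print(grad_mu_var(x, mu, sigma))    # [ 0.375     -0.0546875]
print(grad_mu_sigma(x, mu, sigma))  # [ 0.375     -0.21875  ]
# The first (mu) components agree; the second components differ by a factor 2*sigma,
# so a fixed-step gradient update moves by a different amount in each parametrization.
```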

This can be attributed to the Euclidean metric which is implicitly used in these computations. The gradient gives the direction of greatest increase for an infinitesimal movement $\delta\theta$ in parameter space such that $\left\Vert \delta\theta\right\Vert < \epsilon$. When we consider the parameters $\mu,\sigma^{2}$, the distance $\left\Vert \delta\theta\right\Vert =\sqrt{(\delta\mu)^{2}+(\delta\sigma^{2})^{2}}$ differs from the one obtained by considering the parameters $\mu,\sigma$, i.e. $\left\Vert \delta\theta\right\Vert =\sqrt{(\delta\mu)^{2}+(\delta\sigma)^{2}}$. Consequently, the two gradients both give a direction of greatest increase in log-likelihood, but they allow for movements of different amplitude in parameter space.

Although this problem of choosing a parametrization is especially clear with the choice of $\sigma$ or $\sigma^{2}$ as the variance parameter, it is important to realize that the Euclidean metric introduces a spurious connection between the amplitudes of gradient steps along different parameters. When we generalize to multi-dimensional Gaussians, these spurious connections typically favor Gaussians which are close to isotropic, because the Euclidean metric gives equal importance to every parameter of the covariance matrix.

In fact, every metric defined by a constant matrix in parameter space will favor one kind of Gaussian over another.

The natural gradient

A consequence of the previous section is that there are infinitely many possible gradients of the log-likelihood, each one corresponding to a different choice of metric. The natural gradient [Amari1998, Amari2000] is defined as the gradient in the Fisher information metric, which is a local approximation of the KL-divergence. As it relies on the KL-divergence, a common measure of difference between distributions, the natural gradient can be seen as moving on a manifold of probability distributions which is independent of any parametrization. Although the KL-divergence is not a metric, for infinitesimal movements $\delta\theta$ around $\theta$, the KL-divergence $d_{KL}(P_{\theta},P_{\theta+\delta\theta})$ can be approximated by its Hessian, the Fisher information metric. Because $\delta\theta=0$ is a global minimum of the KL-divergence, the first-order term of the expansion vanishes and the Hessian provides a second-order approximation. The Fisher metric is defined by the so-called Fisher matrix:
$$F_{ij} = \mathbb{E}_{\mathbf{x}\sim p_{\theta}}\left[\frac{\partial\log p_{\theta}(\mathbf{x})}{\partial\theta_{i}}\,\frac{\partial\log p_{\theta}(\mathbf{x})}{\partial\theta_{j}}\right]$$
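
As a concrete illustration, the Fisher matrix can be estimated by Monte Carlo as the empirical average of the outer product of the score (the gradient of the log-likelihood). The sketch below (assuming NumPy; names are illustrative) does this for the one-dimensional Gaussian parametrized by $(\mu,\sigma)$ and compares the estimate with the known closed form $\mathrm{diag}(1/\sigma^{2},2/\sigma^{2})$.

```python
# Minimal sketch (NumPy assumed, names illustrative): Monte Carlo estimate of the
# Fisher matrix F = E[ score(x) score(x)^T ] for a 1-D Gaussian with theta = (mu, sigma),
# using samples drawn from p_theta itself.
import numpy as np

def score(x, mu, sigma):
    """Gradient of log p_{mu,sigma}(x) w.r.t. (mu, sigma); works on arrays of x."""
    return np.array([(x - mu) / sigma**2,
                     -1.0 / sigma + (x - mu)**2 / sigma**3])

rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 2.0, 100_000
samples = rng.normal(mu, sigma, size=n)

s = score(samples, mu, sigma)      # shape (2, n)
F_mc = s @ s.T / n                 # empirical E[score score^T]
F_exact = np.diag([1 / sigma**2, 2 / sigma**2])

print(np.round(F_mc, 3))           # close to [[0.25, 0.], [0., 0.5]]
print(F_exact)
```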
Formally, the expression of a gradient $\nabla_{A}$ in some arbitrary metric $A$, for $A$ a symmetric positive definite matrix, is given by
$$\nabla_{A}=A^{-1}\nabla$$
The natural gradient $\tilde{\nabla}$ is then simply the gradient in the Fisher metric $F$, i.e.:
$$\tilde{\nabla}\log p_{\theta}(\mathbf{x})=F^{-1}\nabla\log p_{\theta}(\mathbf{x})$$

From this definition, the natural gradient corresponds to an infinitesimal movement in the space of distributions as opposed to a movement in parameter space. This makes the natural gradient invariant to parametrization because the metric measures a distance between the distributions themselves, independently of parameters. As a side effect, the invariance to parametrization often results in further invariances, depending on the probability distribution family, such as invariances w.r.t. rescaling, translation and rotation of $\mathbf{x}$ in the case of the multi-dimensional Gaussian.
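
The following sketch (assuming NumPy; names and the chosen learning rate are illustrative) performs one natural gradient update on the average log-likelihood of a data set, once in the $(\mu,\sigma)$ parametrization and once in the $(\mu,\sigma^{2})$ parametrization, using the closed-form Fisher matrices of the one-dimensional Gaussian. For a small step the two updates land on (approximately) the same distribution, whereas the plain gradient updates do not.

```python
# Minimal sketch (NumPy assumed, names illustrative): one natural-gradient step on the
# average log-likelihood of some data, in two parametrizations of the same Gaussian.
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(1.0, 3.0, size=500)   # observations to fit
mu, sigma = 0.0, 2.0                     # current model parameters
lr = 0.1                                 # learning rate

# Average gradients of the log-likelihood over the data.
g_mu = np.mean(data - mu) / sigma**2
g_sigma = np.mean((data - mu)**2) / sigma**3 - 1.0 / sigma    # w.r.t. sigma
g_var = np.mean((data - mu)**2 - sigma**2) / (2 * sigma**4)   # w.r.t. sigma^2

# Closed-form Fisher matrices of the 1-D Gaussian (diagonal in both cases).
F_sig = np.diag([1 / sigma**2, 2 / sigma**2])        # for (mu, sigma)
F_var = np.diag([1 / sigma**2, 1 / (2 * sigma**4)])  # for (mu, sigma^2)

# Natural gradient steps: theta <- theta + lr * F^{-1} grad.
step_sig = lr * np.linalg.solve(F_sig, np.array([g_mu, g_sigma]))
step_var = lr * np.linalg.solve(F_var, np.array([g_mu, g_var]))

sigma_a = sigma + step_sig[1]              # update sigma directly
sigma_b = np.sqrt(sigma**2 + step_var[1])  # update sigma^2, then convert back
print(sigma_a, sigma_b)                                      # nearly identical
print(sigma + lr * g_sigma, np.sqrt(sigma**2 + lr * g_var))  # plain gradients: they differ
```

Here the natural gradient rescales each component by the inverse Fisher information, which compensates for the change of variable between $\sigma$ and $\sigma^{2}$ up to first order in the learning rate; the $\mu$ update is identical in both cases.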

Interestingly, the step size or learning rate of a natural gradient update is a quantity in bits or nats and therefore measures a quantity of information. The learning rate can then be seen as measuring how much information each step should provide.

Although the natural gradient has many theoretical advantages, it has a computational disadvantage, at least with a naive implementation, because it requires computing and inverting a $\dim\theta\times\dim\theta$ matrix at each step.

The natural gradient gives a good example of the influence of choosing a suitable metric in the context of maximizing the likelihood of a model. It is also a reminder of the intricate link between learning and optimization, here in the case of an optimization procedure which is specially designed for learning distributions.
