Gradient descent

Course completion
55%
$
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\dom}{dom}
\DeclareMathOperator{\sigm}{sigm}
\DeclareMathOperator{\softmax}{softmax}
\DeclareMathOperator{\sign}{sign}
$

A common local optimization method is the gradient descent algorithm. The gradient $\nabla f(\mathbf{x})$ has the direction of greatest increase of the function $f$ at $\mathbf{x}$.

$$\frac{\nabla f(\mathbf{x})}{\left\Vert \nabla f(\mathbf{x})\right\Vert }=\lim_{\epsilon\rightarrow0}\argmax_{\mathbf{z},\left\Vert \mathbf{z}\right\Vert \leq1}f(\mathbf{x}+\epsilon\mathbf{z})$$

The gradient can be computed using the partial derivatives w.r.t. each component of the input vector $\mathbf{x}$ :
$$\nabla f(\mathbf{x})=\left(\frac{\partial f(\mathbf{x})}{\partial x_{1}},\frac{\partial f(\mathbf{x})}{\partial x_{2}},\dots,\frac{\partial f(\mathbf{x})}{\partial x_{D}}\right)$$
In gradient descent (see the definition of the algorithm below – Gradient ascent is defined identically except for a change of sign in the update), we start at some initial guess $\mathbf{x}_{0}$ and iteratively take small steps of size $\delta\hspace{-1pt}t$ in the direction of $-\nabla f(\mathbf{x}_{k})$. In practice it is common to stop the algorithm after a predefined number of steps or when a the objective function has not decreased for some time. In the limit of infinitesimal step size, there is a guarantee that the algorithm decreases the value of $f$ at each step, and a guarantee that the algorithm converges to a local minimum if it doesn’t encounter a saddle point at which $\nabla f(\mathbf{x}_{k})=0$. However, a bigger step size allows the algorithm to move faster in the domain of $f$, possibly leading to faster convergence when it does not lead to oscillations. The figure below gives an example of gradient descent trajectory towards a local minimum.

Algorithm (Gradient descent)
Inputs: $f$, a function.
$\delta\hspace{-1pt}t$, the step size.
Outputs: $\hat{\mathbf{x}}$, approximation of a global minimum.
Variables: $\mathbf{x}_{t}$, candidate solution of the algorithm at time t.
begin
repeat $K$ times:
    $\mathbf{x}_{t+1}:=\mathbf{x}_{t}-\delta\hspace{-1pt}t\nabla f(\mathbf{x}_{t})$
return last position $\hat{\mathbf{x}}:=\mathbf{x}_{tmax}$.
end


gradient-trajectory
Figure 3: Two possible gradient descent trajectories. In (a) the objective function is well behaved which allows the gradient to move smoothly towards the optimum. In (b) the gradient starts to oscillate as it falls into a narrow valley, thus converging more slowly.

Gradient descent often behaves poorly when the objective function has narrow valleys which cause oscillations. When confronted with such functions, a possible approach is to use $2^{\text{nd}}$ order information from the Hessian, e.g. using Newton’s method [Nocedal2006] or Hessian-Free optimization [Martens2010,Martens2011,Sutskever2011].

Surprisingly, gradient descent does not suffer from the curse of dimensionality and could in fact be considered to benefit from many dimensions. Common issues with gradient descent have to do with the gradient getting stuck in local minima and on plateaus where the derivative is zero. However, in spaces of high dimension, these issues are much less common because every dimension increases the possibility of finding a way out.

Nonetheless, the gradient descent algorithm depends on the possibility to compute the partial derivatives at each step. This is only possible when an explicit formula is available for the objective function, which is not always the case.

Next: Black-box optimization and Stochastic optimization