Estimation of Distribution Algorithms (EDAs)

$
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\dom}{dom}
\DeclareMathOperator{\sigm}{sigm}
\DeclareMathOperator{\softmax}{softmax}
\DeclareMathOperator{\sign}{sign}
$

EDAs are mathematically principled evolutionary algorithms that do away with the biological analogy. As the name suggests, EDAs are based on an unsupervised learning topic, density estimation, which will be reviewed thoroughly in the machine learning section. EDAs achieve state-of-the-art performance in the black-box optimization setting, where the goal is to optimize a function $f$ without any knowledge of how $f$ computes its values.

EDAs propose to represent individuals as samples from a probability distribution: the so-called proposal distribution. A population is then a set of i.i.d. samples from this distribution. In the preceding section, we saw that the mutation and cross-over operators served to define small possible movements around a current population. The EDA approach has the advantage of transforming the problem of moving towards better populations in the input space (which may not be well behaved) into a proxy problem which is usually well behaved: moving towards better proposal distributions.

Formally, the algorithm generates a new population by taking $\mu$ samples from a proposal distribution $P_{\theta^{t}}(\mathbf{x})$. These $\mu$ individuals are then ranked according to their $f$-values, and the $\sigma$ best individuals are used to update the parameters of the proposal distribution with a log-likelihood gradient ascent step, i.e.
$$\theta^{t+1}=\theta^{t}+\delta\hspace{-1pt}t\sum_{i=1}^{\sigma}\nabla_{\theta}\log P_{\theta^{t}}\left(\mathbf{x}^{(i)}\right)$$
where $\delta\hspace{-1pt}t$ is the step size of the gradient ascent update and $\mathbf{x}^{(1)},\dots,\mathbf{x}^{(\sigma)}$ are the selected individuals. The algorithm can then run for a number of steps, until a satisfactory solution is found. The figure below gives an example of an EDA update for a Gaussian proposal distribution.
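To make this update concrete, here is a minimal Python sketch of one EDA generation (an illustration, not part of the course material). It assumes an isotropic Gaussian proposal with a fixed standard deviation, in which only the mean is learned; the `eda_step` function and its parameter names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def eda_step(f, mean, std, n_samples=50, n_select=10, dt=0.02):
    """One EDA generation with an isotropic Gaussian proposal N(mean, std^2 I).

    Only the mean is learned here; for this proposal the log-likelihood
    gradient w.r.t. the mean is (x - mean) / std^2, so the ascent step
    pulls the mean towards the selected samples.
    """
    xs = rng.normal(mean, std, size=(n_samples, len(mean)))  # draw the population
    best = xs[np.argsort([f(x) for x in xs])[-n_select:]]    # keep the n_select best (maximization)
    grad = ((best - mean) / std**2).sum(axis=0)              # sum of per-sample log-likelihood gradients
    return mean + dt * grad                                  # one gradient ascent step on the mean

# Toy run: f is maximal at x = (3, 3), so the proposal mean should drift there.
f = lambda x: -np.sum((x - 3.0) ** 2)
mean = np.zeros(2)
for _ in range(200):
    mean = eda_step(f, mean, std=0.5)
print(mean)  # approximately [3, 3]
```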

Although the purpose of the algorithm is to maximize $\mathbb{E}\left[f(\mathbf{x})\right]$, it actually maximizes a surrogate objective, $\mathbb{E}\left[w(\mathbf{x})\right]$, where the weight function $w(\mathbf{x})$ equals $1$ for the $\sigma$ best individuals and $0$ otherwise. This has the advantage of making the approach invariant w.r.t. monotone transformations of the objective function $f$.

The maximization of the surrogate objective $\mathbb{E}\left[w(\mathbf{x})\right]$ is done with gradient ascent:
$$\nabla_{\theta}\,\mathbb{E}\left[w(\mathbf{x})\right] = \nabla_{\theta}\int_{\dom f}w(\mathbf{x})P_{\theta^{t}}(\mathbf{x})\,d\mathbf{x}$$
$$ = \int_{\dom f}w(\mathbf{x})\nabla_{\theta}P_{\theta^{t}}(\mathbf{x})\,d\mathbf{x}$$
$$ = \int_{\dom f}\nabla_{\theta}\log P_{\theta^{t}}(\mathbf{x})\,w(\mathbf{x})P_{\theta^{t}}(\mathbf{x})\,d\mathbf{x}$$
where the last step uses the identity $\nabla_{\theta}P_{\theta^{t}}(\mathbf{x})=P_{\theta^{t}}(\mathbf{x})\nabla_{\theta}\log P_{\theta^{t}}(\mathbf{x})$. Taking $\sigma$ samples from $w(\mathbf{x})P_{\theta^{t}}(\mathbf{x})$ can then be done by taking samples from $P_{\theta^{t}}(\mathbf{x})$ and keeping only the $\sigma$ best according to $f$.
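This identity can be checked numerically. The short Python snippet below (an illustration, not from the course) estimates $\nabla_{\theta}\mathbb{E}\left[w(\mathbf{x})\right]$ with the Monte Carlo average $\frac{1}{N}\sum_{i}w(\mathbf{x}^{(i)})\nabla_{\theta}\log P_{\theta}(\mathbf{x}^{(i)})$ for a one-dimensional Gaussian proposal and an indicator weight, and compares it to the closed-form derivative.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_estimate(theta, n=1_000_000):
    """Score-function estimate of grad_theta E[w(x)] for x ~ N(theta, 1),
    with the selection-style indicator weight w(x) = 1 if x > 0 else 0."""
    x = rng.normal(theta, 1.0, n)
    w = (x > 0).astype(float)   # w(x): 1 for 'selected' samples, 0 otherwise
    score = x - theta           # grad_theta log N(x; theta, 1) = x - theta
    return np.mean(w * score)   # Monte Carlo average of w(x) * grad log P(x)

# Here E[w(x)] = P(x > 0) = Phi(theta), whose derivative is the standard
# normal density phi(theta), so both printed values should nearly agree.
theta = 0.3
print(grad_estimate(theta))                        # ~ 0.381
print(np.exp(-theta**2 / 2) / np.sqrt(2 * np.pi))  # phi(0.3) ≈ 0.381
```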

This general framework can be adapted to optimize functions in $\mathbb{R}^{D}$, e.g. with CMA-ES [Hansen2008], or in discrete spaces such as $\{0,1\}^{D}$, with PBIL [Baluja1994]. It has the advantage of allowing a move towards several candidate solutions at once, contrary to methods such as hill climbing. The PBIL algorithm for optimization over $\{0,1\}^{D}$ is given below.


Figure 5: The steps of an EDA given a fitness function (a) and a Gaussian proposal distribution at time $t$ (b). First, samples are drawn from the proposal distribution (c). The $\sigma$ best samples are then selected according to their $f$-values (d). Finally, the likelihood of the selected samples w.r.t. the parameters (e) can be increased with a step of log-likelihood gradient ascent leading to a new proposal distribution at time $t+1$ (f).

Algorithm (PBIL)
Inputs: $f$, a function with $\dom f=\{0,1\}^{D}$.
$N$, the number of proposal samples (playing the role of $\mu$ above).
$N_{s}$, the number of samples to select at each step (playing the role of $\sigma$ above).
$\delta\hspace{-1pt}t$, the step-size.
$m$, the probability of a mutation.
$\alpha$, the mutation shift.
Outputs: $\hat{\mathbf{x}}$, an approximation of a global maximum.
Variables: $(p_{t,1},p_{t,2},\dots,p_{t,D})$, the parameters of an independent Bernoulli proposal distribution at time $t$: $P_{\theta^{t}}(\mathbf{x})=p_{t,1}^{x_{1}}(1-p_{t,1})^{(1-x_{1})}\times\dots\times p_{t,D}^{x_{D}}(1-p_{t,D})^{(1-x_{D})}$ for $p_{t,i}$ the probability that $x_{i}=1$ at time $t$.
$\mathbf{x}^{(1)},\dots,\mathbf{x}^{(N)}$, the proposal samples at time $t$, not to be confused with $x_{1},\dots,x_{D}$ the components of a vector $\mathbf{x}$.
begin
$t:=0$
$p_{0,1},\dots,p_{0,D}:=\frac{1}{2},\dots,\frac{1}{2}$
until satisfied:
    $\mathbf{x}^{(1)}\sim P_{\theta^{t}}(\mathbf{x}),\dots,\mathbf{x}^{(N)}\sim P_{\theta^{t}}(\mathbf{x})$
    rank samples such that $f(\mathbf{x}^{(1)})\geq\dots\geq f(\mathbf{x}^{(N)})$
    // update probability vector towards the $N_{s}$ best samples
    for $j$ from $1$ to $D$:
        $p_{t+1,j}:=p_{t,j}$
    for $i$ from $1$ to $N_{s}$:
        for $j$ from $1$ to $D$:
            $p_{t+1,j}:=p_{t+1,j}\times(1-\delta\hspace{-1pt}t)+x_{j}^{(i)}\times\delta\hspace{-1pt}t$
    // mutate probability vector
    for $j$ from $1$ to $D$:
        if $\mathcal{U}([0;1]) < m$:
            $p_{t+1,j}:=p_{t+1,j}\times(1-\alpha)+\mathcal{U}(\{0,1\})\times\alpha$
    $t:=t+1$
return best solution $\hat{\mathbf{x}}:=\mathbf{x}^{(1)}$.
end
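For reference, here is one possible Python implementation of this pseudocode (a sketch; the function names and the OneMax test function are illustrative, and it tracks the best sample seen over all generations rather than only the final population's best).

```python
import numpy as np

rng = np.random.default_rng(2)

def pbil(f, D, N=100, Ns=10, dt=0.1, m=0.02, alpha=0.05, steps=200):
    """PBIL for maximizing f over {0,1}^D, following the pseudocode above."""
    p = np.full(D, 0.5)                               # independent Bernoulli parameters
    best_x, best_f = None, -np.inf
    for _ in range(steps):
        xs = (rng.random((N, D)) < p).astype(int)     # sample the population
        order = np.argsort([f(x) for x in xs])[::-1]  # rank samples, best first
        for x in xs[order[:Ns]]:                      # pull p towards the Ns best samples
            p = p * (1 - dt) + x * dt
        mutate = rng.random(D) < m                    # mutate the probability vector
        p[mutate] = p[mutate] * (1 - alpha) + rng.integers(0, 2, mutate.sum()) * alpha
        if f(xs[order[0]]) > best_f:                  # keep track of the best sample seen
            best_x, best_f = xs[order[0]], f(xs[order[0]])
    return best_x

# Example: OneMax (count of ones), maximized by the all-ones string.
print(pbil(f=np.sum, D=20))  # should print twenty ones
```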
