Estimation of Distribution Algorithms (EDAs)


EDAs are mathematically principled evolutionary algorithms that do away with the biological analogy. As the name suggests, EDAs are based on an unsupervised learning topic, density estimation, which will be reviewed thoroughly in the machine learning section. EDAs achieve state-of-the-art performance in the black-box optimization setting, where the goal is to optimize a function $f$ without any knowledge of how $f$ computes its values.

EDAs propose to represent individuals as samples from a probability distribution: the so-called proposal distribution. A population is then a set of i.i.d. samples from this distribution. In the preceding section, we saw that the mutation and cross-over operators served to define small possible movements around a current population. The EDA approach has the advantage of transforming the problem of moving towards better populations in the input space (which may not be well behaved) into a proxy problem which is usually well behaved: moving towards better proposal distributions.
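As a minimal sketch of this representation, assuming a one-dimensional Gaussian proposal and NumPy (neither of which is prescribed by the text), a population is simply an array of i.i.d. draws from the current proposal:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical proposal parameters theta = (mean, std) of a 1-D Gaussian.
mean, std = 0.0, 1.0
mu = 20  # population size

# A population is a set of mu i.i.d. samples from the proposal distribution.
population = rng.normal(mean, std, size=mu)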

Formally, the algorithm generates a new population by taking $\mu$ samples from a proposal distribution $P_{\theta_t}(x)$. These $\mu$ individuals are then ranked according to their $f$-values, and the $\sigma$ best individuals are used to update the proposal distribution with a log-likelihood gradient ascent step, i.e.

$$\theta_{t+1} = \theta_t + \delta_t \nabla_\theta \sum_{i=1}^{\sigma} \log P_{\theta_t}\big(x^{(i)}\big),$$

where $\delta_t$ is the step size of the gradient ascent update. The algorithm can then run for a number of steps, until a satisfactory solution is found. The figure below gives an example of an EDA update for a Gaussian proposal distribution.
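The loop below sketches one such update for a one-dimensional Gaussian proposal with a fixed standard deviation, in which case the log-likelihood gradient with respect to the mean has the closed form $\sum_i (x^{(i)} - \text{mean})/\text{std}^2$. The objective and parameter values are illustrative assumptions, not part of the text.

import numpy as np

def eda_step(f, mean, std, mu=50, sigma=10, delta=0.01, rng=None):
    # mu: number of proposal samples; sigma: number of selected samples;
    # delta: step size of the gradient ascent update (names follow the text).
    rng = rng or np.random.default_rng()
    x = rng.normal(mean, std, size=mu)        # sample mu individuals
    best = x[np.argsort(f(x))[::-1][:sigma]]  # keep the sigma best (maximization)
    grad_mean = np.sum(best - mean) / std**2  # d/d mean of the log-likelihood
    return mean + delta * grad_mean

# Example: maximize f(x) = -(x - 3)^2, starting the mean at 0.
f = lambda x: -(x - 3.0) ** 2
mean, std = 0.0, 1.0
for t in range(200):
    mean = eda_step(f, mean, std, rng=np.random.default_rng(t))
print(mean)  # drifts towards the maximizer x = 3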

Although the purpose of the algorithm is to maximize $\mathbb{E}[f(x)]$, it actually maximizes a surrogate objective, $\mathbb{E}[w(x)]$, where the weight function $w(x)$ is equal to 1 for the $\sigma$ best individuals, and 0 otherwise. This has the advantage of making the approach invariant w.r.t. monotone transformations of the objective function $f$.
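This invariance is easy to check numerically: since $w$ only depends on the ranks of the $f$-values, any increasing transformation of $f$ selects the same individuals. A small sketch under the same NumPy assumptions as above:

import numpy as np

def weights(fvals, sigma):
    # w(x) = 1 for the sigma best individuals, 0 otherwise
    w = np.zeros_like(fvals)
    w[np.argsort(fvals)[::-1][:sigma]] = 1.0
    return w

rng = np.random.default_rng(0)
x = rng.normal(size=100)
f = lambda x: -(x - 3.0) ** 2

# exp is monotone increasing, so the selected individuals are identical.
assert np.array_equal(weights(f(x), 10), weights(np.exp(f(x)), 10))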

The maximization of the surrogate objective $\mathbb{E}[w(x)]$ is done with gradient ascent:

$$\nabla_\theta \mathbb{E}[w(x)] = \nabla_\theta \int_{\operatorname{dom} f} w(x)\, P_{\theta_t}(x)\, \mathrm{d}x$$
$$= \int_{\operatorname{dom} f} w(x)\, \nabla_\theta P_{\theta_t}(x)\, \mathrm{d}x$$
$$= \int_{\operatorname{dom} f} \nabla_\theta \log P_{\theta_t}(x)\, w(x)\, P_{\theta_t}(x)\, \mathrm{d}x$$

(using the identity $\nabla_\theta P_{\theta_t} = P_{\theta_t} \nabla_\theta \log P_{\theta_t}$), where taking $\sigma$ samples from $w(x) P_{\theta_t}(x)$ can be done by taking samples from $P_{\theta_t}(x)$ and keeping only the $\sigma$ best according to $f$.
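In practice, the last integral is estimated by Monte Carlo: draw from $P_{\theta_t}$, select the $\sigma$ best, and average the score $\nabla_\theta \log P_{\theta_t}$ over the selected samples. A sketch for the Gaussian case (illustrative parameter values; the score with respect to the mean is $(x - \text{mean})/\text{std}^2$):

import numpy as np

rng = np.random.default_rng(0)
mean, std, N, sigma = 0.0, 1.0, 1000, 100
f = lambda x: -(x - 3.0) ** 2

x = rng.normal(mean, std, size=N)        # samples from P_theta
w = np.zeros(N)
w[np.argsort(f(x))[::-1][:sigma]] = 1.0  # w = 1 for the sigma best

# Monte Carlo estimate of the integral of grad(log P) * w * P:
grad_est = np.mean(w * (x - mean) / std**2)
print(grad_est)  # positive: the mean is pushed towards the good region x ~ 3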

This general framework can be adapted to optimize functions in $\mathbb{R}^D$, e.g. with CMA-ES [Hansen2008], or in discrete spaces such as $\{0,1\}^D$ with PBIL [Baluja1994]. It has the advantage of allowing a move towards several candidate solutions at once, contrary to methods such as hill climbing. The PBIL algorithm for optimization over $\{0,1\}^D$ is given below.


Figure 5: The steps of an EDA given a fitness function (a) and a Gaussian proposal distribution at time t (b). First, samples are drawn from the proposal distribution (c). The σ best samples are then selected according to their f-values (d). Finally, the likelihood of the selected samples w.r.t. the parameters (e) can be increased with a step of log-likelihood gradient ascent leading to a new proposal distribution at time t+1 (f).

Algorithm (PBIL)
Inputs: $f$, a function with $\operatorname{dom} f = \{0,1\}^D$.
$N$, the number of proposal samples.
$N_s$, the number of samples to select at each step.
$\delta_t$, the step size.
$m$, the probability of a mutation.
$\alpha$, the mutation shift.
Outputs: $\hat{x}$, an approximation of a global maximum.
Variables: $(p_{t,1}, p_{t,2}, \ldots, p_{t,D})$, the parameters of an independent Bernoulli proposal distribution at time $t$: $P_{\theta_t}(x) = p_{t,1}^{x_1}(1-p_{t,1})^{1-x_1} \times \cdots \times p_{t,D}^{x_D}(1-p_{t,D})^{1-x_D}$, for $p_{t,i}$ the probability that $x_i = 1$ at time $t$.
$x^{(1)}, \ldots, x^{(N)}$, the proposal samples at time $t$, not to be confused with $x_1, \ldots, x_D$, the components of a vector $x$.
begin
$p_{0,1}, \ldots, p_{0,D} := \frac{1}{2}, \ldots, \frac{1}{2}$
until satisfied:
    $x^{(1)} \sim P_{\theta_t}(x), \ldots, x^{(N)} \sim P_{\theta_t}(x)$
    rank samples ensuring $f(x^{(1)}) \geq \cdots \geq f(x^{(N)})$
    // update probability vector, starting from the current parameters
    for $j$ from 1 to $D$:
        $p_{t+1,j} := p_{t,j}$
    for $i$ from 1 to $N_s$:
        for $j$ from 1 to $D$:
            $p_{t+1,j} := p_{t+1,j} \times (1 - \delta_t) + x_j^{(i)} \times \delta_t$
    // mutate probability vector
    for $j$ from 1 to $D$:
        if $U([0;1]) < m$:
            $p_{t+1,j} := p_{t+1,j} \times (1 - \alpha) + U(\{0,1\}) \times \alpha$
return best solution $\hat{x} := x^{(1)}$.
end
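For concreteness, here is a minimal NumPy sketch of PBIL under the same maximization convention; the parameter defaults are illustrative assumptions, and it returns the best individual seen over all steps rather than $x^{(1)}$ of the last population, a common practical variant:

import numpy as np

def pbil(f, D, N=100, Ns=10, delta=0.1, m=0.02, alpha=0.05, steps=200, rng=None):
    # f maps an (N, D) array of {0,1} rows to N fitness values (maximized).
    rng = rng or np.random.default_rng()
    p = np.full(D, 0.5)                  # independent Bernoulli parameters
    best_x, best_f = None, -np.inf
    for _ in range(steps):
        x = (rng.random((N, D)) < p).astype(float)  # N proposal samples
        fx = f(x)
        order = np.argsort(fx)[::-1]     # f(x^(1)) >= ... >= f(x^(N))
        if fx[order[0]] > best_f:
            best_x, best_f = x[order[0]], fx[order[0]]
        for i in order[:Ns]:             # update with the Ns best samples
            p = p * (1 - delta) + x[i] * delta
        mutate = rng.random(D) < m       # mutate the probability vector
        p[mutate] = p[mutate] * (1 - alpha) + rng.integers(0, 2, mutate.sum()) * alpha
    return best_x

# Example: maximize the number of ones (one-max).
print(pbil(lambda x: x.sum(axis=1), D=20, rng=np.random.default_rng(0)))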
