# Numerical Optimization

This work is in connection with CoalHMM, the coalescent hidden Markov model. CoalHMM is a statistical framework for inferring population genetic parameters. The optimizers are, however, stand-alone subroutines that can be applied to general-purpose numerical optimization tasks.

All four optimizers discussed below are black-box optimization methods. The first, the Nelder-Mead method, is deterministic; the other three, the genetic algorithm, particle swarm optimization, and differential evolution, are heuristic-based.

## Nelder-Mead Optimization

The Nelder-Mead method, or downhill simplex method, was developed by John Nelder and Roger Mead in 1965 ^{[1]} as a technique to minimize an objective function in a many-dimensional space.

The Nelder-Mead method is an iterative process that continually refines a simplex, which is a polytope of *D+1* vertices in *D* dimensions. During each iteration, the algorithm evaluates the objective function to determine a score at each point in the simplex, see figure below. The point *p_{min}* with the lowest score is reflected through the centroid of the remaining vertices to point *p_{r}*. If the score at *p_{r}* is neither the highest nor the lowest score, then *p_{r}* is used in place of *p_{min}* to form the simplex for the next iteration. If the score at *p_{r}* is the highest score in the simplex, then this reflected point is expanded away from the centroid to *p_{e}* and used in place of *p_{min}* to form the next simplex. If the score at *p_{r}* is still the lowest score, then *p_{r}* is contracted toward the centroid to point *p_{c}*. If the score at *p_{c}* is no longer the lowest score, then it is used to replace *p_{min}* to form the next simplex. Otherwise, all points in the simplex shrink around the point *p_{max}* with the highest score. This process continues until the simplex collapses beyond a predetermined size, a maximum length of time expires, or a maximum number of iterations is reached.
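The iteration described above can be sketched in a few lines of Python. This is a minimal illustration, not the CoalHMM implementation: the function name, the conventional move coefficients (reflection 1, expansion 2, contraction and shrink 0.5), and the exact stopping rule are our choices, and the code is written for minimization, so the vertex with the worst (highest) objective value is the one reflected.

```python
import numpy as np

def nelder_mead(f, x0, step=0.5, max_iter=500, tol=1e-8):
    """Minimize f with the reflect/expand/contract/shrink moves described above."""
    d = len(x0)
    # Initial simplex: x0 plus one perturbed vertex per dimension (D+1 vertices).
    simplex = [np.asarray(x0, dtype=float)]
    for i in range(d):
        vertex = simplex[0].copy()
        vertex[i] += step
        simplex.append(vertex)
    scores = [f(v) for v in simplex]
    for _ in range(max_iter):
        # Sort vertices from best (lowest value) to worst (highest value).
        order = np.argsort(scores)
        simplex = [simplex[i] for i in order]
        scores = [scores[i] for i in order]
        # Stop once the simplex has collapsed below a predetermined size.
        if np.max(np.abs(np.array(simplex[1:]) - simplex[0])) < tol:
            break
        centroid = np.mean(simplex[:-1], axis=0)  # centroid excluding the worst
        worst, f_worst = simplex[-1], scores[-1]
        xr = centroid + (centroid - worst)        # reflection
        fr = f(xr)
        if scores[0] <= fr < scores[-2]:
            simplex[-1], scores[-1] = xr, fr
        elif fr < scores[0]:
            xe = centroid + 2.0 * (centroid - worst)  # expansion
            fe = f(xe)
            simplex[-1], scores[-1] = (xe, fe) if fe < fr else (xr, fr)
        else:
            xc = centroid + 0.5 * (worst - centroid)  # contraction
            fc = f(xc)
            if fc < f_worst:
                simplex[-1], scores[-1] = xc, fc
            else:
                # Shrink every vertex toward the current best point.
                simplex = [simplex[0] + 0.5 * (v - simplex[0]) for v in simplex]
                scores = [f(v) for v in simplex]
    best = int(np.argmin(scores))
    return simplex[best], scores[best]
```

For example, minimizing (x − 3)² + (y + 1)² from the origin converges to a point near (3, −1).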

Nelder-Mead Optimization. An iteration of the Nelder-Mead method over two-dimensional space, showing point *p_{min}* reflected to point *p_{r}*, expanded to point *p_{e}*, or contracted to point *p_{c}*. If these test points do not improve the overall score of the simplex, then it shrinks around the point *p_{max}* with the highest score.

## Genetic Algorithms

A Genetic Algorithm (GA) is a type of evolutionary algorithm. This optimization technique gained popularity through the work of John Holland in the early 1970s ^{[2]}. It operates by encoding potential solutions as simple chromosome-like data structures and then applying genetic alterations to those structures. Over many iterations, its population of chromosomes evolves toward better solutions, which it determines based on fitness values returned from an objective function. The algorithm typically terminates when the diversity of its population reaches a predetermined minimum, a maximum length of time expires, or a maximum number of iterations has completed.

GAs typically operate in three phases: Selection, Crossover, and Mutation, see the figure below. Selection determines a subset of a population that will breed the next generation of individuals, and a variety of selection schemes exist. In one scheme, Roulette Wheel Selection (RWS) ^{[3]}, the algorithm selects individuals based on their relative fitness within the population; the probability *p_{i}* of selecting an individual *i* is given by *p_{i} = f_{i} / ∑^{N}_{j=1} f_{j}*, where *f_{i}* is the fitness of the individual and *N* is the population size. While RWS works by repeatedly sampling the population, a variation of RWS, Stochastic Universal Sampling (SUS) ^{[4]}, uses a single random value to sample all breeders by choosing them at evenly spaced intervals; this gives less fit individuals a greater chance to breed. RWS and SUS are both examples of fitness proportionate selection, but other selection schemes are based only on rank, and these are particularly beneficial when the lower and upper bounds of a fitness function are hard to determine. For example, in Tournament Selection ^{[5]}, the algorithm selects the individual with the highest fitness value from a random subset of the population.
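The two fitness proportionate schemes can be sketched as follows. The function names are illustrative, and both sketches assume strictly positive fitness values.

```python
import random

def roulette_wheel_select(population, fitnesses):
    """RWS: one spin of the wheel selects one breeder; call repeatedly to sample."""
    total = sum(fitnesses)
    spin = random.uniform(0.0, total)
    cumulative = 0.0
    for individual, fitness in zip(population, fitnesses):
        cumulative += fitness
        if cumulative >= spin:
            return individual
    return population[-1]  # guard against floating-point round-off

def stochastic_universal_sampling(population, fitnesses, n):
    """SUS: a single random start, then n evenly spaced pointers over the wheel."""
    total = sum(fitnesses)
    spacing = total / n
    start = random.uniform(0.0, spacing)
    chosen, cumulative, i = [], fitnesses[0], 0
    for pointer in (start + j * spacing for j in range(n)):
        while cumulative < pointer:
            i += 1
            cumulative += fitnesses[i]
        chosen.append(population[i])
    return chosen
```

Because SUS spaces its pointers by `total / n`, an individual whose fitness share is small can still fall under exactly one pointer, which is what gives less fit individuals a greater chance to breed.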

Genetic Algorithm. In one iteration of the genetic algorithm’s evolution, it operates in three stages: Selection, where it chooses a relatively fit subset of individuals for breeding; Crossover, where it recombines pairs of breeders to create a new population; and Mutation, where it potentially modifies portions of new chromosomes to help maintain the overall genetic diversity. Arrows in the diagram indicate transitions into the next genetic operation within one generation.

Crossover is a genetic operation used to combine pairs of individuals previously selected for breeding the following generation, and like Selection, several Crossover schemes exist. In One Point Crossover, the algorithm chooses a single point on both parents’ chromosomes, and it forms the child by concatenating all data prior to that point from the first parent with all data after that point from the second parent. In Two Point Crossover, the algorithm instead chooses two points, which splits the parents’ chromosomes into three regions; the algorithm then forms the child by concatenating the first region from the first parent, the second region from the second parent, and the third region from the first parent. In Uniform Crossover, each position on the child’s chromosome has equal opportunity to inherit its data from either parent. While nature serves as the inspiration for One and Two Point Crossover, Uniform Crossover ^{[6]} has no such biological analogue.
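The three crossover schemes might look like this in Python, assuming chromosomes are lists of equal length; the function names are ours.

```python
import random

def one_point_crossover(parent1, parent2):
    """Child = parent1 up to a random cut point, parent2 after it."""
    cut = random.randint(1, len(parent1) - 1)
    return parent1[:cut] + parent2[cut:]

def two_point_crossover(parent1, parent2):
    """Two cut points split the chromosomes into three regions,
    taken from the first, second, and first parent respectively."""
    a, b = sorted(random.sample(range(1, len(parent1)), 2))
    return parent1[:a] + parent2[a:b] + parent1[b:]

def uniform_crossover(parent1, parent2):
    """Each position inherits from either parent with equal probability."""
    return [g1 if random.random() < 0.5 else g2
            for g1, g2 in zip(parent1, parent2)]
```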

Mutation is the third phase in many GAs. Every position on every chromosome has a certain probability to mutate, which helps the population maintain or even improve its genetic diversity. Several variants of this technique exist. In Uniform Mutation ^{[7]}, when a position mutates, the algorithm replaces its value with a new value, chosen at random, between a predetermined lower and upper bound. In another variant, Gaussian Mutation ^{[8]}, when a position mutates, its current value increases or decreases based on a Gaussian random value.
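Both mutation variants can be sketched as below; the mutation rate, the Gaussian width `sigma`, and the [0, 1] bounds are illustrative defaults, not values prescribed by the references.

```python
import random

def uniform_mutation(chromosome, rate=0.1, low=0.0, high=1.0):
    """Each mutated position is replaced by a fresh uniform draw from [low, high]."""
    return [random.uniform(low, high) if random.random() < rate else g
            for g in chromosome]

def gaussian_mutation(chromosome, rate=0.1, sigma=0.1, low=0.0, high=1.0):
    """Each mutated position is perturbed by a Gaussian value, clamped to bounds."""
    return [min(high, max(low, g + random.gauss(0.0, sigma)))
            if random.random() < rate else g
            for g in chromosome]
```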

## Particle Swarm Optimization

Particle Swarm Optimization (PSO) is another heuristic-based search algorithm. Eberhart and Kennedy introduced this optimization technique through simulation of a simplified social model in 1995 ^{[9]}. Similar to GAs, PSOs are highly dependent on stochastic processes. Each individual in a PSO population maintains a position and a velocity as it flies through a hyperspace in which each dimension corresponds to one position in an encoded solution. Each individual's current position evaluates to a fitness value. Each individual also maintains its personal best position **p**_{i} and tracks the global best position **p**_{g} of the swarm, see the figure below. The former encapsulates the cognitive influence, and the latter the social influence. A PSO works as an iterative process. After each iteration, the algorithm adjusts the position of each individual based on its knowledge of **p**_{i} and **p**_{g}. This adjustment is analogous to the crossover operation used by GAs. The inertia of an individual, however, allows it to overshoot local minima and explore unknown regions of the problem domain.

Particle Swarm Optimization. Three vectors applied to a particle at position **x**_{i} in one iteration of a Particle Swarm Optimization: a cognitive influence urges the particle toward its previous best **p**_{i}, a social influence urges the particle toward the swarm’s previous best **p**_{g}, and its own velocity **v**_{i} provides inertia, allowing it to overshoot local minima and explore unknown regions of the problem domain.

In PSO, we represent the position of the *i*th particle as **x**_{i} = (*x_{i,1}, x_{i,2}, …, x_{i,D}*) and its velocity as **v**_{i} = (*v_{i,1}, v_{i,2}, …, v_{i,D}*), where *D* is the number of dimensions in the parameter space. We represent the particle's previous position with its best fitness as **p**_{i} = (*p_{i,1}, p_{i,2}, …, p_{i,D}*). During each iteration, the algorithm adjusts the velocity **v** and position **x** of each particle according to the following equations:

**v**^{'}_{i,d} ← **v**_{i,d} + *Φ_{p}* ⋅ *r_{p}* ⋅ (**p**_{i,d} − **x**_{i,d}) + *Φ_{g}* ⋅ *r_{g}* ⋅ (**p**_{g,d} − **x**_{i,d})

**x**^{'}_{i,d} ← **x**_{i,d} + **v**^{'}_{i,d}

where *r_{p}* and *r_{g}* are two random values between zero and one, and *Φ_{p}* and *Φ_{g}* are two positive constants representing the cognitive and social influences. As Shi and Eberhart demonstrated ^{[10]}, it can be beneficial to include a constant *ω*, which helps balance the global and local search forces. This term directly affects the inertia of the particle.

**v**^{'}_{i,d} ← *ω* ⋅ **v**_{i,d} + *Φ_{p}* ⋅ *r_{p}* ⋅ (**p**_{i,d} − **x**_{i,d}) + *Φ_{g}* ⋅ *r_{g}* ⋅ (**p**_{g,d} − **x**_{i,d})
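A single particle update in the inertia-weight form might look like the sketch below; the constant values (*ω* = 0.7, *Φ_{p}* = *Φ_{g}* = 1.5) are illustrative choices, not values prescribed by the method.

```python
import random

def pso_step(x, v, p_best, g_best, omega=0.7, phi_p=1.5, phi_g=1.5):
    """One velocity/position update for a single particle (inertia-weight form)."""
    new_v, new_x = [], []
    for d in range(len(x)):
        r_p, r_g = random.random(), random.random()  # fresh draws per dimension
        v_d = (omega * v[d]
               + phi_p * r_p * (p_best[d] - x[d])    # cognitive influence
               + phi_g * r_g * (g_best[d] - x[d]))   # social influence
        new_v.append(v_d)
        new_x.append(x[d] + v_d)  # position moves by the updated velocity
    return new_x, new_v
```

Note that a particle already sitting at both its personal best and the global best, with zero velocity, stays put: both influence terms vanish.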

## Differential Evolution

The differential evolution algorithm (DEA), first developed by Storn and Price in 1995 ^{[11]}, is a less well-known approach to numerical optimization. DEA requires less parameter tuning than GA and PSO. After generating and evaluating an initial population, similar to GA and PSO, the algorithm refines its solutions as follows. For each individual solution *j*, choose three other individuals *k*, *l*, and *m* at random from the population (with *j* ≠ *k* ≠ *l* ≠ *m*), calculate the difference between the solutions of *k* and *l*, scale it by multiplication with a parameter *f*, and create an offspring by adding the result to the solution of *m*. The only additional twist in this process is that the offspring's solution is not created entirely in this way; instead, parameters are partly inherited from individual *j*, such that *o_{i}* = *m_{i}* + *f* ⋅ (*k_{i}* − *l_{i}*) if a random number taken in range [0, 1] is less than *p*, and *o_{i}* = *j_{i}* otherwise. The proportion is determined by *p*, which determines how many parameters of the difference vector on average are copied to the offspring ^{[12]}.

More precisely, the operator iteratively copies consecutive parameters of the difference vector to the offspring until a random number taken in range [0, 1] is greater than or equal to *p*. If the offspring *o* is better than *j*, then *j* is replaced by *o*. This process continues until a maximum length of time expires or a maximum number of iterations is reached.
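One DEA generation, as described above, can be sketched as follows. The sketch assumes minimization (a lower fitness value is better); the parameter names *f* and *p* follow the text, and the default values are illustrative.

```python
import random

def de_step(population, fitness, f=0.8, p=0.9):
    """One DEA generation over a population of equal-length parameter lists."""
    n, dim = len(population), len(population[0])
    new_population = []
    for j in range(n):
        # Pick three distinct individuals k, l, m, all different from j.
        k, l, m = random.sample([i for i in range(n) if i != j], 3)
        offspring = list(population[j])  # parameters inherited from j by default
        d = random.randrange(dim)        # at least one parameter is mutated
        for _ in range(dim):
            offspring[d] = population[m][d] + f * (population[k][d] - population[l][d])
            d = (d + 1) % dim
            # Keep copying consecutive parameters while the random draw stays below p.
            if random.random() >= p:
                break
        # Replace j with the offspring only if the offspring is better.
        new_population.append(min(population[j], offspring, key=fitness))
    return new_population
```

Because each slot keeps the better of the parent and its offspring, the fitness of every individual is non-increasing from one generation to the next.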

## Rescale Parameter Space

Both GA and PSO require a legal parameter space to explore; the optimizers operate only within this predefined world. It would be unwise to pose tight restrictions on how big or small each parameter can be. We might be able to do so for a rough performance evaluation using simulated data, but for real biological applications we have no prior knowledge about the parameters. A good numerical optimizer should perform well with wide parameter ranges. To assist our heuristic optimization algorithms, we perform a series of scaling operations on the parameters.

Specifically, for each parameter, according to its nature, we define a wide parameter range. We take the log of this range. We then linearly interpolate the log range into interval [0, 1], which is the range that the heuristic optimizers operate on. From the optimizer's perspective, all solutions are encoded in a sequence of values within the range of [0, 1]. When reporting the estimated values, either intermediate or final, we reverse the scaling to obtain each parameter in its original unit.
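The scaling described above amounts to a pair of inverse mappings; the function names below are illustrative. For example, with a wide range of [10⁻⁶, 1], the value 10⁻³ maps to exactly 0.5 on the unit interval, since it sits at the midpoint of the log range.

```python
import math

def to_unit(value, low, high):
    """Map a parameter from its wide [low, high] range into [0, 1] via log scaling."""
    return (math.log(value) - math.log(low)) / (math.log(high) - math.log(low))

def from_unit(u, low, high):
    """Reverse the scaling to report an estimate in its original unit."""
    return math.exp(math.log(low) + u * (math.log(high) - math.log(low)))
```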

## References

[1] J. A. Nelder, R. Mead, A simplex method for function minimization, The Computer Journal 7 (4) (1965) 308-313. doi:10.1093/comjnl/7.4.308.

[2] J. H. Holland, Genetic algorithms, Scientific American 267 (1) (1992) 66-72.

[3] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, 1st Edition, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1989.

[4] J. E. Baker, Reducing bias and inefficiency in the selection algorithm, in: Proceedings of the Second International Conference on Genetic Algorithms and their Application, L. Erlbaum Associates Inc., Hillsdale, NJ, USA, 1987, pp. 14-21.

[5] B. L. Miller, D. E. Goldberg, Genetic algorithms, tournament selection, and the effects of noise, Complex Systems 9 (1995) 193-212.

[6] G. Syswerda, Uniform crossover in genetic algorithms, in: J. D. Schaffer (Ed.), Proceedings of the Third International Conference on Genetic Algorithms, Morgan Kaufmann, 1989, pp. 2-9.

[7] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer, 1996.

[8] K. Deb, Multi-Objective Optimization Using Evolutionary Algorithms, Wiley, 2001.

[9] R. Eberhart, J. Kennedy, A new optimizer using particle swarm theory, in: Proceedings of the Sixth International Symposium on Micro Machine and Human Science, 1995, pp. 39-43. doi:10.1109/MHS.1995.494215.

[10] Y. Shi, R. Eberhart, A modified particle swarm optimizer, in: Proceedings of the 1998 IEEE International Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence, 1998, pp. 69-73. doi:10.1109/ICEC.1998.699146.

[11] R. Storn, K. Price, Differential Evolution - a Simple and Efficient Adaptive Scheme for Global Optimization over Continuous Spaces, Technical Report TR-95-012, ICSI, Berkeley, CA, 1995.

[12] S. Paterlini, T. Krink, Differential evolution and particle swarm optimisation in partitional clustering, Computational Statistics & Data Analysis 50 (5) (2006) 1220-1247.