Road to ML Engineer #71 - Evolutionary Algorithms

Last Edited: 12/5/2025

This blog post introduces some evolutionary algorithms.

In the previous article, we discussed the motivations behind neuroevolution and the basic building blocks of evolutionary algorithms (EAs). In this article, we will cover some types of evolutionary algorithms and other approaches to evolutionary computation.

Genetic Algorithms

Genetic algorithms (GAs) are a type of evolutionary algorithm, first introduced in the 1970s, that became one of the most widely used EAs, especially for discrete or combinatorial optimization problems. GAs represent candidate solutions using appropriate gene representations, such as a weighted adjacency matrix for a neural network graph, and apply mutation (typically with a fixed, low mutation rate) and crossover operations to selected individuals. Common selection schemes include roulette wheel selection (probabilistic selection where fitter individuals have a higher chance of being selected), tournament selection (selecting the best individual from a small, randomly chosen group), rank-based selection (selecting individuals with probabilities assigned according to their fitness ranks), and truncated selection (selecting the top-K fittest individuals).

[Figure: GA results on the 2D Schaffer and Rastrigin functions]

The above visualization shows 20 generations of evolving 200 coordinates on the 2D Schaffer and Rastrigin functions using truncated selection with a survival rate of 50%. In this setup, a direct encoding is used, where the genotype is the same as the phenotype. Mutation and crossover (where an offspring's x and y values are taken from two different, randomly chosen parents) were applied with a 90% mutation rate and a 20% crossover rate, and the initial population was sampled from a multivariate Gaussian distribution. The implementation of the simple GA used is available in the Appendix section. We can observe from the visualization that the simple GA converged to the global optimum in 20 generations, although the run also suggests the potential for premature convergence in higher-dimensional, more complex fitness landscapes.
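For concreteness, here is a minimal sketch of one GA step in the spirit of the setup above, written with JAX like the Appendix implementations. The negated Rastrigin fitness, the helper name `ga_step`, and the uniform mutation range are illustrative assumptions, not the exact Appendix code.

```python
import jax
import jax.numpy as jnp

def rastrigin_fitness(pop):
    # Negated 2D Rastrigin, so higher fitness is better; global optimum at the origin.
    return -(20.0 + jnp.sum(pop**2 - 10.0 * jnp.cos(2.0 * jnp.pi * pop), axis=-1))

def ga_step(key, pop, survival_rate=0.5, crossover_rate=0.2, mutation_rate=0.9):
    n, d = pop.shape
    k1, k2, k3, k4, k5, k6 = jax.random.split(key, 6)
    # Truncated selection: keep the fittest half of the population as parents.
    parents = pop[jnp.argsort(-rastrigin_fitness(pop))[: int(n * survival_rate)]]
    # Each offspring starts as a copy of one randomly chosen parent.
    pa = jax.random.choice(k1, parents, shape=(n,))
    pb = jax.random.choice(k2, parents, shape=(n,))
    # Crossover: with probability crossover_rate, mix in genes from a second parent.
    do_cross = jax.random.bernoulli(k3, crossover_rate, (n, 1))
    from_pb = jax.random.bernoulli(k4, 0.5, (n, d))
    children = jnp.where(do_cross & from_pb, pb, pa)
    # Mutation: with probability mutation_rate, add a uniform perturbation.
    do_mut = jax.random.bernoulli(k5, mutation_rate, (n, 1))
    noise = jax.random.uniform(k6, (n, d), minval=-0.5, maxval=0.5)
    return jnp.where(do_mut, children + noise, children)

# Usage: evolve a Gaussian-initialized population for 20 generations.
key = jax.random.PRNGKey(0)
key, init_key = jax.random.split(key)
pop = jnp.array([1.0, 2.0]) + 0.5 * jax.random.normal(init_key, (200, 2))
for _ in range(20):
    key, step_key = jax.random.split(key)
    pop = ga_step(step_key, pop)
```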

Evolution Strategies

Evolution strategies (ES) are another widely used type of evolutionary algorithm, introduced in 1973 and designed for continuous, real-valued numerical optimization. ES represents each individual as a real-valued vector corresponding to the solution's parameters. ES primarily utilizes truncated selection and either keeps the selected individuals alongside their offspring or uses only the offspring in the next generation; these variants are called (μ + λ)-ES and (μ, λ)-ES, respectively. (μ denotes the number of selected individuals used to produce offspring, and λ denotes the number of offspring.)

[Figure: ES results on the 2D Schaffer and Rastrigin functions]

The above visualization shows 20 generations of evolving 200 coordinates on the 2D Schaffer and Rastrigin functions using a simple (μ + λ)-ES with μ = 1. The initial population is sampled from a multivariate Gaussian distribution (with mean vector [1.0, 2.0] and a covariance matrix whose diagonal values are set to a fixed variance, or mutation power, of 0.5). The implementation of the simple ES used is available in the Appendix section. We can observe that the simple ES gradually approaches the global maximum by re-centering the distribution on the best individual while keeping the covariance the same.
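Below is a minimal sketch of this (μ + λ)-ES step with μ = 1, reusing the negated Rastrigin fitness from the GA sketch above; the function name `es_step` and the exact population bookkeeping are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def es_step(key, pop, fitness_fn, sigma=0.5):
    # (mu + lambda)-ES with mu = 1: the single best individual survives and
    # produces the rest of the next generation by fixed-variance Gaussian mutation.
    n, d = pop.shape
    best = pop[jnp.argmax(fitness_fn(pop))]
    offspring = best + sigma * jax.random.normal(key, (n - 1, d))
    # "+" selection: the parent is kept alongside its offspring.
    return jnp.vstack([best[None, :], offspring])
```

Returning only `n` fresh offspring instead of stacking the parent on top would turn this into a (μ, λ)-ES.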

CMA-ES

In both the simple GA and ES, we were limited to fixed distributions (a uniform distribution for the GA and a normal distribution for the ES) for sampling mutation perturbations. This meant we couldn't exploit observations to adjust the extent of mutation during the evolutionary search. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES), as the name suggests, addresses this by adaptively adjusting the covariance matrix of the noise for better exploitation. Exactly like the mean vector, the covariance matrix is computed using the selected individuals from the previous generation.

[Figure: CMA-ES results on the 2D Schaffer and Rastrigin functions]

The above visualization shows 20 generations of evolving 200 coordinates on the 2D Schaffer and Rastrigin functions using CMA-ES with (μ + λ) selection (survival rate set to 25%). The mean and covariance are computed using functions provided by NumPy, meaning the covariance matrix alone has a runtime and memory cost of at least O(N²), where N is the number of variables (in this case, 2). Hence, CMA-ES is commonly used for optimizing a relatively small number of parameters (up to ~10K). The implementation of the CMA-ES used is available in the Appendix section. Here, we can observe that CMA-ES adaptively changes its elliptic sampling shape, though the algorithm is too exploitative, which led to premature convergence to a local maximum on the highly deceptive Rastrigin function.
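The following is a stripped-down sketch of the adaptation step described above: the mean and covariance are simply refit to the selected individuals each generation. Full CMA-ES additionally tracks evolution paths and a global step size, which this sketch omits; `cma_es_step` and the jitter term are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def cma_es_step(key, pop, fitness_fn, survival_rate=0.25):
    # Select the top 25% and refit the sampling distribution to them:
    # both the mean vector and the covariance matrix are recomputed
    # from the selected individuals of the previous generation.
    n, d = pop.shape
    elite = pop[jnp.argsort(-fitness_fn(pop))[: int(n * survival_rate)]]
    mean = jnp.mean(elite, axis=0)
    cov = jnp.cov(elite, rowvar=False) + 1e-6 * jnp.eye(d)  # jitter for numerical stability
    offspring = jax.random.multivariate_normal(key, mean, cov, shape=(n - elite.shape[0],))
    # (mu + lambda) selection: the elites survive alongside the new samples.
    return jnp.vstack([elite, offspring])
```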

OpenAI-ES

OpenAI Evolution Strategy (OpenAI-ES) is another technique that makes evolution strategies more exploitative, estimating the gradients of the parameters from samples drawn from a fixed multivariate Gaussian distribution with small variance and performing gradient ascent. Specifically, we sample N perturbations εᵢ from a multivariate Gaussian distribution with zero mean and diagonal covariance σ²I, and estimate the gradient of the fitness with respect to θ as (1/(Nσ)) · Σᵢ F(θ + σεᵢ) εᵢ. Since we do not compute gradients analytically, OpenAI-ES can be applied to non-differentiable and noisy functions. Also, the fitness evaluations used for the estimate can be parallelized using multiprocessing or vectorization, which makes the algorithm popular for optimizing a large number of parameters.
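A minimal sketch of this estimator is shown below, assuming a batched fitness function and illustrative names (`openai_es_step`, the learning rate `lr`); practical implementations usually also standardize the fitness values before averaging, which is omitted here for brevity.

```python
import jax
import jax.numpy as jnp

def openai_es_step(key, theta, fitness_fn, n_samples=100, sigma=0.1, lr=0.05):
    # Sample N Gaussian perturbations and estimate the gradient of the
    # expected fitness as (1 / (N * sigma)) * sum_i F(theta + sigma * eps_i) * eps_i.
    eps = jax.random.normal(key, (n_samples,) + theta.shape)
    fitness = fitness_fn(theta + sigma * eps)             # shape (n_samples,)
    grad = jnp.mean(fitness[:, None] * eps, axis=0) / sigma
    # Plain gradient ascent; no analytic gradient of fitness_fn is ever taken,
    # so the fitness function may be non-differentiable or noisy.
    return theta + lr * grad
```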

[Figure: OpenAI-ES results on the 2D Schaffer and Rastrigin functions]

The above visualization shows 20 generations of evolving 200 coordinates on the 2D Schaffer and Rastrigin functions with OpenAI-ES applied to a mean vector (with a fixed variance). The implementation of the OpenAI-ES used is available in the Appendix section. Instead of the mean vector, we can treat randomly chosen initial population members as the model's parameters to optimize, which works better than optimizing the mean vector in this case. (Try implementing the optimization of the initial population as practice and see the results.) In this specific implementation, fitness computations are performed twice, once for selection and once for gradient estimation, so that a smaller variance σ can be used for better gradient estimation. However, because only 100 samples are drawn and both fitness functions are deceptive, the algorithm failed to move towards the global maximum.

Other Approaches

In addition to the techniques discussed and experimented with above, there are other approaches with various strategies. The non-dominated sorting genetic algorithm II (NSGA-II) is an example of a genetic algorithm used for multi-objective optimization, which involves discovering a diverse set of solutions on the Pareto front (the set of Pareto-optimal solutions, for which no feasible solution can improve one objective without worsening another, capturing the tradeoffs between objectives). NSGA-II uses elitism, selecting the best-ranked individuals from the combined set of parents and offspring to preserve Pareto-optimal solutions; it relies on fast non-dominated sorting (iteratively identifying non-dominated solutions with bookkeeping and ranking them accordingly); and it uses crowding distance to prefer sparse regions and preserve diversity, as sketched below.
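As a sketch of NSGA-II's two core ingredients, the snippet below implements a Pareto-dominance check and the crowding distance, assuming all objectives are maximized; the full algorithm layers fast non-dominated sorting and elitist selection on top of these, and both function names are illustrative.

```python
import jax.numpy as jnp

def dominates(f_a, f_b):
    # f_a dominates f_b if it is no worse in every objective
    # and strictly better in at least one (maximization assumed).
    return jnp.all(f_a >= f_b) & jnp.any(f_a > f_b)

def crowding_distance(front):
    # front: (n, m) objective values of one non-dominated front.
    n, m = front.shape
    dist = jnp.zeros(n)
    for j in range(m):
        order = jnp.argsort(front[:, j])
        span = front[order[-1], j] - front[order[0], j] + 1e-12
        # Interior points: normalized gap between their two sorted neighbors.
        gaps = (front[order[2:], j] - front[order[:-2], j]) / span
        dist = dist.at[order[1:-1]].add(gaps)
        # Boundary points get infinite distance so the extremes are always kept.
        dist = dist.at[order[0]].set(jnp.inf).at[order[-1]].set(jnp.inf)
    return dist  # larger values sit in sparser regions and are preferred in ties
```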

Genetic programming (GP) is another approach explored in neuroevolution; it evolves variable-length, executable tree-structured programs for the open-ended discovery of optimal programs. In the context of neuroevolution, GP is used to optimize the rules for constructing neural network architectures, hyperparameters, and more. Cartesian genetic programming (CGP) is a variant of genetic programming that represents programs as directed acyclic graphs (DAGs) instead of trees, for efficient and simple application of GP to graphs, including neural networks. Evolutionary programming (EP) is an approach similar to but distinct from ES, focusing on phenotypic adaptation and typically using stochastic selection.

There are other distinct approaches heavily inspired by nature, like particle swarm optimization (PSO), inspired by the social behaviors of animals (such as bird flocking), and ant colony optimization (ACO), which optimizes graph traversal by mimicking how real ant colonies find paths to food sources with pheromone trails. On the other hand, there are more probabilistic and mechanical approaches, like estimation of distribution algorithms (EDAs), which fit statistical models for more informed and exploitative candidate sampling, and differential evolution (DE), which uses specific variation operators (mutation built from three individuals and crossover between mutant and non-mutant vectors) and greedy selection, and has proven simple and effective. As these discussions imply, advances in evolutionary algorithms themselves are highly likely to impact the field of neuroevolution, so further research on evolutionary algorithms is strongly encouraged.
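Of these, differential evolution is simple enough to sketch in a few lines; the snippet below shows its characteristic mutation from three individuals, crossover, and greedy selection. For brevity it does not force the three donors to be distinct or guarantee that at least one gene comes from the mutant, both of which classic DE does; `de_step` is an illustrative name.

```python
import jax
import jax.numpy as jnp

def de_step(key, pop, fitness_fn, f=0.8, cr=0.9):
    # Mutation: build a mutant v = a + F * (b - c) from three random individuals.
    n, d = pop.shape
    k1, k2 = jax.random.split(key)
    idx = jax.random.randint(k1, (n, 3), 0, n)
    a, b, c = pop[idx[:, 0]], pop[idx[:, 1]], pop[idx[:, 2]]
    mutant = a + f * (b - c)
    # Crossover between each mutant vector and its (non-mutant) target vector.
    cross = jax.random.bernoulli(k2, cr, (n, d))
    trial = jnp.where(cross, mutant, pop)
    # Greedy selection: the trial replaces the target only if it is at least as fit.
    better = (fitness_fn(trial) >= fitness_fn(pop))[:, None]
    return jnp.where(better, trial, pop)
```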

Conclusion

In this article, we discussed GA, ES, CMA-ES, and OpenAI-ES, and briefly introduced other approaches (NSGA-II, GP, CGP, EP, PSO, ACO, EDA, and DE) that can be used in neuroevolution. Although some evolutionary algorithms are designed for specific kinds of tasks, the effectiveness of an EA on the task at hand is highly sensitive to hyperparameters and virtually unpredictable, thus requiring experimentation. In the next article, we will finally start evolving neural networks with EAs.

Appendix

The following are Python implementations of the evolutionary algorithms and functions used for creating the visualizations in this article. The implementations utilize JAX for fast and parallelizable computations. (I highly recommend trying to implement them yourself to deepen your understanding of these algorithms.)

Resources