Coevolutionary Policy Search

Periodic Reporting for period 4 - CoPS (Coevolutionary Policy Search)

Reporting period: 2020-04-01 to 2021-09-30

In this project, we developed a new class of decision-theoretic planning methods that overcome fundamental obstacles to the efficient optimization of autonomous agents. This is a step towards creating agents that are effective in diverse settings, a key goal of artificial intelligence with enormous potential implications: robotic agents would be invaluable in homes, factories, and high-risk settings; software agents could revolutionize e-commerce, information retrieval, and traffic control.

The main challenge is in specifying an agent’s policy: the behavioral strategy that determines its actions. Since the complexity of realistic tasks makes manual policy construction hopeless, there is great demand for decision-theoretic planning methods that automatically discover good policies. Despite enormous progress, the grand challenge of efficiently discovering effective policies for complex tasks remains unmet.

A fundamental obstacle is the cost of policy evaluation: estimating a policy’s quality by averaging performance over multiple trials. This cost grows quickly with increases in task complexity (making trials more expensive) or stochasticity (necessitating more trials).
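
To make this concrete, the sketch below (a purely illustrative toy task, not the project's code) shows naive Monte Carlo policy evaluation: a policy's quality is estimated by averaging returns over independent trials, so the cost grows with both the cost per trial and the number of trials, and the estimate's standard error shrinks only as one over the square root of the number of trials.

```python
# Minimal sketch of Monte Carlo policy evaluation on a hypothetical noisy task.
import numpy as np

rng = np.random.default_rng(0)

def rollout(policy, horizon=100, noise=5.0):
    """One trial of a toy task: noisy rewards make single trials unreliable."""
    state, total = 0.0, 0.0
    for _ in range(horizon):
        action = policy(state)
        state += action
        total += -abs(state - 1.0) + noise * rng.normal()
    return total

def evaluate(policy, n_trials):
    """Naive Monte Carlo evaluation: average the returns of n_trials rollouts."""
    returns = np.array([rollout(policy) for _ in range(n_trials)])
    # The standard error shrinks only as 1/sqrt(n_trials), so greater stochasticity
    # (larger return variance) means many more trials for the same accuracy.
    return returns.mean(), returns.std() / np.sqrt(n_trials)

for n in (10, 100, 1000):
    mean, stderr = evaluate(lambda s: 0.1, n)
    print(f"n={n:5d}: estimate {mean:8.2f} +/- {stderr:.2f}")
```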

To address this difficulty, we developed new approaches that simultaneously optimize both policies and the manner in which those policies are evaluated. The key insight is that, in many tasks, many trials are wasted because they do not elicit the controllable rare events critical for distinguishing between policies. Thus, we developed methods that leverage coevolution to automatically discover the best events, instead of sampling them randomly. These methods have the potential to greatly improve the efficiency of decision-theoretic planning and, in turn, help realize the potential of autonomous agents. In addition, by automatically identifying the most useful events, the resulting methods helped to isolate critical factors in performance and thus yielded new insights into what makes decision-theoretic problems hard.
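
The sketch below illustrates, on a deliberately simple hypothetical estimation problem rather than one of the project's benchmarks, why directing trials at a controllable rare event and reweighting the results can be far more efficient than sampling the event at its natural frequency.

```python
# Toy comparison of naive sampling versus importance sampling focused on a rare event.
import numpy as np

rng = np.random.default_rng(1)
p_event, loss_on_event = 0.001, -1000.0   # rare but significant event

def naive_estimate(n):
    # Sample the event at its natural frequency: most trials are wasted.
    events = rng.random(n) < p_event
    return np.where(events, loss_on_event, 0.0).mean()

def focused_estimate(n, q_event=0.5):
    # Sample the event with probability q_event and reweight by p/q (importance
    # sampling), so half the trials exercise the event that actually matters.
    events = rng.random(n) < q_event
    weights = np.where(events, p_event / q_event, (1 - p_event) / (1 - q_event))
    return (np.where(events, loss_on_event, 0.0) * weights).mean()

true_mean = p_event * loss_on_event
print("true expected loss:  ", true_mean)
print("naive, 1000 trials:  ", naive_estimate(1000))    # often misses the event entirely
print("focused, 1000 trials:", focused_estimate(1000))  # far lower variance
```
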
In this project, we made great strides in developing new robust reinforcement learning methods that cope with significant rare events: events that occur with low probability but nonetheless have a large effect on expected performance. We also developed efficient techniques for closely related challenges in multi-agent and meta reinforcement learning.

Progress was made on two fronts: developing frequentist approaches to these problems, i.e. those based on frequentist statistics, as outlined in WP1 and WP2; and developing Bayesian approaches, i.e. those based on Bayesian statistics, as outlined in WP3.

On the frequentist side, we developed a new method called off-environment reinforcement learning (OFFER), which takes a coevolutionary approach to policy search. A policy gradient method optimises the policy while, in parallel, a second optimisation process optimises a proposal distribution from which trajectories are sampled, ensuring sufficient focus on significant rare events. This dual optimisation realises the coevolutionary approach sketched in WP1.
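
A rough sketch of this dual optimisation is shown below. It is a heavily simplified illustration under our own assumptions (a one-step task, a Gaussian policy, a single binary environment variable, and a heuristic proposal update), not the published OFFER algorithm: the policy is updated with importance-weighted policy gradients while the proposal over the environment variable is adapted in parallel to keep the rare event in view.

```python
# Simplified coevolutionary sketch: policy gradient plus an adaptive proposal.
import numpy as np

rng = np.random.default_rng(2)

# One-step task: a Gaussian policy with mean theta; a binary environment variable e,
# where e = 1 is a rare event (probability 0.01) that sharply changes the optimal action.
p_rare = 0.01

def reward(a, e):
    return -10.0 * (a - 3.0) ** 2 if e else -(a - 1.0) ** 2

theta = 0.0                      # policy parameter (mean of the Gaussian policy)
q_rare = 0.5                     # proposal probability of the rare event (coevolved)
m_rare, m_common = 1.0, 1.0      # running size of the gradient contribution per case
alpha = 0.01

for step in range(3000):
    # Sample the environment variable from the proposal, not its true distribution,
    # and correct with an importance weight.
    e = rng.random() < q_rare
    w = p_rare / q_rare if e else (1 - p_rare) / (1 - q_rare)
    a = theta + rng.normal()                    # action ~ N(theta, 1)
    g = w * reward(a, e) * (a - theta)          # importance-weighted score-function gradient

    # Parallel proposal update (a heuristic stand-in for OFFER's actual update): aim for
    # q(e) roughly proportional to p(e) times the typical gradient magnitude under e.
    if e:
        m_rare += 0.05 * (abs(g) / w - m_rare)
    else:
        m_common += 0.05 * (abs(g) / w - m_common)
    q_rare = float(np.clip(p_rare * m_rare / (p_rare * m_rare + (1 - p_rare) * m_common),
                           0.05, 0.95))

    theta += alpha * g                          # policy gradient step

print(f"policy mean: {theta:.2f} (optimum of the expected return is near 1.2)")
print(f"proposal probability of the rare event: {q_rare:.2f}")
```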

Furthermore, we extended this work by devising a fundamental improvement to the policy gradient approach that underlies OFFER. The new approach, called expected policy gradients (EPG), marginalises out the stochastically selected action and can be used to speed up both halves of OFFER’s dual optimisation. We obtained strong theoretical results for EPG, as well as substantial empirical results. The work on EPG also inspired a new approach we call Fourier policy gradients, which uses Fourier analysis to unify a family of policy gradient methods and to construct new ones.
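
The toy example below illustrates the marginalisation idea behind EPG, using a hypothetical quadratic critic and simple numerical integration rather than the method's actual analytical forms: compared with a single-sample policy gradient estimate, integrating over the action distribution removes the variance introduced by action sampling.

```python
# Toy illustration: single-sample policy gradient versus an action-marginalised estimate.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 0.0, 1.0                    # Gaussian policy N(mu, sigma^2)
Q = lambda a: -(a - 2.0) ** 2           # hypothetical critic with its optimum at a = 2

def single_sample_grad():
    """Classic stochastic policy gradient: one sampled action per estimate."""
    a = mu + sigma * rng.normal()
    return Q(a) * (a - mu) / sigma ** 2  # score-function estimator of dE[Q]/dmu

def expected_grad(n_quad=2001):
    """EPG-style estimate: marginalise the action out by integrating over the policy."""
    a = np.linspace(mu - 6 * sigma, mu + 6 * sigma, n_quad)
    pdf = np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.sum(Q(a) * (a - mu) / sigma ** 2 * pdf) * (a[1] - a[0])

samples = np.array([single_sample_grad() for _ in range(1000)])
print(f"single-sample estimates: mean {samples.mean():.2f}, std {samples.std():.2f}")
print(f"marginalised (expected) gradient: {expected_grad():.2f}")   # exact value is 4.0
```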

Building on this momentum, we also developed multi-agent variants of these frequentist approaches, allowing teams of agents to efficiently optimise their joint control policies by exploiting the same principles. This is a natural extension, since coevolution takes an inherently multi-agent perspective on optimisation. We also developed universal value exploration, which leverages key ideas from successor features in reinforcement learning to perform more efficient and scalable exploration in multi-agent learning. These advances also include tensorised actors, which exploit tools from tensor decomposition (high-dimensional variants of matrix factorisation techniques) to improve multi-agent reinforcement learning. Furthermore, we developed a regularised softmax technique that addresses a systematic overestimation bias in multi-agent reinforcement learning methods. Finally, we developed factored value functions for multi-agent learning, proposing deep coordination graphs as well as factored centralised critics to learn value functions faster and generalise better.
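
The sketch below gives a toy, table-based illustration of the factored value function idea behind deep coordination graphs (the published methods use neural networks and max-plus message passing over larger graphs; all numbers here are placeholders): the joint value decomposes into per-agent utilities plus pairwise payoffs on graph edges, so greedy joint actions can be found without searching the full joint action space.

```python
# Toy factored joint value for two agents connected by a single coordination edge.
import numpy as np

n_actions = 3
rng = np.random.default_rng(4)

# Per-agent utilities f_i(a_i) and a pairwise payoff g01(a_0, a_1) on the edge (0, 1).
f = [rng.normal(size=n_actions) for _ in range(2)]
g01 = rng.normal(size=(n_actions, n_actions))

def joint_q(a0, a1):
    """Factored joint value: sum of individual utilities plus the pairwise payoff."""
    return f[0][a0] + f[1][a1] + g01[a0, a1]

# With two agents, greedy joint-action selection by enumeration is cheap; with many
# agents, coordination graphs instead run max-plus message passing along the edges.
best = max(((a0, a1) for a0 in range(n_actions) for a1 in range(n_actions)),
           key=lambda pair: joint_q(*pair))
print("greedy joint action:", best, "value:", round(float(joint_q(*best)), 3))
```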

On the Bayesian side, we developed a new method called alternating optimisation and quadrature (ALOQ), which uses Bayesian optimisation and Bayesian quadrature to tackle the same problem as OFFER. In each optimisation step, Bayesian optimisation is used to select a policy to evaluate, and Bayesian quadrature is used to sample the trajectory on which that policy will be evaluated. We obtained strong empirical results for ALOQ on a number of benchmark tasks, substantially outperforming existing methods. In addition, in collaboration with roboticists in France, we applied ALOQ to challenging simulated robot control tasks, such as controlling a robot arm with multiple joints and controlling a walking hexapod robot. We also developed a successor to ALOQ, called contextual policy optimisation, which automatically selects a distribution over environment variables that enables a policy gradient method to maximise its one-step performance. Finally, we developed Bayesian Bellman operators, a new approach to Bayesian reinforcement learning that avoids the need to learn a model while fixing foundational problems with existing model-free Bayesian reinforcement learning methods.
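
The sketch below conveys the alternating structure of ALOQ in a heavily simplified form (our own toy task, a basic Gaussian process implementation, and simple acquisition heuristics rather than the method's actual Bayesian optimisation and quadrature machinery): a Gaussian process models return as a function of a policy parameter and an environment variable, the next policy parameter is chosen by an upper-confidence rule on the GP's estimate of expected return, and the environment setting for that evaluation is chosen where the GP is most uncertain.

```python
# Simplified alternating loop: GP surrogate, policy choice by UCB, setting choice by uncertainty.
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical task: return depends on a policy parameter theta and a binary
# environment variable env, where env = 1 is a rare but significant setting.
def true_return(theta, env):
    return -(theta - 2.0) ** 2 - 5.0 * env * (theta - 3.0) ** 2

envs = np.array([0.0, 1.0])
p_env = np.array([0.99, 0.01])               # true probabilities of the settings
thetas = np.linspace(0.0, 4.0, 41)

def rbf(A, B, ls=0.7):
    d = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * d / ls ** 2)

def gp_posterior(X, y, Xq, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP with an RBF kernel."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xq)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y
    var = np.maximum(1.0 - np.sum(Ks * sol, axis=0), 1e-9)   # k(x, x) = 1 for this kernel
    return mu, var

grid = np.array([[t, e] for t in thetas for e in envs])
X = np.array([[2.0, 0.0], [2.0, 1.0]])       # two initial evaluations
y = np.array([true_return(*x) for x in X])

for _ in range(20):
    mu, var = gp_posterior(X, y, grid)
    mu = mu.reshape(len(thetas), len(envs))
    var = var.reshape(len(thetas), len(envs))
    expected = mu @ p_env                     # GP estimate of the expected return
    ucb = expected + 2.0 * np.sqrt(var @ p_env ** 2)
    i = int(np.argmax(ucb))                   # Bayesian-optimisation-style policy choice
    theta_next = thetas[i]
    env_next = envs[int(np.argmax(var[i]))]   # evaluate the setting the GP is least sure about
    X = np.vstack([X, [theta_next, env_next]])
    y = np.append(y, true_return(theta_next, env_next))

mu, _ = gp_posterior(X, y, grid)
best = thetas[int(np.argmax(mu.reshape(len(thetas), len(envs)) @ p_env))]
print(f"estimated best policy parameter: {best:.2f}")
```
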
The work described above represents significant progress beyond the state of the art. These advances include:

1. Efficient frequentist methods for learning robust policies in the presence of rare events.

2. Efficient Bayesian methods for doing the same.

3. Fundamental advances in the analysis of policy gradient methods and new practical methods inspired by that analysis.

4. Major advances in the optimisation of cooperative multi-agent systems using such policy gradient methods.
Figure: Gaussian process used in the ALOQ method.