
Coevolutionary Policy Search

Periodic Reporting for period 3 - CoPS (Coevolutionary Policy Search)

Reporting period: 2018-10-01 to 2020-03-31

In this project, we are developing a new class of decision-theoretic planning methods that overcome fundamental obstacles to the efficient optimization of autonomous agents. Creating agents that are effective in diverse settings is a key goal of artificial intelligence with enormous potential implications: robotic agents would be invaluable in homes, factories, and high-risk settings; software agents could revolutionize e-commerce, information retrieval, and traffic control.

The main challenge lies in specifying an agent’s policy: the behavioral strategy that determines its actions. Since the complexity of realistic tasks makes manual policy construction hopeless, there is great demand for decision-theoretic planning methods that automatically discover good policies. Despite enormous progress, the grand challenge of efficiently discovering effective policies for complex tasks remains unmet.

A fundamental obstacle is the cost of policy evaluation: estimating a policy’s quality by averaging performance over multiple trials. This cost grows quickly with increases in task complexity (making trials more expensive) or stochasticity (necessitating more trials).
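To make the cost concrete, here is a minimal Monte Carlo policy-evaluation sketch (the environment, policy, and reward are invented stand-ins, not taken from the project). Because the standard error of the average shrinks only as 1/sqrt(N), halving the error of an estimate requires four times as many trials.

```python
import random

def evaluate_policy(policy, env_step, n_trials, horizon=10, seed=0):
    """Estimate a policy's value by averaging returns over n_trials rollouts.
    The standard error of the estimate shrinks only as 1/sqrt(n_trials)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        state, ret = 0, 0.0
        for _ in range(horizon):
            action = policy(state)
            state, reward = env_step(state, action, rng)
            ret += reward
        total += ret
    return total / n_trials

# Toy stochastic environment: each step yields the action plus Gaussian noise.
def noisy_step(state, action, rng):
    return state + 1, action + rng.gauss(0.0, 1.0)

# A constant policy; its true value over 10 steps is 10.
value = evaluate_policy(lambda s: 1.0, noisy_step, n_trials=1000)
```

Making the task more stochastic (larger noise) or longer-horizon (more expensive trials) drives this cost up directly.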

To address this difficulty, we are developing new approaches that simultaneously optimize both policies and the manner in which those policies are evaluated. The key insight is that, in many tasks, many trials are wasted because they do not elicit the controllable rare events critical for distinguishing between policies. Thus, we are developing methods that leverage coevolution to automatically discover the best events, instead of sampling them randomly.

These methods have the potential to greatly improve the efficiency of decision-theoretic planning and, in turn, help realize the potential of autonomous agents. In addition, by automatically identifying the most useful events, the resulting methods are helping isolate critical factors in performance and thus yield new insights into what makes decision-theoretic problems hard.

So far, substantial progress has been made in developing new robust reinforcement learning methods that cope with significant rare events: events that, despite their low probability, have a large effect on expected performance.
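The core idea of steering sampling toward rare events, rather than waiting for them to occur, can be illustrated with importance sampling in one dimension (the threshold-4 failure event and the shifted Gaussian proposal below are illustrative choices, not taken from the project):

```python
import math
import random

def rare_event_estimate(n, shift, seed=0):
    """Estimate the probability that a standard-normal disturbance exceeds 4
    (about 3.2e-5) by sampling from a shifted proposal N(shift, 1) and
    reweighting each sample by the density ratio N(0,1)/N(shift,1).
    With shift=0 this is naive Monte Carlo, which almost never observes the
    event; with shift=4 roughly half the samples hit it, and the importance
    weights keep the estimate unbiased."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(shift, 1.0)                            # sample from the proposal
        w = math.exp(-0.5 * x * x + 0.5 * (x - shift) ** 2)  # likelihood ratio
        total += w * (1.0 if x > 4.0 else 0.0)               # indicator of the rare event
    return total / n
```

With 100,000 samples, `rare_event_estimate(100000, 0.0)` sees only a handful of events and is highly variable, while the shifted version concentrates its samples where they matter.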

Progress has been made on two fronts: frequentist approaches, based on frequentist statistics, as outlined in WP1 and WP2; and Bayesian approaches, based on Bayesian statistics, as outlined in WP3.

On the frequentist side, we developed a new method called off-environment reinforcement learning (OFFER), which takes a coevolutionary approach to policy search. A policy gradient method optimises the policy while, in parallel, a second optimisation process adapts the proposal distribution from which trajectories are sampled, ensuring sufficient focus on significant rare events. This dual optimisation realises the coevolutionary approach sketched in WP1. The resulting paper, which presented both theoretical and empirical results, was published at AAAI in 2016.
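The structure of such a dual optimisation can be sketched on a toy problem. Everything below is an illustrative invention rather than the updates from the OFFER paper: in particular, the proposal update here is a simple heuristic that pulls the proposal toward the true environment distribution, whereas OFFER's update steers sampling toward significant rare events.

```python
import math
import random

def offer_style_loop(n_iters=3000, lr=0.02, seed=0):
    """Toy dual-optimisation loop in the spirit of OFFER's structure.
    The policy parameter theta maximises E[-(theta - z)^2] for z ~ N(0, 1)
    (optimum: theta = 0), but samples are drawn from an adaptive proposal
    N(mu, 1) and reweighted by the true/proposal density ratio."""
    rng = random.Random(seed)
    theta, mu = 3.0, 0.0
    for _ in range(n_iters):
        z = rng.gauss(mu, 1.0)                            # sample from the proposal
        w = math.exp(-0.5 * z * z + 0.5 * (z - mu) ** 2)  # true/proposal density ratio
        theta += lr * w * (-2.0) * (theta - z)            # importance-weighted gradient step
        mu += lr * w * (z - mu)                           # heuristic proposal adaptation
    return theta, mu
```

The two updates consume the same samples, so adapting the proposal costs no extra trials.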

Furthermore, we have extended this work by devising a fundamental improvement to the policy gradient approach that underlies OFFER. The new approach, called expected policy gradients (EPG), marginalises out the stochastically selected action, and can be used to speed up both halves of OFFER’s dual optimisation. We have already obtained strong theoretical results for EPG as well as substantial empirical results. The paper was published in AAAI in 2018. The work on EPG has also inspired a new approach we call Fourier policy gradients, which uses Fourier analysis to unify a family of policy gradient methods and construct new methods as well. This work was recently accepted for publication at ICML in 2018.
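The variance saved by marginalising out the action can be seen in one dimension. The Gaussian policy and quadratic critic below are illustrative choices, not the setting from the paper: the score-function estimator is noisy, while the expected-gradient version integrates the action out in closed form.

```python
import random

def stochastic_pg(mu, sigma, n, rng):
    """Score-function (vanilla policy-gradient) estimate of d/dmu E[Q(a)]
    for a Gaussian policy a ~ N(mu, sigma^2) with Q(a) = -a^2: average the
    noisy estimator grad-log-pi(a) * Q(a) over n sampled actions."""
    total = 0.0
    for _ in range(n):
        a = rng.gauss(mu, sigma)
        total += ((a - mu) / sigma ** 2) * (-a * a)
    return total / n

def expected_pg(mu, sigma):
    """EPG-style gradient: the action is integrated out analytically.
    Here E[Q] = -(mu^2 + sigma^2), so the exact gradient is -2 * mu,
    with no variance from action selection."""
    return -2.0 * mu
```

With mu = sigma = 1, the stochastic estimator needs tens of thousands of samples to pin the gradient down to a couple of decimal places, while the analytic version is exact.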

Building on this momentum, we have also developed multi-agent variants of these frequentist approaches, allowing teams of agents to efficiently optimise their joint control policies by exploiting these same principles. This is a natural extension since coevolution takes an inherently multi-agent perspective on optimisation. This work has resulted in a paper at NIPS in 2016, a paper at ICML in 2017, a paper at AAAI in 2018 that won the Outstanding Student Paper Award, and a paper recently accepted at ICML in 2018.

On the Bayesian side, we developed a new method called alternating optimisation and quadrature (ALOQ), which uses Bayesian optimisation and Bayesian quadrature to tackle the same problem as OFFER. In each optimisation step, Bayesian optimisation is used to select a policy to evaluate, and Bayesian quadrature is used to sample the trajectory on which that policy will be evaluated. We obtained strong empirical results for ALOQ on a number of benchmark tasks, substantially outperforming existing methods. In addition, in collaboration with roboticists in France, we applied ALOQ to challenging simulated robot control tasks such as the control of a robot arm with multiple joints and the control of a walking hexapod robot. The resulting paper was published at AAAI in 2018.
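The alternation can be sketched with a small Gaussian-process surrogate over (policy parameter, environment setting) pairs. The kernel, grids, and acquisition rules below are simplifications chosen for brevity, not ALOQ's actual acquisition functions: a Bayesian-optimisation step picks the next policy by an upper confidence bound, then a quadrature-flavoured step picks the environment setting where the surrogate is most uncertain.

```python
import numpy as np

def rbf(A, B, ls=0.5):
    """Squared-exponential kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-4):
    """Gaussian-process posterior mean and variance at test points Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Xs, X)
    mean = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,ij->i', Ks, np.linalg.solve(K, Ks.T).T)
    return mean, np.maximum(var, 0.0)

def aloq_style_step(X, y, thetas, zs, beta=2.0):
    """One simplified ALOQ-flavoured iteration given observed (theta, z) pairs
    X and their returns y: pick the policy parameter with the best upper
    confidence bound on its z-averaged predicted return, then pick the
    environment setting z at which the surrogate is most uncertain for it."""
    grid = np.array([[t, z] for t in thetas for z in zs])
    mean, var = gp_posterior(X, y, grid)
    mean = mean.reshape(len(thetas), len(zs))
    var = var.reshape(len(thetas), len(zs))
    ucb = mean.mean(axis=1) + beta * np.sqrt(var.mean(axis=1))
    i = int(np.argmax(ucb))
    j = int(np.argmax(var[i]))
    return thetas[i], zs[j]
```

Each evaluation of the chosen pair is appended to (X, y) and the loop repeats; ALOQ itself combines this with a proper Bayesian-quadrature model of the environment distribution.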

Continuing our collaboration with the roboticists, we have been working to apply ALOQ to physical robots and to show that its robustness leads to performance advantages. This work has been written up as an extension of the AAAI-18 paper and submitted as a journal article to a JMLR special issue on Bayesian optimisation.

We are also developing a successor to ALOQ, which we call contextual policy optimisation, that automatically selects a distribution over environment variables so as to enable a policy gradient method to maximise its one-step performance. This paper was recently submitted for publication to NIPS 2018.

The work described above encapsulates significant progress beyond the state of the art. These advances include:

1. Efficient frequentist methods for learning robust policies in the presence of rare events.

2. Efficient Bayesian methods for doing the same.

3. Fundamental advances in the analysis of policy gradient methods and new practical methods inspired by that analysis.

4. Major advances in the optimisation of cooperative multi-agent systems using such policy gradient methods.