The general approach we have considered involves exploring the environment and collecting input/output data and costs in a reinforcement learning fashion. The acquired information is then used to formulate a linear program (LP) that returns an approximate optimal policy, encoded through the so-called Q-function. For deterministic systems involving no uncertainties, we derived a family of data-driven iterative optimisation algorithms, combining this LP approach with off-policy Q-learning and randomised experience replay, that compute a near-optimal feedback policy. To make the process recursive, we developed a data-driven policy iteration scheme that solves a sequence of LPs whose maximum size can be bounded; here the "state" of the algorithm is encoded in the set of binding constraints at the optimiser of the LP.

We are also extending this approach to stochastic optimal control problems (SOCPs) posed in the Markov decision process (MDP) formalism. While in classical MDPs an agent aims to minimise a known cost criterion, in many applications one cannot easily specify the cost of a task but can instead observe the behaviour of an expert. It is therefore natural to consider data-driven inverse SOCPs, which consist of inferring a cost criterion from sampled optimal trajectories, as well as the problem of learning a policy that achieves, or even surpasses, the performance of the policy demonstrated by the expert. We derived approximation schemes that are computationally efficient and at the same time provide explicit probabilistic performance bounds on the quality of the recovered solutions. For SOCPs we also introduced a new contractive operator, the Relaxed Bellman Operator (RBO), that can be used to build simpler LPs; in particular, we demonstrated that for linear time-invariant stochastic systems, and for all deterministic systems, the policy retrieved through the RBO coincides with the optimal one without approximation.

For model-based SOCPs we provided a precise interpretation of dynamic programming algorithms as instances of semismooth Newton-type methods; this opens the door both to the development of novel algorithms and to the deployment of advanced numerical solvers to improve scalability. To deal with potentially very large optimisation problems, we exploit the parallel computing architecture of GPUs; our open-source CUDA C implementation has been tested on many large-scale problems and was shown to be up to two orders of magnitude faster than its CPU counterpart. We are currently pursuing a similar parallelisation effort for our Newton-type methods.
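To give a flavour of the LP construction described above, the following minimal sketch (in Python, using the cvxpy modelling package) solves a toy deterministic problem with finite state and input spaces; the dynamics f, stage cost c, and discount factor gamma are illustrative placeholders rather than the systems studied in our work. For a deterministic system the Bellman inequality Q(x,u) <= c(x,u) + gamma * min_{u'} Q(f(x,u),u') splits into one linear constraint per candidate next input, so maximising the sum of Q subject to these constraints recovers the optimal Q-function; in a data-driven variant the same constraints would be assembled from observed transitions (x, u, c, x+) instead of the model.

```python
import numpy as np
import cvxpy as cp

# Toy deterministic MDP (illustrative placeholder, not the systems in the text):
# states {0,...,nx-1}, inputs {0,...,nu-1}, dynamics f and stage cost c tabulated.
nx, nu, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
f = rng.integers(0, nx, size=(nx, nu))        # successor state f(x, u)
c = rng.uniform(0.0, 1.0, size=(nx, nu))      # stage cost c(x, u)

Q = cp.Variable((nx, nu))

# Bellman inequality Q(x,u) <= c(x,u) + gamma * min_{u'} Q(f(x,u), u')
# is equivalent to one linear constraint per candidate next input u'.
constraints = [Q[x, u] <= c[x, u] + gamma * Q[f[x, u], up]
               for x in range(nx) for u in range(nu) for up in range(nu)]

# Maximising the sum of Q over this feasible set yields the optimal Q-function.
cp.Problem(cp.Maximize(cp.sum(Q)), constraints).solve()

policy = np.argmin(Q.value, axis=1)           # greedy policy from the Q-function
print("greedy policy:", policy)
```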
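The semismooth-Newton interpretation can likewise be illustrated on the same toy problem. The Bellman residual R(V) = TV - V is piecewise affine; taking the generalised Jacobian element gamma * P_pi - I associated with the greedy policy pi, one Newton step coincides with exact policy evaluation, i.e. with one sweep of policy iteration. The sketch below is a minimal tabular illustration of this equivalence, not our GPU implementation.

```python
import numpy as np

# Same toy deterministic setup as above (placeholder dynamics and costs).
nx, nu, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
f = rng.integers(0, nx, size=(nx, nu))
c = rng.uniform(0.0, 1.0, size=(nx, nu))

def bellman(V):
    """Bellman operator (T V)(x) = min_u c(x,u) + gamma * V(f(x,u))."""
    return (c + gamma * V[f]).min(axis=1)

V = np.zeros(nx)
for it in range(50):
    residual = bellman(V) - V                 # R(V) = T V - V, piecewise affine
    if np.max(np.abs(residual)) < 1e-10:
        break
    # The greedy policy selects an element of the generalised Jacobian of R:
    # J = gamma * P_pi - I, with P_pi the transition matrix under pi.
    pi = (c + gamma * V[f]).argmin(axis=1)
    P_pi = np.zeros((nx, nx))
    P_pi[np.arange(nx), f[np.arange(nx), pi]] = 1.0
    # Semismooth Newton step V <- V - J^{-1} R(V); this coincides with
    # exact policy evaluation, i.e. one policy-iteration sweep.
    V = V - np.linalg.solve(gamma * P_pi - np.eye(nx), residual)

print("iterations:", it, " optimal value function:", V)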
In a parallel stream, we have been developing methods for removing the "M" in MPC. Model Predictive Control (MPC) methods are very popular in industry and academia, but their reliance on a model sometimes hampers their deployment in settings where models are difficult to obtain and maintain; an example is energy management in buildings and districts. We have been working on Data-Enabled Predictive Control (DeePC) methods that replace the model in the optimisation problem solved by MPC with constraints built directly from data. The main challenge we are addressing here is dealing with systems subject to uncertainty; the key ingredient is appropriate regularisation, based on methods from stochastic programming and robust optimisation. Finally, to bridge the gap between the two classes of methods, we have developed techniques for learning multi-step Q-functions for hybrid systems and using them in an MPC setting.
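The following is a minimal DeePC sketch under simplifying assumptions: a scalar noise-free data-generating system, arbitrarily chosen window lengths T_ini and N and regularisation weights, and the initial trajectory taken from the tail of the offline data rather than from online measurements. Hankel matrices built from a single recorded trajectory replace the model, and a norm penalty on the decision variable g plays the role of the regularisation mentioned above.

```python
import numpy as np
import cvxpy as cp

def hankel(w, L):
    """Block-Hankel matrix with L block rows from a signal w of shape (T, m)."""
    T, m = w.shape
    return np.hstack([w[i:i + L].reshape(L * m, 1) for i in range(T - L + 1)])

# Recorded input/output data from the unknown system (placeholder: a stable
# scalar ARX system excited by persistently exciting random inputs).
rng = np.random.default_rng(0)
T = 120
u_d = rng.normal(size=(T, 1))
y_d = np.zeros((T, 1))
for t in range(1, T):
    y_d[t] = 0.8 * y_d[t - 1] + 0.5 * u_d[t - 1]   # data-generating system

T_ini, N = 4, 10                 # past window and prediction horizon (illustrative)
L = T_ini + N
H_u, H_y = hankel(u_d, L), hankel(y_d, L)
U_p, U_f = H_u[:T_ini], H_u[T_ini:]
Y_p, Y_f = H_y[:T_ini], H_y[T_ini:]

# The most recent T_ini samples pin down the initial condition implicitly
# (in closed loop these would come from online measurements).
u_ini, y_ini = u_d[-T_ini:].flatten(), y_d[-T_ini:].flatten()
r = np.ones(N)                   # output reference to track

g = cp.Variable(H_u.shape[1])
u, y = U_f @ g, Y_f @ g
lam_g = 10.0                     # regularisation weight (illustrative)

# Regularised DeePC: the Hankel-matrix equalities replace the model, and the
# norm penalty on g robustifies the problem against noise in the data.
cost = cp.sum_squares(y - r) + 0.1 * cp.sum_squares(u) + lam_g * cp.norm1(g)
constraints = [U_p @ g == u_ini, Y_p @ g == y_ini]
cp.Problem(cp.Minimize(cost), constraints).solve()

print("first planned input:", u.value[0])
```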