Periodic Reporting for period 1 - OCAL (Optimal Control at Large)
Reporting period: 2018-11-01 to 2020-04-30
Optimal control aims to develop decision-making algorithms that extract the maximal benefit out of a dynamical system. Optimal control problems arise in a range of application domains, including energy management problems (where the aim is to meet energy demand with the minimum cost/carbon footprint under constraints imposed by the dynamics of the underlying physical processes), or portfolio optimisation (where the aim could be to maximise return subject to the dynamics and uncertainties of the markets), to name but a few. In the absence of accurate models for the underlying processes, optimal control problems are sometimes treated in a data-driven fashion. This could be in the spirit of reinforcement learning (where optimal decisions are derived by observing the effect of earlier actions and the resulting rewards collected) or of apprenticeship learning (where optimal decisions are derived by observing the actions of an expert). Despite wide-ranging progress on both the theory and applications of optimal control for more than half a century, considerable challenges remain when it comes to applying the resulting methods to large-scale systems. The difficulties become even greater when one moves outside the classical realm of model-based optimal control to address problems where models are replaced by data, or where macroscopic behaviours emerge out of microscopic interactions of large populations of agents. To address these challenges, we are developing a framework for approximating optimal control problems using randomised optimisation. The starting point is the formulation of optimal control problems as infinite-dimensional linear programs. Our work suggests that randomised methods, on the one hand, can serve as a basis for algorithms to approximate such infinite-dimensional programs and, on the other, enjoy close connections to statistical learning theory, providing a direct link to data-driven approaches.
Work performed from the beginning of the project to the end of the period covered by the report and main results achieved so far
We study infinite-horizon optimal control problems for discrete-time systems when the dynamics and the stage costs are unknown. The general approach we have considered involves exploring the environment and collecting input/output data and costs in a reinforcement learning fashion. We can then cast the acquired information into a linear programming (LP) formulation that returns an approximation of the optimal Q-function and policy. To deal with the potentially very large optimisation problems arising in data-driven optimal control, we exploit the parallel computing architecture of a graphics processing unit to accelerate a state-of-the-art method for solving such optimisation problems. Our open-source CUDA C implementation has been tested on many large-scale problems and was shown to be up to two orders of magnitude faster than the CPU implementation. For deterministic systems that involve no uncertainties we derived a family of data-driven iterative optimisation algorithms based on the LP approach, off-policy Q-learning, and randomised experience replay, that compute a near-optimal feedback policy. To make the process recursive, we developed a data-driven policy iteration scheme that solves a sequence of LPs whose maximum size can be bounded. The “state” of the algorithm is encoded in the set of binding constraints at an optimal solution of the LP. In particular, given the initial data set, we solve the corresponding LP and keep only the data samples associated with the binding constraints. We then add constraints corresponding to new data samples and repeat the procedure. We are also extending this to stochastic optimal control problems (SOCPs) posed in the Markov decision process (MDP) formalism. While in classical MDPs an agent aims to minimize a known cost criterion, in many applications one cannot easily specify the cost of a task, but can instead observe an expert's behaviour.
As a result, it is natural to consider data-driven inverse SOCPs, which consist of inferring a cost criterion from sampled optimal trajectories, as well as the problem of learning a policy that achieves or even surpasses the performance of a policy demonstrated by an expert. We derive approximation schemes that are computationally efficient and at the same time provide explicit probabilistic performance bounds on the quality of the recovered solutions. For SOCPs we also introduced a new contractive operator, the Relaxed Bellman Operator (RBO), that can be used in place of the standard one to build simpler LPs for the stochastic formulation of the problem. In particular, we demonstrate that in the case of linear time-invariant systems the policy we retrieve coincides with the optimal one without approximations, as the RBO preserves the shape of the optimal Q-function. Moreover, a wide class of real-world applications can be modelled through hybrid dynamics. Even with a perfect model of the dynamics, hybrid problems are very difficult to solve. We considered a sub-class of hybrid systems that are time-invariant and do not have binary states or control inputs. For such systems we develop an algorithm that learns an extended Q-function. Numerical experiments show that our controllers outperform naive implementations of hybrid model predictive control, without having to choose terminal costs or constraints.
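To make the LP view of the Q-function and the role of binding constraints more concrete, the following is a minimal sketch on a hypothetical four-state deterministic toy problem; it is not the project's implementation. It assumes the tabular LP "maximise the sum of q(x,u) subject to q(x,u) ≤ cost + γ q(x⁺,u') for every sampled transition and every next action u'". For this toy case the LP optimum coincides with the fixed point of the Bellman operator, so a plain fixed-point iteration is used here as a stand-in for an LP solver. All names (`step`, `stage_cost`, `solve_q`, `binding_constraints`) and all numbers are illustrative assumptions.

```python
# Toy deterministic problem (all names and numbers here are hypothetical).
GAMMA = 0.9            # discount factor
STATES = range(4)      # states 0..3 on a line; state 0 is the goal
ACTIONS = (-1, 1)      # move left or right


def step(x, u):
    """Deterministic dynamics: move by u, clipped to the interval [0, 3]."""
    return min(max(x + u, 0), 3)


def stage_cost(x, u):
    """Zero cost at the goal state, unit cost elsewhere."""
    return 0.0 if x == 0 else 1.0


# Data set of observed transitions (x, u, cost, next state). Here every
# state-action pair is sampled exactly once.
data = [(x, u, stage_cost(x, u), step(x, u)) for x in STATES for u in ACTIONS]


def solve_q(samples, iters=500):
    """Fixed-point iteration on the Bellman operator (stand-in for an LP solver)."""
    q = {(x, u): 0.0 for x, u, _, _ in samples}
    for _ in range(iters):
        q = {
            (x, u): c + GAMMA * min(q[(xn, un)] for un in ACTIONS)
            for x, u, c, xn in samples
        }
    return q


def binding_constraints(q, samples, tol=1e-9):
    """Return the LP constraints (sample, next action) that hold with equality."""
    active = []
    for x, u, c, xn in samples:
        for un in ACTIONS:
            # Slack of the constraint q(x,u) <= c + GAMMA * q(xn, un).
            slack = c + GAMMA * q[(xn, un)] - q[(x, u)]
            assert slack >= -tol, "q violates an LP constraint"
            if slack <= tol:
                active.append(((x, u, c, xn), un))
    return active


q = solve_q(data)
binding = binding_constraints(q, data)
# Greedy policy: pick the action with the smallest Q-value in each state.
policy = {x: min(ACTIONS, key=lambda u: q[(x, u)]) for x in STATES}

print(policy)
print(len(binding), "binding constraints out of", len(data) * len(ACTIONS))
```

In the spirit of the policy iteration scheme described above, one would retain only the samples that appear in `binding`, append constraints built from newly collected transitions, and solve again; the sketch stops after a single solve.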
Progress beyond the state of the art and expected potential impact (including the socio-economic impact and the wider societal implications of the project so far)
Our relaxed LP formulation leads to a program with half the decision variables of state-of-the-art formulations. This represents a step forward not only in computation, but also in the understanding of approximate dynamic programming problems, since it opens the door to the use of different contractive operators to construct LPs with different desired characteristics. We plan to derive suitable theoretical guarantees for such relaxed LPs, and to characterise their approximation performance when applied to general nonlinear systems and the extent to which they retain the shape-preserving property. Moreover, we are exploring new constraint sampling logics to optimize the reinforcement learning process and avoid wasting data. We will further investigate schemes that could ensure monotonic improvement of the policies and explore how state-action trajectories could be used to find an approximate solution to the dual LP. For inverse optimisation and apprenticeship learning, the fundamental difference between our work and existing methods based on linear duality and complementarity is that in our setting we face the additional difficulty of an infinite-dimensional and data-driven problem. Existing algorithms either come with strong theoretical guarantees, but are computationally expensive, or achieve significant empirical success in challenging benchmark tasks, but are not well understood. A promising future direction is to exploit our characterization of primal-dual optimality for the LP approach, as well as the corresponding convex-concave saddle point formulation we derived, in order to design tractable, model-free primal-dual algorithms with theoretical guarantees. We are currently testing the proposed algorithms on power system and automotive traction control applications, and on a snooker-playing robot under construction in our lab.