The general approach we have considered involves exploring the environment and collecting input/output data and costs in a reinforcement learning fashion. The acquired information is then used to formulate a linear program (LP) that returns an approximate optimal policy, encoded through the so-called Q-function. For deterministic systems involving no uncertainties, we derived a family of data-driven iterative optimisation algorithms, combining this LP approach with off-policy Q-learning and randomised experience replay, that compute a near-optimal feedback policy. To make the process recursive, we developed a data-driven policy iteration scheme that solves a sequence of LPs whose maximum size can be bounded; here the "state" of the algorithm is encoded in the set of binding constraints at the optimiser of the LP.

We are also extending this approach to stochastic optimal control problems (SOCPs) posed in the Markov decision process (MDP) formalism. While in classical MDPs an agent aims to minimise a known cost criterion, in many applications one cannot easily specify the cost of a task but can instead observe the behaviour of an expert. It is therefore natural to consider data-driven inverse SOCPs, which consist of inferring a cost criterion from sampled optimal trajectories, as well as the problem of learning a policy that achieves, or even surpasses, the performance of the policy demonstrated by the expert. We derived approximation schemes that are computationally efficient and at the same time provide explicit probabilistic performance bounds on the quality of the recovered solutions. For SOCPs we also introduced a new contractive operator, the Relaxed Bellman Operator (RBO), that can be used to build simpler LPs; in particular, we demonstrated that for linear time-invariant stochastic systems, and for all deterministic systems, the policy retrieved through the RBO coincides with the optimal one without approximation.

For model-based SOCPs we provided a precise interpretation of dynamic programming algorithms as instances of semismooth Newton-type methods; this opens the door both to the development of novel algorithms and to the deployment of advanced numerical solvers to improve scalability. To deal with potentially very large optimisation problems, we exploit the parallel computing architecture of GPUs; our open-source CUDA C implementation has been tested on many large-scale problems and was shown to be up to two orders of magnitude faster than its CPU counterpart. We are currently pursuing a similar parallelisation effort for our Newton-type methods.
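To give a flavour of the LP construction described above, the following minimal sketch (in Python, using the cvxpy modelling package) solves a toy deterministic problem with finite state and input spaces; the dynamics f, stage cost c, and discount factor gamma are illustrative placeholders rather than the systems studied in our work. For a deterministic system the Bellman inequality Q(x,u) <= c(x,u) + gamma * min_{u'} Q(f(x,u),u') splits into one linear constraint per candidate next input, so maximising the sum of Q subject to these constraints recovers the optimal Q-function; in a data-driven variant the same constraints would be assembled from observed transitions (x, u, c, x+) instead of the model.

```python
import numpy as np
import cvxpy as cp

# Toy deterministic MDP (illustrative placeholder, not the systems in the text):
# states {0,...,nx-1}, inputs {0,...,nu-1}, dynamics f and stage cost c tabulated.
nx, nu, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
f = rng.integers(0, nx, size=(nx, nu))        # successor state f(x, u)
c = rng.uniform(0.0, 1.0, size=(nx, nu))      # stage cost c(x, u)

Q = cp.Variable((nx, nu))

# Bellman inequality Q(x,u) <= c(x,u) + gamma * min_{u'} Q(f(x,u), u')
# is equivalent to one linear constraint per candidate next input u'.
constraints = [Q[x, u] <= c[x, u] + gamma * Q[f[x, u], up]
               for x in range(nx) for u in range(nu) for up in range(nu)]

# Maximising the sum of Q over this feasible set yields the optimal Q-function.
cp.Problem(cp.Maximize(cp.sum(Q)), constraints).solve()

policy = np.argmin(Q.value, axis=1)           # greedy policy from the Q-function
print("greedy policy:", policy)
```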
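The semismooth-Newton interpretation can likewise be illustrated on the same toy problem. The Bellman residual R(V) = TV - V is piecewise affine; taking the generalised Jacobian element gamma * P_pi - I associated with the greedy policy pi, one Newton step coincides with exact policy evaluation, i.e. with one sweep of policy iteration. The sketch below is a minimal tabular illustration of this equivalence, not our GPU implementation.

```python
import numpy as np

# Same toy deterministic setup as above (placeholder dynamics and costs).
nx, nu, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
f = rng.integers(0, nx, size=(nx, nu))
c = rng.uniform(0.0, 1.0, size=(nx, nu))

def bellman(V):
    """Bellman operator (T V)(x) = min_u c(x,u) + gamma * V(f(x,u))."""
    return (c + gamma * V[f]).min(axis=1)

V = np.zeros(nx)
for it in range(50):
    residual = bellman(V) - V                 # R(V) = T V - V, piecewise affine
    if np.max(np.abs(residual)) < 1e-10:
        break
    # The greedy policy selects an element of the generalised Jacobian of R:
    # J = gamma * P_pi - I, with P_pi the transition matrix under pi.
    pi = (c + gamma * V[f]).argmin(axis=1)
    P_pi = np.zeros((nx, nx))
    P_pi[np.arange(nx), f[np.arange(nx), pi]] = 1.0
    # Semismooth Newton step V <- V - J^{-1} R(V); this coincides with
    # exact policy evaluation, i.e. one policy-iteration sweep.
    V = V - np.linalg.solve(gamma * P_pi - np.eye(nx), residual)

print("iterations:", it, " optimal value function:", V)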
In a parallel stream, we have been developing methods for removing the "M" in MPC. Model Predictive Control (MPC) methods are very popular in industry and academia, but their reliance on a model sometimes hampers their deployment in settings where models are difficult to obtain and maintain; an example is energy management in buildings and districts. We have been working on Data-Enabled Predictive Control (DeePC) methods that replace the model in the optimisation problem solved by MPC with constraints built directly from data. The main challenge we are addressing here is dealing with systems subject to uncertainty; the key ingredient is appropriate regularisation, based on methods from stochastic programming and robust optimisation. Finally, to bridge the gap between the two classes of methods, we have developed techniques for learning multi-step Q-functions for hybrid systems and using them in an MPC setting.
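The following is a minimal DeePC sketch under simplifying assumptions: a scalar noise-free data-generating system, arbitrarily chosen window lengths T_ini and N and regularisation weights, and the initial trajectory taken from the tail of the offline data rather than from online measurements. Hankel matrices built from a single recorded trajectory replace the model, and a norm penalty on the decision variable g plays the role of the regularisation mentioned above.

```python
import numpy as np
import cvxpy as cp

def hankel(w, L):
    """Block-Hankel matrix with L block rows from a signal w of shape (T, m)."""
    T, m = w.shape
    return np.hstack([w[i:i + L].reshape(L * m, 1) for i in range(T - L + 1)])

# Recorded input/output data from the unknown system (placeholder: a stable
# scalar ARX system excited by persistently exciting random inputs).
rng = np.random.default_rng(0)
T = 120
u_d = rng.normal(size=(T, 1))
y_d = np.zeros((T, 1))
for t in range(1, T):
    y_d[t] = 0.8 * y_d[t - 1] + 0.5 * u_d[t - 1]   # data-generating system

T_ini, N = 4, 10                 # past window and prediction horizon (illustrative)
L = T_ini + N
H_u, H_y = hankel(u_d, L), hankel(y_d, L)
U_p, U_f = H_u[:T_ini], H_u[T_ini:]
Y_p, Y_f = H_y[:T_ini], H_y[T_ini:]

# The most recent T_ini samples pin down the initial condition implicitly
# (in closed loop these would come from online measurements).
u_ini, y_ini = u_d[-T_ini:].flatten(), y_d[-T_ini:].flatten()
r = np.ones(N)                   # output reference to track

g = cp.Variable(H_u.shape[1])
u, y = U_f @ g, Y_f @ g
lam_g = 10.0                     # regularisation weight (illustrative)

# Regularised DeePC: the Hankel-matrix equalities replace the model, and the
# norm penalty on g robustifies the problem against noise in the data.
cost = cp.sum_squares(y - r) + 0.1 * cp.sum_squares(u) + lam_g * cp.norm1(g)
constraints = [U_p @ g == u_ini, Y_p @ g == y_ini]
cp.Problem(cp.Minimize(cost), constraints).solve()

print("first planned input:", u.value[0])
```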