
Reliable Data-Driven Decision Making in Cyber-Physical Systems

Periodic Reporting for period 4 - RADDICS (Reliable Data-Driven Decision Making in Cyber-Physical Systems)

Reporting period: 2023-07-01 to 2024-06-30

Fueled by an exponential growth in data and compute power, recent years have witnessed breathtaking advances in the field of machine learning (ML). This has led to a technology race in industry, with ML models being adopted in an ever-increasing array of applications. As a consequence, there is great enthusiasm for deploying ML models in increasingly high-stakes applications, such as self-driving cars and medical applications, where ML algorithms are used to autonomously make decisions in the real world. This marks a departure from the best-studied realm of ML, namely supervised learning (where the goal is to learn to predict), into the domain of reinforcement learning (RL, where the goal is to learn to act). This abstract model considers an agent that seeks to make decisions in an uncertain world. Since the agent does not know how the world works, it faces the dilemma of trading off exploration – conducting experiments to better understand the consequences of its actions and the associated rewards – against exploitation – using what it has learned to make effective decisions.

Similar to supervised learning, RL has experienced dramatic breakthroughs in recent years, including DeepMind's landmark victories with AlphaGo and AlphaZero in the game of Go. Strikingly, most of these successes are in games: perfectly controlled environments where – given enough computational power – virtually unlimited exploration is possible. In most real-world systems, however, which are characterized by high complexity and large amounts of uncertainty, at best approximate simulators are available. As a consequence, learning has to happen, at least partially, based on observations from real, physical systems. Suddenly, the notion of exploration becomes a dangerous proposition: it means experimenting with actions whose consequences are uncertain. This fact disqualifies most existing approaches to RL, which rely on unconstrained – and possibly unsafe – exploration.

The RADDICS project developed novel RL algorithms that are provably reliable, even when deployed in high-stakes applications. Our approach hinges on marrying nonparametric statistical learning with robust optimization. In particular, we use Bayesian approaches to quantify uncertainty in the predictions, in a way that yields valid high-probability confidence estimates about the unknown dynamics and rewards, even under possibly adversarial circumstances. We then act safely under all plausible models by employing tools from robust optimization and control theory. Additional observations contract the posterior, allowing us to learn and improve policies over time in a safe manner. Beyond developing new algorithms and theory, we demonstrated our approach on several real-world applications, ranging from robotics to scientific applications such as safely tuning the SwissFEL Free Electron Laser, a large scientific facility operated by the Paul Scherrer Institute.
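The following is a minimal sketch of this recipe in its simplest form: a Gaussian process posterior provides high-probability confidence bounds, and only actions that satisfy the safety constraint under every plausible model (i.e. whose pessimistic lower confidence bound clears the threshold) are ever evaluated. It is an illustration only, not the project's implementation; the toy reward and constraint functions and the constants beta and safety_threshold are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

def unknown_constraint(x):
    # Stand-in for the real, unknown safety-critical quantity of the system.
    return np.sin(3 * x) + 0.5

def unknown_reward(x):
    # Stand-in for the real, unknown objective.
    return -(x - 0.6) ** 2

safety_threshold = 0.0   # the constraint must stay above this value
beta = 2.0               # confidence-bound width (assumed; tuned in practice)

# Start from a small set of known-safe observations.
X = np.array([[0.4], [0.5]])
y_c = unknown_constraint(X).ravel()
y_r = unknown_reward(X).ravel()

gp_c = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-3)
gp_r = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-3)

candidates = np.linspace(0.0, 1.0, 200).reshape(-1, 1)

for t in range(15):
    gp_c.fit(X, y_c)
    gp_r.fit(X, y_r)
    mu_c, std_c = gp_c.predict(candidates, return_std=True)
    mu_r, std_r = gp_r.predict(candidates, return_std=True)

    # Safe set: actions whose constraint is satisfied under every plausible model.
    safe = (mu_c - beta * std_c) >= safety_threshold
    if not safe.any():
        break

    # Among safe actions, pick the one with the best optimistic reward (UCB).
    ucb = mu_r + beta * std_r
    ucb[~safe] = -np.inf
    x_next = candidates[np.argmax(ucb)]

    # Observe the real system; additional data contracts the posterior.
    X = np.vstack([X, x_next[None, :]])
    y_c = np.append(y_c, unknown_constraint(x_next)[0])
    y_r = np.append(y_r, unknown_reward(x_next)[0])

print("best safe action found:", X[np.argmax(y_r)])
```

As more observations arrive, the posterior contracts, the certified safe set grows, and the optimizer can reach better operating points without ever evaluating an action that could be unsafe under a plausible model.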
The RADDICS project has been successful across all four research threads. For example, we have discovered a novel approach for high-dimensional safe Bayesian optimization, which we already demonstrated in one of the RADDICS application domains (the SwissFEL Free Electron Laser at the Paul Scherrer Institute), where it yielded considerable performance improvements compared to prior work. In addition, we have made substantial progress in extending our approaches towards deep Bayesian models. For example, we have developed a novel meta-learning approach for learning priors for Gaussian process models, parametrized through deep neural networks. Our approach, called PACOH, is computationally attractive and yields PAC-Bayesian generalization bounds. We have already demonstrated its utility on sequential tasks such as Bayesian optimization.
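As a rough illustration of the meta-learning idea behind PACOH (not its actual implementation, which parametrizes priors with deep neural networks and optimizes a PAC-Bayesian objective), the following sketch learns a parametric GP prior mean from several related meta-training tasks by maximizing the summed marginal likelihoods with a regularizer towards a hyper-prior. The feature map, tasks and constants are assumptions chosen so that the objective has a closed-form solution.

```python
import numpy as np

rng = np.random.default_rng(1)

def features(x):
    # Fixed feature map standing in for a neural network (illustrative assumption).
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

def rbf_kernel(a, b, ls=0.3):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

noise, lam = 0.1, 1.0  # observation noise and hyper-prior strength (assumed)

# Meta-training tasks: noisy draws from related quadratic functions.
tasks = []
for _ in range(10):
    x = rng.uniform(-1, 1, size=20)
    a = 1.0 + 0.2 * rng.standard_normal()
    y = a * x ** 2 - 0.5 * x + noise * rng.standard_normal(20)
    tasks.append((x, y))

# Maximizing the summed GP log marginal likelihoods in the prior-mean weights w
# (plus an L2 hyper-prior term) is a regularized generalized least-squares problem.
A = 2 * lam * np.eye(3)
b = np.zeros(3)
for x, y in tasks:
    K = rbf_kernel(x, x) + noise ** 2 * np.eye(len(x))
    Phi = features(x)
    Kinv_Phi = np.linalg.solve(K, Phi)   # K^{-1} Phi
    A += Phi.T @ Kinv_Phi
    b += Kinv_Phi.T @ y

w = np.linalg.solve(A, b)
print("learned prior mean coefficients:", w)  # shared structure across the tasks
```

The learned prior mean is then used as the GP prior on a new task, so that even a handful of observations yields well-calibrated uncertainty, which is exactly what sequential decision tasks such as Bayesian optimization need.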

We have also generalised insights from Bayesian optimization and model-free RL to model-based RL, allowing us to scale to much more complex control problems in a data-efficient manner. In particular, we have been able to extend our previous SafeMDP approach from pure safe exploration to accommodate the exploration-exploitation dilemma. The key insight is to reconsider the notion of expansion of the safe set in a goal-directed fashion. As a key contribution, we developed the H-UCRL algorithm, the first practical approach towards model-based deep RL based on probabilistic confidence bounds. As a central feature, it employs a reparametrization idea that allows utilising state-of-the-art deep policy-gradient algorithms to solve the introspective planning problems arising in the model-based RL loop. Beyond H-UCRL, we generalised this technique to robust, constrained and multi-agent RL, demonstrating its generality and flexibility.
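The reparametrization idea can be sketched as follows: an auxiliary "hallucination" action eta in [-1, 1]^d selects, within the model's confidence interval, the next state the agent optimistically believes it can reach, so that planning over the pair (action, eta) becomes a standard RL problem amenable to off-the-shelf policy-gradient methods. The learned-model interface (mean and per-dimension standard deviation), the toy model and the constant beta below are assumptions for illustration.

```python
import numpy as np

beta = 2.0  # confidence-bound scaling (assumed; tuned in practice)

def hallucinated_step(model_mean, model_std, state, action, eta):
    """One transition under the optimistic (hallucinated) dynamics.

    model_mean, model_std: callables returning the model's predictive mean and
    per-dimension standard deviation of the next state.
    eta: auxiliary action in [-1, 1]^d chosen jointly with the control action.
    """
    mu = model_mean(state, action)
    sigma = model_std(state, action)
    return mu + beta * sigma * np.clip(eta, -1.0, 1.0)

# Toy illustration: a 1-D model that is more uncertain far from the origin.
model_mean = lambda s, a: s + a
model_std = lambda s, a: 0.1 + 0.5 * np.abs(s)

s, a = np.array([1.0]), np.array([0.2])
print(hallucinated_step(model_mean, model_std, s, a, eta=np.array([+1.0])))  # optimistic next state
print(hallucinated_step(model_mean, model_std, s, a, eta=np.array([-1.0])))  # pessimistic next state
```

Because eta is just another bounded action, the optimistic (or, with the sign flipped, robust) planning problem inherits the gradients and machinery of standard deep policy-gradient training.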

Lastly, based on the principle of Information-Directed Sampling, we have extended our confidence-based Bayesian optimization approach towards a rich family of partially observed decision tasks called Partial Monitoring. Such tasks provide a natural abstraction of multi-fidelity and preference-based optimization, which are of central relevance to the RADDICS project.
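To convey the principle, the sketch below applies Information-Directed Sampling to a toy Gaussian bandit: at each round the agent picks the action minimizing the ratio of squared expected regret to information gain about the identity of the optimal action, here using a variance-based surrogate estimated from posterior samples. The environment, priors and constants are illustrative assumptions; the project's partial-monitoring setting is considerably more general than this example.

```python
import numpy as np

rng = np.random.default_rng(2)
true_means = np.array([0.2, 0.8, 0.5])
noise, n_arms, n_samples = 0.5, 3, 1000

counts = np.ones(n_arms)
sums = rng.normal(true_means, noise)  # one initial pull per arm

for t in range(300):
    post_mean = sums / counts
    post_std = noise / np.sqrt(counts)

    # Joint posterior samples (arms are independent Gaussians here).
    theta = rng.normal(post_mean, post_std, size=(n_samples, n_arms))
    best = np.argmax(theta, axis=1)

    # Expected regret of each arm under the posterior.
    delta = np.mean(theta[np.arange(n_samples), best][:, None] - theta, axis=0)

    # Variance-based information gain: how much each arm's conditional mean
    # varies with the (unknown) identity of the optimal arm.
    info = np.zeros(n_arms)
    for k in range(n_arms):
        mask = best == k
        if mask.any():
            info += mask.mean() * (theta[mask].mean(axis=0) - theta.mean(axis=0)) ** 2

    # IDS: minimize squared regret per unit of information gained.
    a = int(np.argmin(delta ** 2 / np.maximum(info, 1e-12)))
    sums[a] += rng.normal(true_means[a], noise)
    counts[a] += 1

print("pulls per arm:", counts)  # pulls concentrate on the best arm over time
```

Unlike purely optimistic strategies, the information ratio lets the agent deliberately take actions that are suboptimal in expectation but highly informative, which is what makes the principle suitable for partially observed feedback such as comparisons.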

These results have led to publications at premier machine learning conferences and journals, including NeurIPS, ICML, AISTATS, COLT, ICLR, IJCAI and JMLR, as well as to several invited talks and tutorials.
Our results go significantly beyond the state of the art, substantially improving the reliability of Bayesian optimization and reinforcement learning approaches. Our work on high-dimensional safe Bayesian optimization allows us to tackle much more complex problems than prior work. Our meta-learning work on PACOH yields state-of-the-art predictive performance, in particular regarding uncertainty quantification, which is of central importance for sequential decision-making tasks. Our work on Information-Directed Sampling resulted in the first approach to Bayesian optimization from comparisons with sublinear regret bounds. Our work on model-based reinforcement learning allows us to generalise results from Bayesian optimization to much more complex settings, while remaining amenable to standard deep policy-gradient algorithms.