Periodic Reporting for period 4 - RADDICS (Reliable Data-Driven Decision Making in Cyber-Physical Systems)
Reporting period: 2023-07-01 to 2024-06-30
The RADDICS project developed novel RL algorithms that are provably reliable, even when deployed on high-stakes applications. Our approach hinges upon marrying nonparametric statistical learning with robust optimization. In particular, we use Bayesian approaches to quantify uncertainty in the predictions, in a way that yields valid high-probability confidence estimates about the unknown dynamics and rewards, even under possibly adversarial circumstances. We then act safely under all plausible models by employing tools from robust optimization and control theory. Additional observations contract the posterior, allowing us to learn and improve policies over time in a safe manner. Beyond developing new algorithms and theory, we demonstrated our approach on several real-world applications, ranging from robotics to scientific applications such as safely tuning the SwissFEL Free Electron Laser, a large scientific facility operated by the Paul Scherrer Institute.
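As an illustration of this recipe (not the project's actual code), the following minimal Python sketch uses scikit-learn Gaussian processes to form high-probability confidence bounds and then acts robustly: only candidate actions whose pessimistic safety estimate clears a threshold are admissible, and among those the one with the best pessimistic reward is chosen. The kernel, the confidence scaling beta and the safety threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Observed (action, reward) and (action, safety-measurement) data (synthetic).
rng = np.random.default_rng(0)
X_obs = rng.uniform(-1.0, 1.0, size=(15, 1))
y_reward = np.sin(3 * X_obs[:, 0]) + 0.05 * rng.standard_normal(15)
y_safety = 1.0 - X_obs[:, 0] ** 2 + 0.05 * rng.standard_normal(15)

# Nonparametric (GP) models of the unknown reward and safety functions.
gp_reward = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-2).fit(X_obs, y_reward)
gp_safety = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-2).fit(X_obs, y_safety)

# High-probability confidence bounds mu +/- beta * sigma over candidate actions.
candidates = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
beta = 2.0  # confidence-interval scaling (assumption for this sketch)
mu_r, sd_r = gp_reward.predict(candidates, return_std=True)
mu_s, sd_s = gp_safety.predict(candidates, return_std=True)

# Act safely under all plausible models: admissible actions must satisfy the
# safety constraint even for the pessimistic (lower-bound) model; among those,
# pick the action with the best pessimistic reward.
safety_threshold = 0.5
safe = (mu_s - beta * sd_s) >= safety_threshold
assert safe.any(), "no provably safe candidate under the current confidence bounds"
robust_value = np.where(safe, mu_r - beta * sd_r, -np.inf)
best_action = candidates[np.argmax(robust_value)]
print("robustly chosen action:", best_action)
```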
We have also generalised insights from Bayesian optimization and model-free RL to model-based RL, allowing us to scale to much more complex control problems in a data-efficient manner. In particular, we have been able to extend our previous SafeMDP approach from pure safe exploration to accommodate the exploration-exploitation dilemma. The key insight is to reconsider the notion of expanding the safe set in a goal-directed fashion. As a key contribution, we developed the H-UCRL algorithm, the first practical approach to model-based deep RL based on probabilistic confidence bounds. As a central feature, it employs a reparametrization idea that allows state-of-the-art deep policy-gradient algorithms to be used to solve the introspective planning problems in the model-based RL loop. Beyond H-UCRL, we generalised this technique to robust, constrained, and multi-agent RL, demonstrating its generality and flexibility.
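The reparametrization idea can be sketched as follows: the epistemic uncertainty of the learned model is exposed to the policy as an auxiliary "hallucinated" control signal, so that planning over all plausible models reduces to a standard policy optimisation problem that any deep policy-gradient method can solve. The toy model, variable names and constants below are illustrative assumptions, not the project's implementation.

```python
import numpy as np

def hallucinated_step(state, action, model, beta, eta):
    """One step of the reparametrised ("hallucinated") dynamics.

    Rather than searching over all plausible dynamics models directly, the
    model's epistemic uncertainty is exposed as an auxiliary control signal
    eta in [-1, 1]^d:
        s' = mu(s, a) + beta * sigma(s, a) * eta.
    A policy that outputs both the real action and eta can then be trained
    with a standard deep policy-gradient algorithm.
    """
    mu, sigma = model(state, action)       # predictive mean and std of the learned model
    eta = np.clip(eta, -1.0, 1.0)          # hallucinated control stays inside the confidence set
    return mu + beta * sigma * eta

def toy_model(state, action):
    """Illustrative learned dynamics model with constant epistemic uncertainty."""
    mu = 0.9 * state + 0.1 * action        # mean prediction
    sigma = 0.05 * np.ones_like(state)     # epistemic std (assumption for this sketch)
    return mu, sigma

state = np.zeros(2)
action = np.array([0.3, -0.1])
eta = np.array([1.0, -1.0])                # chosen by the policy alongside the real action
next_state = hallucinated_step(state, action, toy_model, beta=2.0, eta=eta)
print(next_state)
```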
Lastly, based on the principle of Information-Directed Sampling, we have been able to extend our confidence-based Bayesian optimization approach to a rich family of partially observed decision tasks known as Partial Monitoring. Such tasks provide a natural abstraction of multi-fidelity and preference-based optimization, both of central relevance to the RADDICS project.
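For intuition, a minimal sketch of (deterministic) Information-Directed Sampling is given below: it estimates per-action regret and information gain from posterior samples and picks the action minimising the information ratio. Using the posterior variance as the information-gain surrogate is a simplifying assumption of this sketch, not the measure used in the partial-monitoring setting.

```python
import numpy as np

def information_directed_sampling(posterior_samples):
    """Pick an action by (deterministic) Information-Directed Sampling.

    posterior_samples has shape (num_samples, num_actions): each row is one
    plausible reward vector drawn from the current posterior. The rule trades
    off estimated regret against information gain by minimising the ratio
        Delta(a)^2 / I(a),
    where I(a) is approximated here by the posterior variance of action a
    (a crude surrogate used only for illustration).
    """
    mean = posterior_samples.mean(axis=0)
    best_value = posterior_samples.max(axis=1).mean()   # E[max_a f(a)]
    regret = np.maximum(best_value - mean, 1e-12)        # estimated per-action regret
    info_gain = posterior_samples.var(axis=0) + 1e-12    # information-gain surrogate
    ratio = regret ** 2 / info_gain
    return int(np.argmin(ratio))

# Usage: three actions, posterior belief represented by 1000 samples.
rng = np.random.default_rng(1)
samples = rng.normal(loc=[0.5, 0.6, 0.4], scale=[0.05, 0.3, 0.01], size=(1000, 3))
print("IDS action:", information_directed_sampling(samples))
```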
These results have led to publications at premier machine learning conferences and journals such as NeurIPS, ICML, AISTATS, COLT, ICLR, IJCAI, and JMLR, as well as to several invited talks and tutorials.