Provably Efficient Algorithms for Large-Scale Reinforcement Learning

Periodic Reporting for period 2 - SCALER (Provably Efficient Algorithms for Large-Scale Reinforcement Learning)

Reporting period: 2023-04-01 to 2024-09-30

The project aims to address the lack of theoretical guarantees in a key area of artificial intelligence research called reinforcement learning (RL). Until the beginning of the project, most progress in this area had been empirical, with theoretical guarantees restricted to small problems of little practical interest, which has hindered the applicability of this technology. The project aims to extend these guarantees to large-scale scenarios capturing more applications than was previously possible, thus making RL-based learning systems safer and more predictable for use in the real world.
The main objective of the project is to advance the state of the art in the theory of reinforcement learning. The proposal identified a number of promising techniques for contributing to this area, organized into two main threads: A) developing RL methods that can effectively work with value-function classes, and B) developing a complementary methodology based on linear-programming relaxations. During the first half of the funding period, the team has made several core contributions along the proposed directions, with a total of 16 papers published in top venues for RL theory research. A summary of the main scientific contributions is given below:

1. Stochastic primal-dual methods for RL. The research plan outlined in Thread B (particularly WP4) has resulted in a sequence of papers developing increasingly powerful RL algorithms using tools from constrained optimization. The most notable achievement on this front is the recent preprint "Offline RL via Feature-Occupancy Gradient Ascent", which has refined the technique to the point of yielding the best currently known algorithm for offline RL in an important class of problems (infinite-horizon linear MDPs). This work builds on our own previous papers published at ALT 2023 and AISTATS 2024, as well as concurrent results of Hong and Tewari (ICML 2024) that also build directly on those works.
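To illustrate the linear-programming view of Markov decision processes that underlies this line of work, below is a minimal sketch of a primal-dual scheme on a tiny tabular MDP. This is our own illustration under simplified assumptions, not the algorithm of the preprint (which works with feature representations in linear MDPs).

```python
# A minimal sketch (illustration only, not the project's algorithm) of the
# LP view of discounted MDPs: the optimal policy solves
#   maximize <mu, r>  over occupancy measures mu
#   s.t.  sum_a mu(s',a) = (1-gamma)*nu0(s') + gamma*sum_{s,a} P(s'|s,a)*mu(s,a).
# Primal-dual methods take gradient steps on the Lagrangian: mirror ascent on
# mu against gradient descent on the value-like dual variables V.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s']: transition kernel
r = rng.uniform(size=(nS, nA))                 # reward function
nu0 = np.ones(nS) / nS                         # initial-state distribution

mu = np.ones((nS, nA)) / (nS * nA)  # primal iterate: occupancy measure
V = np.zeros(nS)                    # dual iterate: Lagrange multipliers
mu_avg = np.zeros_like(mu)          # averaged primal iterate, standard for saddle points
eta_mu, eta_V = 0.5, 0.5

for _ in range(5000):
    # gradient of the Lagrangian in mu: r(s,a) + gamma*E[V(s')|s,a] - V(s)
    adv = r + gamma * (P @ V) - V[:, None]
    # exponentiated-gradient (mirror ascent) step keeps mu on the simplex
    mu *= np.exp(eta_mu * adv)
    mu /= mu.sum()
    mu_avg += mu
    # gradient in V: violation of the occupancy-measure flow constraints
    flow = (1 - gamma) * nu0 + gamma * np.einsum('sap,sa->p', P, mu) - mu.sum(axis=1)
    V -= eta_V * flow

policy = mu_avg / mu_avg.sum(axis=1, keepdims=True)  # policy from averaged occupancy
print(policy.round(3))
```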

2. Optimal transport for Markov chains. In the recent preprint "Bisimulation metrics are optimal transport distances, and can be computed efficiently", we have developed a new framework for computing optimal transport (OT) distances between stochastic processes. This work consolidates decades of research across several areas of computer science, mathematical logic, and probability theory by showing that the successful notion of "bisimulation metrics", used for comparing stochastic processes in some of these communities, is in fact an OT distance; as such, techniques from this seemingly unrelated field can be used to compute these distances efficiently. By reformulating the resulting OT problem as a linear program (in the vein of the research plan posited in Thread B), we provide a range of new, computationally efficient algorithms for calculating distances between Markov chains, and advocate for the use of these distances for representation learning. The resulting methods are already orders of magnitude faster than the best previously known algorithms, but these are only the first steps: the new framework opens up many directions for future research.
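As a concrete (and deliberately naive) illustration of the object being computed, the sketch below iterates the classical fixed-point characterization of bisimulation metrics in the style of Ferns et al., solving each inner OT problem as a transportation LP; the preprint's contribution is precisely to replace such naive schemes with much faster algorithms. All names here are our own illustrative choices.

```python
# Naive fixed-point iteration for the bisimulation (pseudo)metric of a Markov
# chain with transition matrix P and rewards r:
#   d(x, y) <- (1 - c)*|r(x) - r(y)| + c * W_1(P(.|x), P(.|y); d),
# where each W_1 is an optimal transport LP with ground cost d. This is an
# illustration only; it is far slower than the methods of the preprint.
import numpy as np
from scipy.optimize import linprog

def wasserstein(p, q, cost):
    """W_1 distance between distributions p and q under the given ground cost,
    solved as a transportation LP over couplings pi with marginals p and q."""
    n = len(p)
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0  # row marginals: sum_j pi[i, j] = p[i]
        A_eq[n + i, i::n] = 1.0           # column marginals: sum_i pi[i, j] = q[j]
    res = linprog(cost.reshape(-1), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None))
    return res.fun

def bisimulation_metric(P, r, c=0.9, iters=20):
    n = len(r)
    d = np.zeros((n, n))
    for _ in range(iters):  # the update is a c-contraction, so iteration converges
        d = np.array([[(1 - c) * abs(r[x] - r[y]) + c * wasserstein(P[x], P[y], d)
                       for y in range(n)] for x in range(n)])
    return d

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=4)  # a small random Markov chain
r = rng.uniform(size=4)
print(bisimulation_metric(P, r).round(3))
```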

3. Online-to-PAC conversions. In the preprint "Online-to-PAC conversions: Generalization bounds via regret analysis", the PI proposes a new analytic framework for studying the generalization error of statistical learning algorithms. At a high level, the main result in this work shows how a purely statistical question (of uncertainty quantification) can be reduced to a purely algorithmic question (of regret analysis). Among the numerous applications of this result, the potential for advancing the state of the art in RL theory is most relevant to this project. In particular, confidence intervals achieved through the online-to-PAC framework can potentially result in improved methods for exploration-exploitation in online RL and better out-of-sample generalization guarantees for offline RL.
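Schematically, and in our own simplified notation (not taken verbatim from the preprint), the bounds produced by this framework have the following shape:

```latex
% Schematic shape of an online-to-PAC bound (simplified notation, ours): the
% generalization gap of the output W of a learning algorithm trained on an
% i.i.d. sample of size n is controlled by the regret of an online learner in
% an associated "generalization game", plus a martingale term.
\[
  \underbrace{L(W) - \widehat{L}_n(W)}_{\text{generalization gap}}
  \;\le\;
  \frac{1}{n}\,\mathrm{Regret}_n
  \;+\;
  \frac{1}{n}\sum_{t=1}^{n} M_t,
\]
% where L is the population risk, \widehat{L}_n the empirical risk, and (M_t)
% a martingale difference sequence, so the second term is small with high
% probability by standard concentration arguments.
```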

4. DEC and information-directed sampling. The Decision-Estimation Coefficient (DEC) framework of Foster, Kakade, Qian and Rakhlin (2021) has had a major impact on the field of sequential decision making. Their work provides a complete characterization of the worst-case performance of learning algorithms in terms of a single quantity called the DEC. The worst-case nature of this characterization is, however, also a limitation of the framework: real-world problems are often far from the worst-case scenarios considered in their analysis, so more refined versions of their results are necessary to make them practical. In the recently published paper "Optimistic Information-Directed Sampling", we provide a more flexible variant of the DEC characterization, which achieves performance guarantees that adapt to the structure of the problem at hand. In particular, we develop an algorithm that can provably improve on the worst-case DEC lower bounds in a way similar to the classic "information-directed sampling" approach of Russo & Van Roy (2018), but without the restrictive Bayesian assumptions that the latter work relied on.
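For reference, a sketch of the central quantity in simplified notation (our rendering of the definition of Foster et al., 2021):

```latex
% The Decision-Estimation Coefficient of a model class M relative to a
% reference model \bar{M} trades off instantaneous regret against information,
% measured here by the squared Hellinger distance (notation simplified):
\[
  \mathrm{dec}_{\gamma}(\mathcal{M}, \bar{M})
  \;=\;
  \inf_{p \in \Delta(\Pi)}\,
  \sup_{M \in \mathcal{M}}\,
  \mathbb{E}_{\pi \sim p}\Bigl[
      f^{M}(\pi_M) - f^{M}(\pi)
      - \gamma\, D^2_{\mathrm{H}}\bigl(M(\pi), \bar{M}(\pi)\bigr)
  \Bigr],
\]
% where f^M(\pi) is the expected payoff of decision \pi under model M, \pi_M
% is the optimal decision for M, and M(\pi) is the observation distribution
% induced by playing \pi. Information-directed sampling instead minimizes a
% ratio of squared expected regret to information gain; the optimistic variant
% developed in the paper adapts this trade-off to the problem at hand.
```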
All of the above results progressed the state of the art in important areas of reinforcement learning. In the remaining period, we will continue to build on them along the lines explained in the description of action. Beyond the directions explained in that document, the progress made in the first period has led to discoveries that will strongly influence how the rest of the project takes shape. In particular, the newly developed framework for optimal transport between Markov chains is likely to have a large impact on the research plan for the remaining period, particularly due to its potential usefulness for representation learning in sequential decision making. Furthermore, the newly developed methodology for deriving concentration bounds is expected to influence the future work of the team, primarily for uncertainty quantification in realistic RL problems with high-dimensional state spaces. We expect these techniques to enable us to address a class of large-scale reinforcement learning problems that has so far been out of reach for traditional RL theory.