Periodic Reporting for period 2 - SCALER (Provably Efficient Algorithms for Large-Scale Reinforcement Learning)
Reporting period: 2023-04-01 to 2024-09-30
1. Stochastic primal-dual methods for RL. The research plan outlined in Thread B (particularly WP4) has resulted in a sequence of papers developing increasingly effective RL algorithms using tools from constrained optimization. The most notable achievement on this front is the recent preprint "Offline RL via Feature-Occupancy Gradient Ascent", which refines the technique to the point of yielding the best currently known algorithm for offline RL in an important class of problems (infinite-horizon linear MDPs). This work builds on our own previous works published at ALT 2023 and AISTATS 2024, as well as on concurrent results of Hong and Tewari (ICML 2024) that also build directly on the same previous works.
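To illustrate the occupancy-measure viewpoint underlying these primal-dual methods, the sketch below writes a toy discounted MDP's planning problem as the standard linear program over state-action occupancy measures. The tiny MDP and all numbers are hypothetical, and for brevity the LP is solved directly with scipy rather than by stochastic primal-dual gradient steps as in the papers:

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
nS, nA = 2, 2
# Hypothetical toy MDP: action 1 in state 0 moves to state 1;
# action 0 in state 1 stays there and earns reward 1.
P = np.zeros((nS, nA, nS))
P[0, 0, 0] = 1.0; P[0, 1, 1] = 1.0
P[1, 0, 1] = 1.0; P[1, 1, 0] = 1.0
r = np.zeros((nS, nA)); r[1, 0] = 1.0
nu0 = np.array([1.0, 0.0])           # initial-state distribution

# Occupancy-measure LP:  max_mu <mu, r>
# s.t. sum_a mu(s',a) = (1-gamma) nu0(s') + gamma sum_{s,a} P(s'|s,a) mu(s,a),
#      mu >= 0.
A_eq = np.zeros((nS, nS * nA))
for sp in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[sp, s * nA + a] = (1.0 if s == sp else 0.0) - gamma * P[s, a, sp]
res = linprog(-r.reshape(-1), A_eq=A_eq, b_eq=(1 - gamma) * nu0, bounds=(0, None))
mu = res.x.reshape(nS, nA)
policy = mu.argmax(axis=1)           # greedy policy read off the occupancy measure
```

Primal-dual methods such as those studied in the project work with the Lagrangian of (parametric versions of) this LP instead of solving it monolithically, which is what makes them applicable at scale.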
2. Optimal transport for Markov chains. In the recent preprint "Bisimulation metrics are optimal transport distances, and can be computed efficiently", we have developed a new framework for computing optimal transport (OT) distances between stochastic processes. This work consolidates decades of research in a variety of areas of computer science, mathematical logic, and probability theory by showing that the successful notion of "bisimulation metrics" used for comparing stochastic processes in some of these communities is in fact an OT distance, and that, as such, techniques from this seemingly unrelated field can be used to compute these distances efficiently. By reformulating the resulting OT problem as a linear program (in the vein of the research plan posited in Thread B), we provide a range of new, computationally efficient algorithms for calculating distances between Markov chains, and advocate for the use of these distances for representation learning. The resulting methods are already orders of magnitude faster than the best previously known algorithms, but these are really just the first steps: the new framework opens up many potential directions for future research.
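The following minimal sketch illustrates the general idea of computing a bisimulation-style metric by iterating an OT fixed point, with each transport subproblem solved as a small linear program. The function names, the plain LP solver, and the fixed-point recursion shown here are illustrative assumptions, not the optimized algorithms developed in the paper:

```python
import numpy as np
from scipy.optimize import linprog

def ot_distance(p, q, cost):
    """Solve the optimal transport LP between discrete distributions p and q
    with ground cost matrix `cost`, by optimizing over coupling matrices."""
    n, m = len(p), len(q)
    A_eq = []
    for i in range(n):                 # row marginals: sum_j pi[i, j] = p[i]
        row = np.zeros((n, m)); row[i, :] = 1.0
        A_eq.append(row.reshape(-1))
    for j in range(m):                 # column marginals: sum_i pi[i, j] = q[j]
        col = np.zeros((n, m)); col[:, j] = 1.0
        A_eq.append(col.reshape(-1))
    res = linprog(cost.reshape(-1), A_eq=np.array(A_eq),
                  b_eq=np.concatenate([p, q]), bounds=(0, None))
    return res.fun

def bisimulation_metric(P, r, gamma=0.9, n_iter=50):
    """Fixed-point iteration for a bisimulation-style metric on a Markov chain
    with transition matrix P and per-state rewards r:
        d(x, y) <- |r[x] - r[y]| + gamma * OT_d(P[x], P[y]).
    """
    n = P.shape[0]
    d = np.abs(r[:, None] - r[None, :])
    for _ in range(n_iter):
        d_new = np.empty_like(d)
        for x in range(n):
            for y in range(n):
                d_new[x, y] = abs(r[x] - r[y]) + gamma * ot_distance(P[x], P[y], d)
        d = d_new
    return d
```

Each iteration is a contraction with factor gamma, so the recursion converges to the unique fixed point; the paper's contribution can be read as showing that this fixed point coincides with a genuine OT distance between the processes, which unlocks much faster solvers than the naive per-pair LPs used above.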
3. Online-to-PAC conversions. In the preprint "Online-to-PAC conversions: Generalization bounds via regret analysis", the PI proposes a new analytic framework for studying the generalization error of statistical learning algorithms. At a high level, the main result in this work shows how a purely statistical question (of uncertainty quantification) can be reduced to a purely algorithmic question (of regret analysis). Among the numerous applications of this result, the most relevant to this project is its potential for advancing the state of the art in RL theory. In particular, confidence intervals achieved through the online-to-PAC framework can potentially result in improved methods for exploration-exploitation in online RL and better out-of-sample generalization guarantees for offline RL.
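As a toy illustration of the regret-analysis side of such a reduction, the snippet below runs the classic exponential-weights (Hedge) algorithm over a finite set of experts and checks its standard regret bound; an online-to-PAC conversion lets generalization guarantees inherit bounds of exactly this form. The simulated losses and parameters are illustrative and not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 500, 8                       # rounds (samples) and experts (hypotheses)
losses = rng.random((T, N))         # per-round losses in [0, 1]

eta = np.sqrt(8 * np.log(N) / T)    # standard Hedge learning rate
log_w = np.zeros(N)                 # log-weights, for numerical stability
learner_loss = 0.0
for t in range(T):
    p = np.exp(log_w - log_w.max())
    p /= p.sum()                    # current distribution over experts
    learner_loss += p @ losses[t]   # expected loss of the randomized learner
    log_w -= eta * losses[t]        # multiplicative-weights update

best_loss = losses.sum(axis=0).min()        # loss of the best expert in hindsight
regret = learner_loss - best_loss
bound = np.sqrt(T * np.log(N) / 2)          # classic Hedge regret guarantee
```

Any online learner with sublinear regret can be plugged into the framework in the same role, which is what makes the reduction from statistics to algorithms useful.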
4. DEC and information-directed sampling. The DEC framework of Foster, Kakade, Qian and Rakhlin (2021) has had major impact on the field of sequential decision making. Their work provides a complete characterization of the worst-case performance of learning algorithms in terms of a single quantity called the DEC. The worst-case nature of this characterization is, however, also a limitation of their framework: real-world problems are often far from the worst-case scenarios considered in their analysis, and thus more refined versions of their results are necessary to make them more practical. In the recently published paper "Optimistic Information-Directed Sampling", we provide a more flexible variant of the DEC characterization, which achieves performance guarantees that adapt to the structure of the problem at hand. In particular, we develop an algorithm that can provably improve on the worst-case DEC lower bounds in a way similar to the classic "information-directed sampling" approach of Russo & Van Roy (2018), but without having to make the restrictive Bayesian assumptions that this latter work relied on.
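A minimal sketch of the information-directed sampling principle that this line of work refines: for a Bernoulli bandit with independent Beta posteriors, pick the arm minimizing the ratio of squared expected regret to information gain, here with a variance-based proxy for the information gain. The helper name and toy setup are hypothetical; this follows the Bayesian variance-based IDS of Russo & Van Roy, not the optimistic frequentist variant developed in the paper:

```python
import numpy as np

def ids_action(alpha, beta, n_samples=2000, rng=None):
    """One step of variance-based information-directed sampling for a
    Bernoulli bandit with independent Beta(alpha[a], beta[a]) posteriors."""
    rng = rng or np.random.default_rng()
    K = len(alpha)
    theta = rng.beta(alpha, beta, size=(n_samples, K))   # posterior samples
    a_star = theta.argmax(axis=1)                        # sampled optimal arms
    mean = theta.mean(axis=0)
    delta = theta.max(axis=1).mean() - mean              # expected regret per arm
    # Variance-based information gain: how far the posterior mean of each arm
    # moves when conditioning on the identity of the optimal arm.
    info = np.zeros(K)
    for a in range(K):
        for s in range(K):
            mask = a_star == s
            if mask.any():
                info[a] += mask.mean() * (theta[mask, a].mean() - mean[a]) ** 2
    # Deterministic IDS: minimize the regret-information ratio.
    return int(np.argmin(delta ** 2 / (info + 1e-12)))
```

The ratio explicitly trades off exploitation (small expected regret) against exploration (large information gain), which is the mechanism that allows IDS-style methods to beat worst-case DEC guarantees on benign instances.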