
Computational Learning Theory: compact representation, efficient computation, and societal challenges in learning MDPs

Periodic Reporting for period 2 - COLT-MDP (Computational Learning Theory: compact representation, efficient computation, and societal challenges in learning MDPs)

Reporting period: 2022-04-01 to 2023-09-30

Artificial Intelligence (AI) and Machine Learning (ML) hold great promise for advancing humanity over the coming decades. While progress in AI was initially slow, mainly due to early over-expectations, the field has made tremendous advances in the last decade. Machine learning is without a doubt the main vehicle driving AI forward, providing exciting new applications, from human-level game playing to numerous Internet applications.

Reinforcement learning (RL), a sub-field of machine learning, studies environments in which an agent selects actions, and those actions influence its future rewards. For example, in a chess game the moves an agent makes now are aimed at reaching a winning state at the end of the game (rather than optimizing some myopic reward). Over the years reinforcement learning has had huge successes, from playing backgammon (in the 1990s) to playing Go and Atari games (in recent years). Still, reinforcement learning has a much broader vision, encompassing numerous control applications, including autonomous driving and robotics.

As the outcomes of algorithms affect humans in significant ways, societal challenges in machine learning are becoming increasingly important. Privacy aims to protect individuals' sensitive information. Fairness addresses potential discrimination against individuals or (protected) groups by algorithms. Safety aims to guarantee that RL algorithms will "do no harm". In COLT-MDP we address all of these societal issues.

The COLT-MDP project focuses on three important challenges in reinforcement learning: (1) compact representation, which allows us to represent the huge state spaces that arise in applications, (2) efficient computation, which enables the algorithms to run with realistic computational resources, and (3) societal challenges, such as privacy, fairness and safety.

Very concisely, our main research goal is:

“Design efficient algorithms for fundamental reinforcement learning problems while addressing societal challenges.”

The success of our project will have a significant and lasting impact on the core of reinforcement learning. Our main aim is to advance the state of the art in reinforcement learning by developing novel reinforcement learning models and algorithms that are theoretically sound, practical, and socially responsible.

We have made considerable progress on all three pillars of the COLT-MDP project: (1) compact representation, (2) efficient algorithms, and (3) societal challenges. We have developed a variety of efficient algorithms for various reinforcement learning models, which in many cases use a compact representation. We have studied a variety of societal challenges, including privacy, fairness and safety.

The most prominent model in reinforcement learning is the Markov Decision Process (MDP). One can conceptualize an MDP by considering, for example, a video game. The states of the MDP encode the current video screen. The actions can be viewed as playing the game using a controller. Given a current state (screen) and action (on the controller), the MDP moves to a new state (the resulting new screen). The goal is to learn a good policy, a mapping from states (screens) to actions (controller inputs), that maximizes the probability of winning the game.
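To make this concrete, the following is a minimal, self-contained Python sketch of a toy MDP, with value iteration used to extract a policy mapping states to actions. The states, actions, probabilities and rewards are purely illustrative and are not taken from the project.

# Toy MDP (illustrative only): P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    "start": {"safe":  [(1.0, "mid", 0.0)],
              "risky": [(0.5, "win", 1.0), (0.5, "lose", 0.0)]},
    "mid":   {"safe":  [(1.0, "win", 0.7)],
              "risky": [(0.9, "win", 1.0), (0.1, "lose", 0.0)]},
    "win":   {"stay":  [(1.0, "win", 0.0)]},
    "lose":  {"stay":  [(1.0, "lose", 0.0)]},
}
gamma = 0.95  # discount factor

# Value iteration: repeatedly back up the expected return of the best action in each state.
V = {s: 0.0 for s in P}
for _ in range(200):
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values())
         for s in P}

# The greedy policy maps each state to the action with the highest backed-up value.
policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
          for s in P}
print(policy)

Here the policy plays the role of the mapping from screens to controller inputs described above, while value iteration stands in for the learning procedures studied in the project, which must work without knowing the transition probabilities in advance.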

Our main performance criterion is regret minimization. Regret minimization captures the difference between a learner that has to learn a good policy in an unknown environment and an agent that knows the environment and simply uses the optimal policy. The goal of the learner is to have the average difference, per time step, vanish as the number of time steps grows. This implies that the learner's average performance is near optimal.
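In symbols (a standard formulation; the notation is chosen here for illustration rather than taken from a specific paper), if \pi^* denotes the optimal policy, \pi_t the policy the learner uses in episode t, and V^{\pi}(s_1) the expected return of policy \pi from the start state, the regret after T episodes is

R_T = \sum_{t=1}^{T} \left( V^{\pi^*}(s_1) - V^{\pi_t}(s_1) \right),

and the goal is that the average regret R_T / T vanishes as T grows.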

Some of the main results achieved thus far are:

(1) We have developed a variety of regret minimization algorithms. One concrete example is the case of contextual MDPs [AAAI 2023a], where at each episode a new user (context) arrives, and the user influences both the rewards and the dynamics of the environment. The goal is to obtain a regret bound that is independent of the size of the context class (the number of possible users), which is potentially huge.

(2) We have addressed the issue of delays in a reinforcement learning environment. In many cases the learner observes the rewards with a delay, and in some cases the delay is unpredictable. We devised algorithms that are able to handle such a challenging delay model and still achieve a vanishing average regret [AAAI 2022b, NeurIPS 2022b] (a toy illustration of the delay bookkeeping appears after item (3) below).

(3) In many settings more than a single agent is used to learn the environment. We have studied a model in which multiple agents cooperate in order to accelerate the learning process. This cooperation requires coordination between the agents and sharing of information. We developed learning algorithms that achieve this task with near-optimal regret [ICML 2022c].
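To make the delay setting of item (2) concrete, here is a minimal, hypothetical Python sketch of an epsilon-greedy learner that buffers rewards until their delay elapses and only then updates its estimates. It illustrates the bookkeeping only and is not the algorithm of [AAAI 2022b] or [NeurIPS 2022b]; all quantities (number of arms, delays, reward means) are made up for illustration.

import random
from collections import defaultdict

K, T, eps = 3, 10_000, 0.1
true_means = [0.2, 0.5, 0.8]           # unknown to the learner
counts = [0] * K
sums = [0.0] * K
pending = defaultdict(list)            # arrival time -> list of (arm, reward)

for t in range(T):
    # Incorporate rewards whose delay has elapsed.
    for arm, reward in pending.pop(t, []):
        counts[arm] += 1
        sums[arm] += reward

    # Epsilon-greedy choice based on the (possibly stale) estimates.
    if random.random() < eps or all(c == 0 for c in counts):
        arm = random.randrange(K)
    else:
        arm = max(range(K), key=lambda a: sums[a] / counts[a] if counts[a] else 0.0)

    # The environment draws a reward now but reveals it only after a random delay.
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    pending[t + random.randint(1, 50)].append((arm, reward))

print([round(sums[a] / counts[a], 2) if counts[a] else None for a in range(K)])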


Our research on societal challenges was highly successful.
We have developed efficient privacy-preserving algorithms, using differential privacy, for a variety of tasks, such as clustering similar points together [ICML 2021f, ICML 2022d], prediction using linear separators [NeurIPS 2020b], learning an optimal action [NeurIPS 2021a], and more. We addressed fairness between user sub-groups by considering multi-calibration, which requires that the average predicted and realized values be similar within each sub-group. We have derived new generalization bounds, bounding the difference between the predicted and realized values, for the multi-calibration framework [NeurIPS 2020c].
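One common way to state the multi-calibration requirement described above (the notation is illustrative and not taken verbatim from [NeurIPS 2020c]) is: a predictor f is \alpha-multi-calibrated with respect to a collection of sub-groups \mathcal{G} if, for every group g \in \mathcal{G} and every predicted value v,

\left| \mathbb{E}\left[ y - f(x) \mid x \in g,\; f(x) = v \right] \right| \le \alpha,

i.e. within every sub-group, and at every prediction level, the average realized outcome is close to the average predicted value.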
The results obtained so far have achieved the state of the art in multiple tasks and domains. For example:
1. The state-of-the-art regret bound (which is also near optimal) for learning stochastic shortest paths [NeurIPS 2021f].
2. A cooperative learning model and a near-optimal regret algorithm [ICML 2022c].
3. State-of-the-art delay-aware learning algorithms [AAAI 2022b, NeurIPS 2022b].
4. Highly robust algorithms designed using differential privacy methodologies [NeurIPS 2020d, JACM 2022, STOC 2022].
5. State-of-the-art generalization bounds for multi-calibration predictors [NeurIPS 2020c].
6. The introduction of adversarial rewards to dueling bandits, with near-optimal regret bounds [ICML 2021a].
7. The introduction of teams to dueling bandits, with near-optimal regret bounds [NeurIPS 2021c].
8. A state-of-the-art factored MDP learning algorithm [NeurIPS 2021h].
9. A state-of-the-art algorithm for multi-agent private learning of the best action [NeurIPS 2021a].

In the future we plan to continue studying fundamental questions in reinforcement learning and societal challenges, following the COLT-MDP project proposal. We plan to improve and extend our results for contextual MDPs, so that they also apply to adversarial rewards. We plan to consider the linear MDP model and improve its state-of-the-art regret minimization bounds. We will consider multiple learning agents interacting in both cooperative and competitive environments. We plan to develop additional differentially private and fairness-related algorithms for a variety of machine learning and reinforcement learning tasks. We expect our future results to improve the state of the art, as well as advance our understanding of reinforcement learning methodology for general AI.