We have made considerable progress on all three pillars of the COLT-MDP project: (1) compact representation, (2) efficient algorithms and (3) societal challenges. We have developed efficient algorithms for a range of reinforcement learning models, many of which use a compact representation. We have also studied several societal challenges, including privacy, fairness and safety.
The most prominent model for reinforcement learning is the Markov Decision Process (MDP). One can conceptualize an MDP by considering, for example, a video game. The states of the MDP encode the current video screen. The actions correspond to the inputs the player gives through the controller. Given the current state (screen) and an action (controller input), the MDP moves to a new state (the resulting screen). The goal is to learn a good policy, a mapping from states (screens) to actions (controller inputs) that maximizes the probability of winning the game.
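As a rough illustration of the states, actions, transitions and policies described above, here is a minimal sketch of a toy MDP. The three-state "game", its transition probabilities, and the names `TRANSITIONS`, `step` and `win_rate` are all hypothetical and chosen only for illustration; they are not taken from the project's work.

```python
import random

# Hypothetical 3-state toy "game"; all names and numbers are illustrative.
STATES = ["start", "middle", "win"]
ACTIONS = ["left", "right"]

# TRANSITIONS[state][action] is a list of (next_state, probability) pairs.
TRANSITIONS = {
    "start": {"left": [("start", 1.0)],
              "right": [("middle", 0.9), ("start", 0.1)]},
    "middle": {"left": [("start", 1.0)],
               "right": [("win", 0.8), ("start", 0.2)]},
    "win": {"left": [("win", 1.0)], "right": [("win", 1.0)]},
}


def step(state, action, rng):
    """Sample the next state given the current state and action."""
    next_states, probs = zip(*TRANSITIONS[state][action])
    return rng.choices(next_states, weights=probs, k=1)[0]


def win_rate(policy, horizon=5, episodes=2000, seed=0):
    """Estimate the probability that a policy ends an episode in 'win'."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(episodes):
        state = "start"
        for _ in range(horizon):
            state = step(state, policy[state], rng)
        wins += state == "win"
    return wins / episodes


# A policy is a mapping from states to actions.
always_right = {s: "right" for s in STATES}
always_left = {s: "left" for s in STATES}
```

Comparing `win_rate(always_right)` with `win_rate(always_left)` shows how two policies over the same MDP can differ in their probability of winning. In this sketch the transition table is known, whereas in the learning setting it would have to be inferred from experience.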
Our main performance criterion is regret minimization. Regret measures the gap between the performance of a learner that has to learn a good policy in an unknown environment and that of an agent that knows the environment and simply uses the optimal policy. The goal is for the learner's average regret per time step to vanish as the number of time steps grows. This implies that the learner's average performance is near optimal.
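The regret notion described above can be written down as follows. This is a generic sketch. The symbols are notational assumptions introduced here, not the definitions used in the cited papers: $T$ is the number of time steps, $r_t$ is the reward collected by the learner at step $t$, and $r_t^{\star}$ is the reward collected at step $t$ by an agent following the optimal policy.

```latex
\[
  \mathrm{Regret}(T)
  \;=\;
  \mathbb{E}\!\left[\sum_{t=1}^{T} r_t^{\star}\right]
  \;-\;
  \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right],
  \qquad
  \text{goal:}\quad
  \frac{\mathrm{Regret}(T)}{T} \;\longrightarrow\; 0
  \quad \text{as } T \to \infty .
\]
```

For instance, a bound of order $\sqrt{T}$ on $\mathrm{Regret}(T)$ would give an average regret of order $1/\sqrt{T}$ per time step, which vanishes as $T$ grows.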
Some of the main results achieved thus far are:
(1) We have developed a variety of regret minimization algorithms. One concrete example is the contextual MDP [AAAI 2023a], where in each episode a new user (context) arrives, and the user influences both the rewards and the dynamics of the environment. The goal is a regret bound that is independent of the size of the context class (the number of users), which is potentially huge.
(2) We have addressed the issue of delays in a reinforcement learning environment. In many cases the learner observes the rewards with a delay, and in some cases the delay is unpredictable. We devised algorithms that are able to handle such a challenging delay model, and still achieve a vanishing average regret [AAAI 2022b, NeurIPS 2022b].
(3) In many cases more than one agent is used to learn the environment. We have studied a model in which multiple agents cooperate in order to accelerate the learning process. This cooperation requires coordination between the agents and sharing of information. We developed learning algorithms that achieve this task with near-optimal regret [ICML 2022c].
Our research on societal impact was highly successful.
We have developed efficient privacy-preserving algorithms, using differential privacy, for a variety of tasks such as clustering similar points together [ICML 2021f, ICML 2022d], prediction using linear separators [NeurIPS 2020b], learning an optimal action [NeurIPS 2021a] and more. We addressed fairness between user sub-groups by considering multi-calibration, which requires that, within each sub-group, the average predicted value and the average realized value are similar. We have derived new generalization bounds, bounding the difference between the predicted and realized values, for the multi-calibration framework [NeurIPS 2020c].
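To make the multi-calibration requirement more concrete, the following minimal sketch measures, for each sub-group and each range of predicted values, the gap between the average predicted value and the average realized value. It is an illustrative check only, not the method or the bounds of the cited paper. The function name `calibration_gaps`, the binning scheme, the assumption that predictions lie in [0, 1], the use of disjoint groups, and the toy data are all assumptions made for this example.

```python
from collections import defaultdict


def calibration_gaps(predictions, outcomes, groups, n_bins=2):
    """Return {(group, bin): |mean prediction - mean outcome|}.

    Predictions are assumed to lie in [0, 1] and are split into
    n_bins equal-width bins; groups are treated as disjoint here,
    which is a simplification.
    """
    cells = defaultdict(list)
    for p, y, g in zip(predictions, outcomes, groups):
        b = min(int(p * n_bins), n_bins - 1)
        cells[(g, b)].append((p, y))
    gaps = {}
    for key, pairs in cells.items():
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        mean_y = sum(y for _, y in pairs) / len(pairs)
        gaps[key] = abs(mean_p - mean_y)
    return gaps


# Toy data, made up purely for illustration.
predictions = [0.2, 0.2, 0.8, 0.8, 0.2, 0.2, 0.8, 0.8]
outcomes = [0, 0, 1, 1, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

gaps = calibration_gaps(predictions, outcomes, groups)
```

In this toy data the gaps for group "B" are larger than those for group "A", so a predictor that looks reasonable overall can still be poorly calibrated on one sub-group, which is the situation multi-calibration is meant to rule out.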