CORDIS - EU research results

Plural Reinforcement Learning

Final Report Summary - PLURELEARN (Plural Reinforcement Learning)

The objective of the proposed research is to establish a new paradigm for learning in large-scale, complex dynamic systems under uncertainty. Our goal is to develop algorithms, theory, and applications that use a plurality of learning approaches and models in a synergistic way. To this end, we defined the following specific objectives:

1. Develop a learning approach that combines learning from a teacher and learning by trial and error.
2. Devise a structure discovery methodology for reasoning about uncertainty in high dimensional Markov processes.
3. Develop approaches for algorithm selection and mini-strategies.


We have made good progress on all three specific objectives, which we detail below. All of our research was carried out within the Markov decision process (MDP) formulation, focusing on the reinforcement learning paradigm.
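
For concreteness, the sketch below (written for this summary, not taken from the project's code) shows the standard tabular MDP value-iteration baseline that the methods described next build on; the names P, R and gamma are illustrative.

import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Tabular value iteration for an MDP with S states and A actions.

    P: transition tensor of shape (A, S, S), P[a, s, s2] = Pr(s2 | s, a)
    R: reward matrix of shape (S, A)
    Returns the optimal value function and a greedy policy.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s2 P[a, s, s2] * V[s2]
        Q = R + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new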

Regarding objective 1: we showed in a couple of papers how to use a tutor or expert advice in reinforcement learning algorithms. Specifically, we considered problems where linear constraints representing additional knowledge can be added to the learning algorithm. We also argued that this knowledge may have to be relaxed, and outlined ways to do so. Overall, we developed new algorithms for learning from a plurality of sources and showed that they work in medium-scale applications.
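
One simple way to picture the idea of expert advice as linear constraints is the occupancy-measure linear program of an MDP, augmented with the expert's constraints and slack variables that relax them at a cost. The sketch below only illustrates this idea under assumed names (A_exp, b_exp, penalty); it is not the project's published algorithm.

import numpy as np
from scipy.optimize import linprog

def solve_mdp_with_expert_constraints(P, R, rho, gamma, A_exp, b_exp, penalty=10.0):
    """Sketch: occupancy-measure LP for an MDP with soft expert constraints.

    P: (A, S, S) transitions, R: (S, A) rewards, rho: (S,) start distribution.
    A_exp @ mu <= b_exp encodes the expert's advice as linear constraints on the
    state-action occupancy measure mu; slack variables xi relax the advice when
    it conflicts with the dynamics, at a cost of `penalty` per unit of violation.
    """
    n_actions, n_states, _ = P.shape
    n_sa = n_states * n_actions
    n_exp = A_exp.shape[0]

    # Objective: maximize expected reward, penalize expert-constraint violations.
    c = np.concatenate([-R.reshape(n_sa), penalty * np.ones(n_exp)])

    # Bellman-flow equalities on mu: sum_a mu(s2,a) - gamma * sum_{s,a} P(s2|s,a) mu(s,a) = rho(s2)
    A_eq = np.zeros((n_states, n_sa + n_exp))
    for s2 in range(n_states):
        for s in range(n_states):
            for a in range(n_actions):
                A_eq[s2, s * n_actions + a] = float(s == s2) - gamma * P[a, s, s2]
    b_eq = rho

    # Expert advice as soft inequalities: A_exp @ mu - xi <= b_exp, with xi >= 0.
    A_ub = np.hstack([A_exp, -np.eye(n_exp)])
    b_ub = b_exp

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (n_sa + n_exp), method="highs")
    mu = res.x[:n_sa].reshape(n_states, n_actions)
    policy = mu.argmax(axis=1)   # greedy in the occupancy measure (arbitrary at unvisited states)
    return mu, policy
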
Regarding objective 2: we showed that the problem of structure discovery is much more complex than previously thought. In fact, we showed that classical approaches for determining which of several models fits the data best are bound to fail when the data has dependencies (as in MDPs). We then moved on to developing new approaches that are consistent. Overall, we developed the theoretical and applied aspects of model selection and structure discovery and showed that it is much harder to detect dynamic structure than expected. As a remedy to model uncertainty, we proposed to use robustness to uncertainty and developed two approaches for mitigating risk. The first is based on policy gradients and is geared towards problems where a simulator is available. The second is based on a robust optimization approach whose focus is on coupled uncertainties between states. The two approaches tackle different aspects of the optimization problem, and both facilitate reasoning about uncertainty.
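
The project's robust-optimization work focuses on coupled uncertainties between states; as a simpler illustration of reasoning about model uncertainty, the sketch below performs robust value iteration against per state-action (rectangular) L1 uncertainty balls around an estimated model P_hat. The choice of uncertainty set and all names (P_hat, eps) are assumptions made for illustration, not the published method.

import numpy as np

def worst_case_value(p_hat, V, eps):
    """Inner problem: min_p p @ V over distributions with ||p - p_hat||_1 <= eps."""
    p = p_hat.astype(float).copy()
    worst = int(np.argmin(V))
    add = min(eps / 2.0, 1.0 - p[worst])   # mass shifted onto the lowest-value state
    p[worst] += add
    remaining = add
    for s in np.argsort(V)[::-1]:          # remove that mass from high-value states first
        if s == worst or remaining <= 0:
            continue
        take = min(remaining, p[s])
        p[s] -= take
        remaining -= take
    return float(p @ V)

def robust_value_iteration(P_hat, R, gamma=0.95, eps=0.1, n_iter=500):
    """Robust Bellman backups against the L1 uncertainty ball around P_hat."""
    n_actions, n_states, _ = P_hat.shape
    V = np.zeros(n_states)
    for _ in range(n_iter):
        Q = np.empty((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                Q[s, a] = R[s, a] + gamma * worst_case_value(P_hat[a, s], V, eps)
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)
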
Regarding objective 3: we considered various approaches to selecting mini-strategies (also known as “options” or macro-actions). These are strategies that constitute fragments of the overall policy and whose combination may lead to improved performance. In one line of work, we showed how to modify an option as it runs and thereby generate new and improved options. This process can be viewed as a “model iteration”, since the option model is continuously modified, leading to progressively better options. In another line of work, we considered the problem of option generation. We showed that it is possible to use “randomly generated” options to expedite both planning and learning. The option model improves with time, and (random) options are selected and de-selected continuously. We showed that this new model leads to improved performance in both theory and practice.
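
The following sketch illustrates the flavour of planning with randomly generated options: a few random deterministic policies are turned into discounted multi-step option models, and each value-iteration backup may go through either a primitive action or an option model, which lets value propagate over longer horizons per backup. It is an illustrative example under assumed parameters (number of options, horizon), not the algorithm published by the project.

import numpy as np

def option_model(P, R, option_policy, horizon, gamma):
    """Discounted multi-step model of an option that follows `option_policy`
    (one action per state) for `horizon` steps."""
    n_actions, n_states, _ = P.shape
    P_pi = np.stack([P[option_policy[s], s] for s in range(n_states)])   # (S, S)
    r_pi = np.array([R[s, option_policy[s]] for s in range(n_states)])   # (S,)
    R_o = np.zeros(n_states)
    M = np.eye(n_states)              # (gamma * P_pi)^t, starting at t = 0
    for _ in range(horizon):
        R_o = R_o + M @ r_pi          # accumulate gamma^t * E[r_t]
        M = gamma * (M @ P_pi)
    return R_o, M                     # M is now the discounted k-step transition model

def plan_with_random_options(P, R, gamma=0.95, n_options=10, horizon=5, n_iter=200, seed=0):
    """Value iteration whose backups can also jump via randomly generated options."""
    rng = np.random.default_rng(seed)
    n_actions, n_states, _ = P.shape
    options = []
    for _ in range(n_options):
        pi = rng.integers(n_actions, size=n_states)   # random deterministic policy
        options.append(option_model(P, R, pi, horizon, gamma))

    V = np.zeros(n_states)
    for _ in range(n_iter):
        # Backups over primitive actions ...
        Q_prim = R + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        # ... and over the multi-step option models (R_o + discounted P_o @ V).
        Q_opt = np.stack([R_o + M @ V for (R_o, M) in options], axis=1)
        V = np.maximum(Q_prim.max(axis=1), Q_opt.max(axis=1))
    return V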

To summarize, the funded research resulted in a new framework for planning and learning in data-driven stochastic environments. This approach allows combining a plurality of information sources (data, a simulator, an expert, and a tutor) to learn and plan faster and more accurately. The research opens up opportunities for large-scale optimization of dynamic systems and may have a significant impact on the scale of problems that we can solve.