## Final Report Summary - ESDEMUU (Efficient sequential decision making under uncertainty)

One interesting problem is the design of efficient Bayesian methods for sequential decision making and, in particular, for the reinforcement learning problem. Here, the learning agent must learn to act in an unknown environment, solely via interaction, in order to maximise the total reward obtained during its lifetime. Bayesian reinforcement learning approaches maintain a distribution over the unknown environment parameters, representing the agent's belief about the environment. It is then theoretically possible to compute plans which optimally balance exploration of the environment with the accumulation of reward.
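
As an illustrative sketch (not taken from the work itself), the belief over a discrete environment's transition probabilities can be maintained with a conjugate Dirichlet-multinomial model, where the posterior mean is obtained by simply normalising smoothed transition counts:

```python
import numpy as np

def dirichlet_posterior_mean(counts, prior=1.0):
    """Posterior mean next-state distribution under a Dirichlet(prior) prior:
    add the prior pseudo-counts to the observed counts and normalise."""
    c = np.asarray(counts, dtype=float) + prior
    return c / c.sum()

# Belief for one state-action pair after observing transitions to states 0, 0, 1.
counts = np.zeros(3)
for s_next in [0, 0, 1]:
    counts[s_next] += 1
belief = dirichlet_posterior_mean(counts)  # posterior mean ≈ [0.5, 0.33, 0.17]
```

A full Bayesian RL agent maintains one such belief per state-action pair and plans with respect to the joint posterior.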

One particular difficulty arises when inference over the unknown parameters is hard, i.e. when the posterior distribution cannot be computed in closed form. For that reason, I developed extensions of context trees, with applications to variable-order Markov model estimation and conditional density estimation. These models can be used in conjunction with distributed value functions to perform decision making in reinforcement learning problems.
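
A minimal sketch of the variable-order idea, assuming a simple back-off predictor rather than a full context-tree mixture (all names are illustrative): symbol counts are kept for every context suffix up to a maximum depth, and prediction uses the longest context seen so far with an add-1/2 (Krichevsky-Trofimov) estimate.

```python
from collections import defaultdict

class SuffixPredictor:
    """Simplified variable-order Markov predictor. Full context-tree methods
    mix over all depths; this sketch backs off to the longest seen context."""
    def __init__(self, alphabet_size, max_depth):
        self.k, self.d = alphabet_size, max_depth
        self.counts = defaultdict(lambda: [0.0] * alphabet_size)

    def update(self, history, symbol):
        # count the observed symbol under every suffix of the history
        for l in range(min(self.d, len(history)) + 1):
            ctx = tuple(history[len(history) - l:])
            self.counts[ctx][symbol] += 1

    def predict(self, history):
        # back off from the deepest context to the empty one
        for l in range(min(self.d, len(history)), -1, -1):
            ctx = tuple(history[len(history) - l:])
            if ctx in self.counts:
                c = self.counts[ctx]
                n = sum(c)
                return [(ci + 0.5) / (n + 0.5 * self.k) for ci in c]
        return [1.0 / self.k] * self.k

sp = SuffixPredictor(alphabet_size=2, max_depth=1)
seq = [0, 1, 0, 1, 0, 1]
for t, sym in enumerate(seq):
    sp.update(seq[:t], sym)
p = sp.predict(seq)  # last symbol is 1, so 0 should be likely next
```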

Given a particular inference mechanism, it is theoretically possible to find a Bayes-optimal policy for any sequential decision problem, including reinforcement learning. However, many practical hurdles make this infeasible. One of my early ideas was to perform backward induction over a cleverly expanded state-belief tree. Such methods can be made more efficient with better upper and lower bounds on the value of each tree node, which I explored in earlier work. Recently, I worked on an improvement of those bounds and showed that, without any tree search, better bounds can be sufficient to obtain robust policies. In current work, I have extended these methods to gradient-based approaches, which have much lower computational complexity. It remains an open question under which conditions these gradient-based methods perform better than methods based purely on confidence bounds, or ones that employ variants of upper-confidence-bound tree search (UCT).
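
The bound-propagation idea can be sketched on a toy belief tree (a hypothetical representation: a dict from actions to a reward and probability-weighted children). Unexpanded leaves get the crude bounds r_min/(1-γ) and r_max/(1-γ); backward induction then propagates an interval for the optimal value to the root, and tighter leaf bounds shrink that interval without deeper search.

```python
def value_bounds(node, gamma, r_min, r_max):
    """Backward induction over a belief tree, propagating lower and upper
    bounds on the optimal value. `node` is None (unexpanded leaf) or a dict
    mapping action -> (reward, [(prob, child), ...])."""
    if node is None:
        return r_min / (1 - gamma), r_max / (1 - gamma)
    lo_best, hi_best = -float("inf"), -float("inf")
    for reward, successors in node.values():
        lo, hi = reward, reward
        for prob, child in successors:
            c_lo, c_hi = value_bounds(child, gamma, r_min, r_max)
            lo += gamma * prob * c_lo
            hi += gamma * prob * c_hi
        lo_best, hi_best = max(lo_best, lo), max(hi_best, hi)
    return lo_best, hi_best

# One-step tree: action "a" pays 1, action "b" pays 0; both lead to leaves.
tree = {"a": (1.0, [(1.0, None)]), "b": (0.0, [(1.0, None)])}
lo, hi = value_bounds(tree, gamma=0.5, r_min=0.0, r_max=1.0)  # interval [1, 2]
```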

Inverse reinforcement learning is another sequential decision making problem. It involves learning how to act from demonstrations, or inferring the preferences of some other agent that acts within the environment. My recent work with C. Rothkopf demonstrated that a principled generalisation of Bayesian inverse reinforcement learning can produce state-of-the-art results in many problems. More recently, we extended this method to the previously unconsidered problem of learning from multiple teachers with different preferences, to unknown environments and to stochastic games. There are potentially many applications, including the modelling of group and individual preferences in advertising and in the social sciences.
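
A toy, single-state sketch of the Bayesian inverse RL idea, assuming a softmax (Boltzmann) demonstrator model: each candidate reward vector is scored by the likelihood it assigns to the demonstrated actions, and Bayes' rule gives a posterior over candidates. (The actual method works with Q-values computed under each candidate reward; here the per-action reward stands in for Q.)

```python
import numpy as np

def birl_posterior(candidates, demos, beta=2.0):
    """Posterior over candidate reward vectors given demonstrated actions,
    under a softmax demonstrator: P(a | r) ∝ exp(beta * r[a])."""
    post = np.ones(len(candidates))
    for i, r in enumerate(candidates):
        logits = beta * np.asarray(r, dtype=float)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        for a in demos:
            post[i] *= p[a]
    return post / post.sum()

# Two candidate preferences; the teacher repeatedly demonstrates action 0.
post = birl_posterior([[1.0, 0.0], [0.0, 1.0]], demos=[0, 0, 0])
```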

In my work with B. Ottens and B. Faltings at EPFL, we examined the problem of finding an optimal policy for a group of communicating agents with only limited communication, where the utility function is additive but some solutions are infeasible (a distributed constraint optimisation problem, or DCOP). We developed randomised and confidence-bound-based algorithms that can solve such problems efficiently. Future work may involve extending this to more structured policies.
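
As a generic illustration of the confidence-bound ingredient (not the actual DCOP algorithm), a UCB1-style rule selects the candidate assignment with the highest optimistic utility bound, trading off the empirical mean against how rarely an option has been sampled:

```python
import math

def ucb1_select(counts, means, t, c=2.0):
    """Pick the option with the highest upper confidence bound on utility."""
    for a in range(len(counts)):       # sample every option once first
        if counts[a] == 0:
            return a
    bounds = [means[a] + math.sqrt(c * math.log(t) / counts[a])
              for a in range(len(counts))]
    return bounds.index(max(bounds))

# Toy usage: two candidate assignments with deterministic utilities 1.0 and 0.0.
counts, means, utils = [0, 0], [0.0, 0.0], [1.0, 0.0]
for t in range(1, 101):
    a = ucb1_select(counts, means, t)
    counts[a] += 1
    means[a] += (utils[a] - means[a]) / counts[a]
```

After 100 rounds the better assignment dominates the sampling budget, while the worse one is still probed occasionally.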

Many problems in reinforcement learning are continuous, partially observable, or both. In recent work, we examined Bayesian approaches for linear and piecewise-linear models of the system dynamics, obtaining good experimental results.
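
A sketch of the linear building block, assuming a conjugate Gaussian model: with a Gaussian prior on the entries of a dynamics matrix A (in s' ≈ A s) and Gaussian observation noise, the posterior mean has a closed form. A piecewise-linear model would fit one such posterior per region of the state space.

```python
import numpy as np

def dynamics_posterior_mean(X, Y, alpha=1.0, noise=0.1):
    """Posterior mean of A in s' ≈ A s, with prior precision `alpha` on A's
    entries and observation noise variance `noise`. X holds states (rows),
    Y the corresponding next states."""
    d = X.shape[1]
    precision = alpha * np.eye(d) + X.T @ X / noise
    return np.linalg.solve(precision, X.T @ Y / noise).T

rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
X = rng.standard_normal((200, 2))
Y = X @ A_true.T                 # noise-free transitions, for the sketch
A_hat = dynamics_posterior_mean(X, Y)
```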

A novel result that is widely applicable to continuous, partially observable and multi-agent problems is the introduction of approximate Bayesian computation (ABC) methods for likelihood-free inference into reinforcement learning.
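
The core of ABC rejection sampling, shown in a hypothetical Bernoulli-reward environment (all names illustrative): parameter draws from the prior are kept whenever their simulated statistics land near the observed ones, so the likelihood is never evaluated, only the simulator is run.

```python
import random

def abc_posterior(prior_sample, simulate, observed, eps, n_accept):
    """ABC rejection sampling: accept parameter draws whose simulated
    statistic is within eps of the observed statistic."""
    accepted = []
    while len(accepted) < n_accept:
        theta = prior_sample()
        if abs(simulate(theta) - observed) <= eps:
            accepted.append(theta)
    return accepted

random.seed(0)
# Hypothetical environment: Bernoulli(p) rewards; statistic = mean of 100 draws.
sim = lambda p: sum(random.random() < p for _ in range(100)) / 100
samples = abc_posterior(lambda: random.random(), sim,
                        observed=0.8, eps=0.05, n_accept=50)
```

The accepted samples concentrate around the parameter value consistent with the observed statistic.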

Finally, we are frequently faced with many difficult, but similar, problems. In recent work, I propose a novel framework to model this: sparse reward processes, where a learning agent is placed in an unknown environment and faced with a series of adversarially selected goals. This problem is sparse in two ways. Firstly, the agent may not receive a new goal immediately after its current goal is attained. Secondly, the optimal adversary policy implies that most environmental states will have zero reward. This relates to my earlier work with M. Lagoudakis on rollout sampling, where we examined methods to efficiently find good agent policies in large reinforcement learning problems.
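
The rollout sampling ingredient can be sketched as a Monte-Carlo estimate of an action value (an illustration, not the paper's exact algorithm): take the candidate action, then follow a fixed policy for a horizon of steps, and average the discounted returns over several simulated trajectories.

```python
def rollout_value(step, state, action, policy, gamma, horizon, n_rollouts):
    """Rollout estimate of Q(state, action). `step(s, a)` returns (s', r);
    `policy(s)` returns the action the rollout policy takes in s."""
    total = 0.0
    for _ in range(n_rollouts):
        s, a, ret, disc = state, action, 0.0, 1.0
        for _ in range(horizon):
            s, r = step(s, a)
            ret += disc * r
            disc *= gamma
            a = policy(s)
        total += ret
    return total / n_rollouts

# Degenerate check: every step pays 1, so the 3-step return is 1 + γ + γ².
q = rollout_value(lambda s, a: (s, 1.0), state=0, action=0,
                  policy=lambda s: 0, gamma=0.5, horizon=3, n_rollouts=4)
```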

The main application considered is security. In our latest work with A. Mitrokotsa and S. Vaudenay, we obtained loss bounds for a class of cryptographic authentication problems where the channel is constrained: sending messages is expensive and unreliable, while different errors in authentication decisions carry different costs. This setting is not usually considered in cryptography, where a perfect channel model is used and the only quantity of interest is the complexity required to achieve a particular level of security.
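
The cost-sensitive decision ingredient can be illustrated with a minimal expected-cost rule (a sketch, not the bounds from the paper): given a posterior probability that the prover is legitimate, accepting risks the false-accept cost and rejecting risks the false-reject cost, which yields a simple acceptance threshold.

```python
def accept(p_legit, cost_fa, cost_fr):
    """Accept iff the expected cost of accepting, (1 - p) * cost_fa, is no
    larger than the expected cost of rejecting, p * cost_fr. Equivalently,
    accept when p_legit >= cost_fa / (cost_fa + cost_fr)."""
    return p_legit * cost_fr >= (1 - p_legit) * cost_fa
```

With a false accept ten times as costly as a false reject, a 50% posterior leads to rejection, while a 95% posterior clears the threshold of 10/11.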
