CORDIS - EU research results

Multimodal Agents Grounded via Interactive Communication

Periodic Reporting for period 1 - MAGIC (Multimodal Agents Grounded via Interactive Communication)

Reporting period: 2018-11-01 to 2020-10-31

In recent decades, researchers have made great progress in understanding dialogue, which has recently led to the deployment of commercial speech-controlled personal assistants. Machine learning has been central to this progress: it allows dialogue models to be learned directly from data, drastically reducing the amount of human intervention needed.

In particular, modern conversational models are based on neural networks, computing systems that are at the centre of significant breakthroughs in many areas of artificial intelligence, such as computer vision, speech processing, and computational linguistics. Given the strong increase in computing power and the massive amount of data at our disposal nowadays, end-to-end methods have also led to improvements in dialogue, because they allow models to be trained on vast amounts of human conversations with very little manual intervention. Yet, a major problem of current systems is that they are trained entirely via passive supervision: they are exposed to a large quantity of human dialogues and asked to reproduce them, a type of learning that neglects the functional aspects of communication. The resulting agents are capable of holding chit-chat conversations, but they do not use language to accomplish anything or work towards a desired situation. Humans, instead, use language to coordinate with each other and to accomplish tasks in the world, which pressures them to provide coherent and meaningful responses. Furthermore, it means that there is a strong attachment between human language and the world. In other words, our language is grounded in the real world, which is often not the case for an artificial dialogue agent trained with supervision. Such an agent is much like the character trapped in John Searle's Chinese Room, who compares incoming text against a dictionary. Arguably, even though it can successfully pass on the incoming messages, it has no idea what the text refers to, as it has never left the room to interact with the world the text describes.

This research project has the ambitious aim of resolving these limitations by introducing a multimodal learning framework in which multiple agents must cooperate via communication in order to achieve a goal. The framework I propose introduces three important innovations which together provide a significant step forward in training agents that can work side by side with humans using natural language grounded in visual perception. First, by introducing a cooperative multimodal game, it proposes a shift from traditional static and passive machine learning approaches for language understanding to a dynamic setting, where agents co-exist and interact with each other, developing a common ground that helps them to communicate. Second, thanks to the specific design of the game, my proposal addresses the language inconsistencies from which dialogue agents trained with other methods suffer. To the best of my knowledge, this is the first time that a learning framework explicitly tries to overcome the issue of linguistic incongruities in artificial multimodal dialogue. The third important strength of the proposal is that it offers a concrete method for grounding dialogue in the visual configuration of the world, making agents agree on how to refer to objects and their attributes.


My research programme aims to resolve all these problems at once using reinforcement learning. More specifically, I pursued the following three research objectives (RO1 - RO3):

RO1: Train agents to engage in a cooperative and symmetric linguistic interaction that helps them keep track of their common ground, i.e. their prior dialogue history and the partner-specific conventions they have established.

RO2: Direct agents’ dialogue learning towards the accomplishment of a goal that is specifically designed to encourage meaningful and coherent conversations.

RO3: Ground the agents’ dialogue in the external visual world, where the agents must agree on how to refer to objects and their attributes.

During the project, I was able to accomplish all my objectives and also go beyond them. I will first discuss each of the three research objectives planned in the research proposal and then discuss the additional research objectives I managed to tackle.

RO1, linguistic interaction. I addressed this objective by framing the learning within a multi-agent communication setting, where agents progress in their training by interacting with each other via language and are rewarded if the communication is successful. To succeed in the game, the agents were encouraged to agree on how to refer to the visual content, developing shared names and attributes for objects in the pictures.

RO2, goal accomplishment. To address this research objective, I added specific goals that agents had to accomplish by communicating with each other. This helped the agents develop more coherent communication protocols and discouraged the incongruities that are very common for dialogue systems trained via static supervision. In particular, I addressed such inconsistencies by using the external visual world to constrain language within consistent boundaries and ensure coherent communication.

RO3, visual grounding. I encouraged agents to ground their language in the external world by making them communicate with each other using a language tied to their perception of the visual world.
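
The structure of such a referential game can be illustrated with a minimal sketch. The speaker, listener, object attributes, and reward scheme below are hypothetical stand-ins chosen for illustration, not the project's actual neural agents: in the real setup both roles are learned with reinforcement learning, whereas here they are rule-based so that only the game's interface is shown.

```python
import random

def speaker(target):
    """Emit a message about the target: here, one distinguishing attribute.

    A learned speaker would map the target image to discrete symbols;
    this rule-based stand-in simply names the target's colour.
    """
    return ("colour", target["colour"])

def listener(message, candidates):
    """Pick the index of the candidate that matches the message."""
    attribute, value = message
    for i, obj in enumerate(candidates):
        if obj[attribute] == value:
            return i
    # If nothing matches, guess: a learned listener would also act
    # under uncertainty early in training.
    return random.randrange(len(candidates))

def play_round(candidates, target_index):
    """One round of the game; both agents share the resulting reward."""
    message = speaker(candidates[target_index])
    guess = listener(message, candidates)
    return 1 if guess == target_index else 0

# Hypothetical visual scene: two objects described by simple attributes.
objects = [{"colour": "red", "shape": "cube"},
           {"colour": "blue", "shape": "sphere"}]
print(play_round(objects, target_index=1))  # 1: "blue" identifies the target
```

In training, the shared reward signal replaces per-utterance supervision: the agents are never told which words to use, only whether the listener identified the right object, which is what pushes them to agree on grounded names for objects and attributes.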

Modelling natural language dialogue is a key step in the development of intelligent agents that can communicate with humans. Such agents have the potential to assist humans in various ways, for example by helping them extract information from ever-growing digital content, or by acting as an interface between a human and a machine. In this project I have developed artificial dialogue agents that can communicate in a cooperative way, using compositional language that is grounded in visual perception. In addition, in later stages, my research is likely to benefit European society and the economy, given the connections of my project to practical applications. Language is the most natural means of communication for humans, and this project makes progress in allowing people to talk to computers, thus (1) making daily operations, such as scheduling an appointment with the doctor (with the aid of a dialogue assistant), easier for everybody; (2) addressing the digital divide by making interaction with computers more accessible; and (3) opening market opportunities for Europe in the development of automatic dialogue assistants in fields such as education, healthcare, and e-commerce.
Cartoon of the multi-agent referential game.