Periodic Reporting for period 1 - MAGIC (Multimodal Agents Grounded via Interactive Communication)
Reporting period: 2018-11-01 to 2020-10-31
In particular, modern conversational models are based on neural networks, computing systems that are at the centre of significant breakthroughs in many areas of artificial intelligence, such as computer vision, speech processing, and computational linguistics. Given the strong increase in computing power and the massive amount of data at our disposal nowadays, end-to-end methods have also led to improvements in dialogue, because they allow models to be trained on vast amounts of human conversations with very little manual intervention. Yet, a major problem of current systems is that they are trained entirely via passive supervision: they are exposed to a large quantity of human dialogues and asked to reproduce them, a type of learning that neglects the functional aspects of communication. The resulting agents are capable of entertaining chit-chat conversations, but they do not use language to accomplish anything or to work towards a desired situation. Humans, instead, use language to coordinate with each other and to accomplish tasks in the world, which pressures them to provide coherent and meaningful responses. It also means that there is a tight coupling between human language and the world: our language is grounded in the real world, which is often not the case for an artificial dialogue agent trained with supervision. Such an agent is much like the character trapped in John Searle’s Chinese Room, who compares incoming text against a dictionary. Arguably, even though it can successfully pass on the incoming messages, it has no idea what the text refers to, as it has never left the room to interact with the world the text describes.
This research project has the ambitious aim of resolving these limitations by introducing a multimodal learning framework in which multiple agents must cooperate via communication in order to achieve a goal. The framework I propose introduces three important innovations which together provide a significant step forward in training agents that can work side by side with humans, using natural language that is grounded in visual perception. First, by introducing a cooperative multimodal game, it proposes a shift from traditional static and passive machine learning approaches to language understanding towards a dynamic setting, in which agents co-exist and interact with each other, developing a common ground that helps them to communicate. Second, thanks to the specific design of the game, my proposal addresses the language inconsistencies from which dialogue agents trained with other methods suffer. To the best of my knowledge, this is the first learning framework that explicitly tries to overcome the issue of linguistic incongruities in artificial multimodal dialogue. The third important strength of the proposal is that it offers a concrete method for grounding dialogue in the visual configuration of the world, making agents agree on how to refer to objects and their attributes.
My research program has the ambitious aim of resolving all these problems at once using reinforcement learning. More specifically, I had the following three research objectives (RO1 – RO3):
RO1: Train agents to entertain a cooperative and symmetric linguistic interaction that helps them to keep track of their common ground, i.e. their prior dialogue history and the partner-specific conventions they have established.
RO2: Direct agents’ dialogue learning towards the accomplishment of a goal that is specifically designed to encourage meaningful and coherent conversations.
RO3: Ground the agents’ dialogue in the external visual world, where the agents must agree on how to refer to objects and their attributes.
RO1, linguistic interaction. I addressed this objective by framing the learning within a multi-agent communication setting, where agents progress in their training by interacting with each other via language and are rewarded if the communication is successful. To succeed in the game, the agents were encouraged to agree on how to refer to the visual content, developing shared names and attributes for objects in the pictures.
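The reward structure of such a setting can be illustrated with a toy referential (signalling) game. The sketch below is purely illustrative and rests on assumptions of my own: a tabular setting with a handful of abstract objects and symbols, not the project's actual neural models operating on real images. A sender observes a target object and emits a symbol; a receiver guesses which object was meant; both are rewarded only when communication succeeds, and a simple REINFORCE-style update nudges them towards a shared naming convention.

```python
import math
import random

random.seed(0)

N_OBJECTS = 5  # distinct "visual" objects in the toy world (hypothetical)
N_SYMBOLS = 5  # vocabulary size available to the sender
LR = 0.5       # learning rate for the tabular policy-gradient update

# Tabular policies: sender maps objects to symbol logits,
# receiver maps symbols to object logits.
sender = [[0.0] * N_SYMBOLS for _ in range(N_OBJECTS)]
receiver = [[0.0] * N_OBJECTS for _ in range(N_SYMBOLS)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

def play_round(update=True):
    target = random.randrange(N_OBJECTS)
    s_probs = softmax(sender[target])
    symbol = sample(s_probs)           # sender names the target
    r_probs = softmax(receiver[symbol])
    guess = sample(r_probs)            # receiver resolves the name
    reward = 1.0 if guess == target else 0.0
    if update:
        # REINFORCE without baseline: only successful rounds
        # (reward 1) reinforce the sampled symbol and guess.
        for k in range(N_SYMBOLS):
            sender[target][k] += LR * reward * ((k == symbol) - s_probs[k])
        for k in range(N_OBJECTS):
            receiver[symbol][k] += LR * reward * ((k == guess) - r_probs[k])
    return reward

for _ in range(20000):
    play_round()

accuracy = sum(play_round(update=False) for _ in range(1000)) / 1000
print(f"communication success rate: {accuracy:.2f}")
```

On most runs the two agents converge on a near-one-to-one mapping between objects and symbols, i.e. a rudimentary shared lexicon, which is the mechanism the reward in RO1 is designed to elicit.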
RO2, goal accomplishment. To address this research goal, I added specific goals that agents had to accomplish by communicating with each other. This helped the agents develop more coherent communication protocols and discouraged the incongruities that are very common in dialogue systems trained via static supervision. In particular, I addressed such inconsistencies by using the external visual world to constrain language within consistent boundaries and ensure coherent communication.
RO3, visual grounding. I encouraged agents to ground their language in the external world by making them communicate with each other using a language that is tied to their perception of the visual world.
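To make the grounding aspect concrete, the toy sketch below extends the signalling idea to objects described by visual attributes (here, a hypothetical colour and shape). The message carries one slot per attribute, and the pair is rewarded only when the receiver reconstructs both attributes, so success requires the agents to agree on names for the attribute values themselves. This is a deliberately simplified setup of my own construction, not the project's actual architecture.

```python
import math
import random

random.seed(1)

N_COLOURS, N_SHAPES = 3, 3  # toy "visual" attribute values (hypothetical)
VOCAB = 4                   # symbols available per message slot
LR = 0.5                    # learning rate

# One tabular policy per message slot: slot 0 names the colour,
# slot 1 names the shape.
send = [[[0.0] * VOCAB for _ in range(n)] for n in (N_COLOURS, N_SHAPES)]
recv = [[[0.0] * n for _ in range(VOCAB)] for n in (N_COLOURS, N_SHAPES)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

def play_round(update=True):
    obj = (random.randrange(N_COLOURS), random.randrange(N_SHAPES))
    trace, guess = [], []
    for slot, value in enumerate(obj):
        s_probs = softmax(send[slot][value])
        symbol = sample(s_probs)            # name this attribute value
        r_probs = softmax(recv[slot][symbol])
        g = sample(r_probs)                 # decode the attribute value
        guess.append(g)
        trace.append((slot, value, s_probs, symbol, r_probs, g))
    reward = 1.0 if tuple(guess) == obj else 0.0
    if update and reward:
        # REINFORCE on success only: both attributes must be named
        # and decoded correctly for any learning to happen.
        for slot, value, s_probs, symbol, r_probs, g in trace:
            for k in range(VOCAB):
                send[slot][value][k] += LR * ((k == symbol) - s_probs[k])
            for k in range(len(r_probs)):
                recv[slot][symbol][k] += LR * ((k == g) - r_probs[k])
    return reward

for _ in range(30000):
    play_round()

success = sum(play_round(update=False) for _ in range(1000)) / 1000
print(f"attribute-naming success rate: {success:.2f}")
```

Because the reward is tied to the object's attributes rather than to the symbols themselves, any protocol the agents settle on is, by construction, grounded in the (toy) visual configuration of the world, which is the property RO3 aims at.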