Community Research and Development Information Service - CORDIS

H2020

ARIA-VALUSPA Report Summary

Project ID: 645378
Funded under: H2020-EU.2.1.1.4.

Periodic Reporting for period 1 - ARIA-VALUSPA (Artificial Retrieval of Information Assistants - Virtual Agents with Linguistic Understanding, Social skills, and Personalised Aspects)

Reporting period: 2015-01-01 to 2016-06-30

Summary of the context and overall objectives of the project

The ARIA-VALUSPA project is creating a disruptive new framework that will allow easy creation of Affective Retrieval of Information Assistants (ARIA agents) capable of holding multi-modal social interactions in challenging and unexpected situations. The system can generate search queries and return the requested information by interacting with humans through virtual characters. These virtual humans will be able to sustain an interaction with a user for some time, and react appropriately to the user's verbal and non-verbal behaviour when presenting the requested information and refining search results. Using audio and video signals from consumer-grade devices as input, both verbal and non-verbal components of human communication are captured. A sophisticated dialogue management system will decide how to respond to a user's input, be it a spoken sentence, a head nod, or a smile. The ARIA uses specially designed speech synthesisers to create emotionally coloured speech and a fully expressive 3D face to render the chosen response. Back-channelling to indicate that the ARIA is still listening or has understood what the user meant, or returning a smile to encourage the user to continue, are but a few of the many ways in which it will employ emotionally coloured social signals to improve communication.
As part of the project, the consortium will develop two specific implementations of ARIAs for two different industrial applications. A ‘speaking book’ application will create an ARIA with a rich personality capturing the essence of a novel, so the user can ask questions about anything related to the novel. Secondly, an ARIA scenario proposed in a business case by one of the Industry Associates at the end of year one will be implemented. Both applications will have provable commercial value, either to our Industry Partners or Industry Associates.

The ARIA-VALUSPA project builds on the capacities of existing Virtual Humans developed by the consortium partners and/or made publicly available by other researchers, but will greatly enhance them. The assistants will be able to handle unexpected situations and environmental conditions, and will add self-adaptation, learning, European multilingual skills, and extended dialogue abilities to a multimodal dialogue system. ARIA-VALUSPA prototypes will be developed during the project supporting the three European languages English, French, and German to increase the variety and number of potential user groups. The framework to be developed will be suitable for multiple platforms, including desktop PCs, laptops, tablets, and ultimately smartphones. The ARIAs will be able to be displayed and operate in a web browser.

The ARIAs will have to deal with unexpected situations that occur during the course of an interaction. Interruptions by the user, unexpected task switching by the user, or a change in who is communicating with the agent (e.g. when a second user joins in the conversation) will require the agent to either interrupt its behaviour, execute a repair behaviour, re-plan mid- and long-term actions, or even adapt on the fly to the behaviour of its interlocutor.

Work performed from the beginning of the project to the end of the period covered by the report and main results achieved so far

In the first 18 months of the project we delivered the ARIA Framework Milestone 1 on time - meaning that a fully integrated virtual human framework was designed, implemented, and delivered in 12 months. The ARIA Framework is a modular architecture with three major blocks running as independent binaries whilst communicating using ActiveMQ. Each block is in turn modular at a source-code level. The Input block processes audio and video to analyse the user's expressive and interactive behaviour and does speech recognition. The Core Agent block maintains the agent's information state, including its goals and world representation. It is responsible for making queries to its domain-knowledge database to answer questions. Once all goals and states are taken into account it decides on what agent behaviour should be generated. The Output block generates the agent behaviour, that is, it synthesises speech and visual appearance of the virtual human. The ARIA Framework makes use of communication and representation standards wherever possible. For example, by adhering to FML we are able to plug in two different visual behaviour generators, either CNRS' Greta or Cantoche's Living Actor technology.
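The three-block message flow described above can be sketched, purely illustratively, with an in-process stand-in for the ActiveMQ broker; the topic names, payload fields, and handler functions below are invented for illustration and are not the ARIA Framework's actual interfaces:

```python
# Minimal sketch of the ARIA Framework's three-block, message-passing
# architecture. A real deployment runs each block as a separate binary
# communicating over ActiveMQ; here a toy in-process Broker stands in.
class Broker:
    """Stand-in for a publish/subscribe message broker such as ActiveMQ."""
    def __init__(self):
        self.subscribers = {}

    def subscribe(self, topic, handler):
        self.subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic, message):
        for handler in self.subscribers.get(topic, []):
            handler(message)

broker = Broker()
log = []

# Input block: analyses user behaviour and publishes events.
def input_block(user_utterance):
    broker.publish("input.events", {"type": "speech", "text": user_utterance})

# Core Agent block: updates its information state and decides on a response.
def core_agent(event):
    if event["type"] == "speech":
        broker.publish("agent.behaviour",
                       {"intent": "answer", "query": event["text"]})

# Output block: turns the chosen behaviour into speech/animation commands.
def output_block(behaviour):
    log.append(f"synthesise: {behaviour['intent']} for '{behaviour['query']}'")

broker.subscribe("input.events", core_agent)
broker.subscribe("agent.behaviour", output_block)

input_block("Who is the Cheshire Cat?")
print(log[0])  # the Output block received the Core Agent's decision
```

The point of the sketch is the decoupling: because each block only sees broker messages, a component behind a topic can be swapped without touching the others, which is how adherence to standards such as FML lets either Greta or Living Actor be plugged in as the visual behaviour generator.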

Based on the ARIA Framework we delivered the first HTML5-based Proof of Concept of the Book-ARIA, a unilingual (English), text-input version that can read passages from Alice in Wonderland. Its main limitation is that its 2D animations are pre-generated from 3D models. Full streaming of audio-visual input to the behaviour analysers, and output of arbitrarily generated behaviour, is immediate future work we wish to address.

An important contribution in the first period of the ARIA-VALUSPA project is the NoXi database of mediated Novice-Expert interactions. It consists of 83 dyads recorded in 3 locations (Paris, Nottingham, and Augsburg) spoken in 7 languages (English, French, German, Spanish, Indonesian, Arabic, and Italian). The aim of the endeavour was to collect data to study how humans exchange knowledge in a setting that is as close as possible to the intended human-agent setting of the project. Therefore, the interactions were mediated using large screens, cameras, and microphones. Expert/Novice pairs discussed 58 widely different topics, and an initial analysis of these interactions has already led to a design for the flow of the dialogue between the user and the ARIAs. In addition to information exchange, the dataset was used to collect data to let our agents learn how to classify and deal with 7 different types of interruptions. In total we recorded over 22 hours of synchronised audio, video, and depth data recordings. Efforts are currently ongoing to add semi-automatic annotations to this data.

The ARIA Framework's Input block includes state of the art behaviour sensing, many components of which have been specially developed as part of the project. From audio, we can recognise gender, age, emotion, speech activity, and turn-taking, and a separate module provides speech recognition, available for the three languages targeted by the project. From video, we have implemented face recognition, emotion recognition, detailed face and facial point localisation, and head pose estimation.
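As a purely illustrative sketch of how independent audio and video analysers in the Input block might attach per-frame predictions to a shared record: the report names the analyser types (voice activity, emotion, face, head pose) but not their APIs, so every module name, field, and placeholder value below is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class FrameAnalysis:
    """Hypothetical per-frame container for the Input block's predictions."""
    timestamp_ms: int
    audio: dict = field(default_factory=dict)
    video: dict = field(default_factory=dict)

def analyse_audio(frame, samples):
    # Stand-ins for the audio analysers named in the report: a trivial
    # energy threshold for voice activity, a fixed placeholder for emotion.
    frame.audio["voice_activity"] = any(abs(s) > 0.1 for s in samples)
    frame.audio["emotion"] = "neutral"  # placeholder prediction
    return frame

def analyse_video(frame, face_detected):
    # Stand-ins for the video analysers: face presence and head pose.
    frame.video["face_present"] = face_detected
    if face_detected:
        frame.video["head_pose"] = {"yaw": 0.0, "pitch": 0.0}  # placeholder
    return frame

frame = FrameAnalysis(timestamp_ms=40)
analyse_audio(frame, samples=[0.0, 0.2, -0.3])
analyse_video(frame, face_detected=True)
print(frame.audio["voice_activity"], frame.video["face_present"])
```

Keeping each analyser as an independent function over a shared frame record mirrors the modularity of the real Input block, where individual sensing components can be improved or replaced in isolation.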

We attained state of the art results by embracing the hugely popular and successful Deep Learning approach to Machine Learning, but doing so in a smart manner. A combination of Deep Learning, Cooperative/Transfer/Active Learning, state of the art sub-systems such as facial point localisation and voice activity detection, state of the art databases, and the highest level of expertise in the behaviour analysis domain has resulted in novel systems that go well beyond the previous state of the art in terms of accuracy and speed.

To prepare for improved behaviour generation, two studies of dyadic behaviour between humans have been conducted, one focusing on Dyadic Interruption Classification and the other on the Expression of Interpersonal Attitudes. These were first carried out on existing databases and are now being extended to the NoXi database.

The project has delivered a completely reworked Integrated Speech and Gesture Behaviour Generation system. Instructed by a novel parallel-focus Dialogue Manager architecture, it makes use of behaviour generation markup standards, which allows us to visualise the behaviour with either Greta or Living Actor. Both technologies deliver synchronised speech and face synthesis, and aim to include ever more accurately timed reactive behaviour.

In terms of impact, we have reached a high-impact agreement with a major multinational company which will be the sponsor of the Industry ARIA. In terms of academic impact, the consortium has published 54 peer-reviewed, open-access publications as part of the project, an average of 3 publications per month. Of these, 17 are joint public/private publications.

Progress beyond the state of the art and expected potential impact (including the socio-economic impact and the wider societal implications of the project so far)

The NoXi database was recorded with the aim of being widely useful, beyond the direct goals and aims of the ARIA-VALUSPA project. For example, we focused on information exchange in general, not just on information exchange about Alice in Wonderland. Another example is the inclusion of depth recordings made with a Kinect: within the project these serve to automatically generate annotations, and although the project will not otherwise use depth information, other researchers will probably find it useful. Given these design choices, the large size of the database, and the large annotation effort going into it, we are confident that it will be adopted by a large number of researchers.

Related information

Record Number: 194861 / Last updated on: 2017-02-16