Artificial Retrieval of Information Assistants - Virtual Agents with Linguistic Understanding, Social skills, and Personalised Aspects

Periodic Reporting for period 2 - ARIA-VALUSPA (Artificial Retrieval of Information Assistants - Virtual Agents with Linguistic Understanding, Social skills, and Personalised Aspects)

Reporting period: 2016-07-01 to 2017-12-31

The ARIA-VALUSPA project is creating a disruptive new framework that will allow easy creation of Affective Retrieval of Information Assistants (ARIA agents) capable of holding multi-modal social interactions in challenging and unexpected situations. The system can generate search queries and return the information requested. The virtual humans will be able to sustain an interaction with a user for some time and react appropriately to the user's verbal and non-verbal behaviour. Using audio and video signals from consumer-grade devices as input, both verbal and non-verbal components of human communication are captured. A sophisticated dialogue management system will decide how to respond to a user's input, be it a spoken sentence, a head nod, or a smile. The ARIA uses specially designed speech synthesisers to create emotionally coloured speech and a fully expressive 3D face to render the chosen response. Back-channelling, such as indicating that the ARIA is still listening or has understood what the user meant, is just one of the many ways in which it will employ emotionally coloured social signals to improve communication.

As part of the project, the consortium will develop two specific implementations of ARIAs for two different industrial applications. A ‘speaking book’ application will create an ARIA with a rich personality capturing the essence of a novel, so the user can ask questions about anything related to the novel. Secondly, an ARIA scenario will be implemented based on a business case proposed by one of the Industry Associates at the end of year one. Both applications will have provable commercial value, either to our Industry Partners or Industry Associates. ARIA-VALUSPA prototypes will be developed during the project supporting three European languages, English, French, and German, to increase the variety and number of potential user groups. The framework to be developed will be suitable for multiple platforms, including desktop PCs, laptops, tablets, and ultimately smartphones. The ARIAs will be able to be displayed and operated in a web browser.

The ARIAs will have to deal with unexpected situations that occur during the course of an interaction. Interruptions by the user, unexpected task switching by the user, or a change in who is communicating with the agent (e.g. when a second user joins in the conversation) will require the agent to interrupt its behaviour, execute a repair behaviour, re-plan mid- and long-term actions, or even adapt on the fly to the behaviour of its interlocutor.
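The halt / repair / re-plan choices described above can be pictured as a small state machine inside the dialogue manager. The sketch below is purely illustrative: the state names, event strings, and actions are invented for this example and do not reflect the actual AVP dialogue manager implementation.

```python
from enum import Enum, auto

class AgentState(Enum):
    SPEAKING = auto()
    LISTENING = auto()
    REPAIRING = auto()

class DialogueManager:
    """Toy manager showing the interrupt / repair / re-plan reactions."""

    def __init__(self):
        self.state = AgentState.LISTENING
        self.actions = []  # trace of what the agent decided to do

    def start_speaking(self):
        self.state = AgentState.SPEAKING

    def on_event(self, event):
        if event == "user_interrupts" and self.state is AgentState.SPEAKING:
            # Halt the current utterance and execute a repair behaviour.
            self.state = AgentState.REPAIRING
            self.actions.append("halt_and_repair")
        elif event == "repair_done":
            self.state = AgentState.LISTENING
            self.actions.append("resume_listening")
        elif event == "second_user_joins":
            # A change of interlocutor forces a re-plan of mid-term actions.
            self.actions.append("replan_for_new_user")
```

A real dialogue manager would of course track far richer context (turn ownership, task stack, user model), but the same event-driven shape applies.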
The ARIA-VALUSPA Platform (AVP) is the main output of this project. It is a modular architecture with three major blocks running as independent binaries whilst communicating using ActiveMQ. It is available from https://github.com/ARIA-VALUSPA/AVP
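In the real AVP the three blocks run as separate binaries and exchange messages through an ActiveMQ broker. To illustrate the decoupling this buys, here is a minimal in-process stand-in for the broker, written as a topic-based publish/subscribe bus; the topic names and message contents are invented for the example and are not the AVP's actual wire format.

```python
from collections import defaultdict

class Bus:
    """In-process stand-in for a message broker such as ActiveMQ."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self.subscribers[topic]:
            handler(message)

bus = Bus()
rendered = []

# "Dialogue Management" block: reacts to analysed input, requests behaviour.
bus.subscribe("input.analysis",
              lambda m: bus.publish("output.behaviour",
                                    {"say": "Hello!", "cause": m}))
# "Output" block: renders whatever behaviour is requested.
bus.subscribe("output.behaviour", rendered.append)

# "Input" block: publishes an analysed user event.
bus.publish("input.analysis", {"smile": True})
```

Because each block only knows topic names, any block can be swapped out or run on another machine without the others noticing, which is exactly what the broker-based design enables.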

The ARIA framework's Input block includes state-of-the-art behaviour sensing, many components of which have been specially developed as part of the project. From audio, we can recognise gender, age, emotion, speech activity, and turn taking, and a separate module provides speech recognition for the three languages targeted by the project. From video, we have implemented face recognition, emotion recognition, detailed face and facial point localisation, and head pose estimation.
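The outputs of these audio and video analysers have to be bundled into messages for the dialogue manager. As a sketch of what such a fused analysis message might look like, here is a hypothetical schema; every field name and value here is an assumption for illustration, not the AVP's real message format.

```python
from dataclasses import dataclass, asdict

@dataclass
class BehaviourAnalysis:
    """Hypothetical per-frame message fusing audio and video analysis."""
    speech_active: bool                      # voice activity detection
    emotion: str                             # e.g. "joy", "neutral"
    gender: str = "unknown"                  # audio-based gender estimate
    age_estimate: float = 0.0                # audio-based age estimate
    head_pose: tuple = (0.0, 0.0, 0.0)       # yaw, pitch, roll in degrees
    transcript: str = ""                     # incremental speech recognition

msg = BehaviourAnalysis(speech_active=True, emotion="joy", transcript="hello")
```

Serialising such a record (e.g. with `asdict`) yields a plain dictionary that can be sent over the message broker to the dialogue manager.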

Based on the ARIA framework we delivered a complete tri-lingual (English, French, German) Virtual Human representing Alice in Wonderland. It has been realised in Greta (Ogre), Living Actor, and Unity3D, and makes full use of the behaviour analysis provided from audio and video.

An important contribution in the first period of the ARIA-VALUSPA project is the NoXi database of mediated Novice-Expert interactions. It consists of 83 dyads recorded in 3 locations (Paris, Nottingham, and Augsburg) spoken in 7 languages (English, French, German, Spanish, Indonesian, Arabic and Italian). The aim of the endeavour was to collect data to study how humans exchange knowledge in a setting that is as close as possible to the intended human-agent setting of the project.

We attained state of the art results by embracing the hugely popular and successful Deep Learning approach to Machine Learning, but doing so in a smart manner. A combination of Deep Learning, Cooperative/Transfer/Active Learning, state of the art sub-systems such as facial point localisation and voice activity detection, state of the art databases, and the highest possible expertise in the behaviour analysis domain has resulted in novel systems that go well beyond the previous state of the art in terms of accuracy and speed.
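One common ingredient of the active learning named above is uncertainty sampling: ask annotators to label only the samples the current model is least sure about, so labelling effort goes where it helps most. The sketch below shows the idea with entropy as the uncertainty measure; the sample names and probabilities are invented, and this is a generic illustration rather than the project's actual pipeline.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a class-probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labelling(pool, k):
    """Return the k sample ids whose predictions are most uncertain.

    `pool` maps sample ids to predicted class probabilities; in active
    learning these would come from the current model's softmax outputs.
    """
    ranked = sorted(pool, key=lambda sid: entropy(pool[sid]), reverse=True)
    return ranked[:k]

pool = {
    "clip_a": [0.98, 0.01, 0.01],  # confident -> low entropy
    "clip_b": [0.34, 0.33, 0.33],  # uncertain -> high entropy
    "clip_c": [0.70, 0.20, 0.10],
}
```

Here `select_for_labelling(pool, 2)` would surface `clip_b` and `clip_c` for human annotation, leaving the confidently classified `clip_a` unlabelled.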

The project has delivered a completely reworked Integrated Speech and Gesture Behaviour Generation system. Instructed by a novel parallel-focus Dialogue Manager architecture and making use of behaviour generation markup standards, it can visualise the generated behaviour with either Greta or Living Actor. Both technologies deliver synchronised speech and face synthesis, and aim to include ever more accurately timed reactive behaviour.
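One widely used behaviour generation markup standard in the virtual-agent community is BML (Behavior Markup Language), which pairs speech with gestures via sync-point references. The sketch below builds a minimal BML-style fragment; the element names follow the public BML convention, but the exact markup consumed by Greta or Living Actor inside the AVP may differ.

```python
import xml.etree.ElementTree as ET

def make_bml(utterance, gesture="beat"):
    """Build a minimal BML-style block pairing speech with a gesture.

    The gesture's start is tied to the speech's start via a sync-point
    reference ("s1:start"), which is how BML expresses synchronisation.
    """
    bml = ET.Element("bml", id="bml1")
    speech = ET.SubElement(bml, "speech", id="s1")
    ET.SubElement(speech, "text").text = utterance
    # Start the gesture when the speech starts.
    ET.SubElement(bml, "gesture", id="g1", lexeme=gesture, start="s1:start")
    return ET.tostring(bml, encoding="unicode")

fragment = make_bml("Welcome to Wonderland!")
```

A renderer that understands such markup can then schedule the lip movement, audio, and gesture animation against the same timeline.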

In terms of impact, we have reached a high-impact agreement with a major multinational company who will be the sponsor of the Industry ARIA. In terms of academic impact, the consortium has published 54 peer-reviewed, open-access publications as part of the project, which equates to 3 publications per month. Of these, 17 are joint public/private publications.
The NoXi database was recorded with the aim of being of wide use, beyond the direct goals and aims of the ARIA-VALUSPA project. For example, we focused on information exchange in general, not just on information exchange about Alice in Wonderland. Another example is that we have included recordings of depth information using a Kinect. On the one hand this serves to automatically generate annotations for the project; on the other, while the project itself will not use depth information, other researchers will probably find it useful. Based on these arguments, and given the large size of the database and the large annotation effort going into it, we are confident that it will be adopted by a large number of researchers.


By the end of the second reporting period, the consortium had published 93 peer-reviewed, open-access publications as part of the project, which equates to almost 3 publications per month.