
Knowledge-Based Information Agent with Social Competence and Human Interaction Capabilities

Periodic Reporting for period 2 - KRISTINA (Knowledge-Based Information Agent with Social Competence and Human Interaction Capabilities)

Reporting period: 2016-03-01 to 2017-02-28

Europe is on the move, and this is not free of challenges. In the case of care in particular, migrants often face a double challenge in their new country of residence: (i) they do not speak the language and are not acquainted with the culture, and (ii) they are unfamiliar with the care and health administrations. As a consequence, elderly migrants in care homes suffer from social exclusion, while their relatives struggle to obtain the right information and to interact with the administration; migrants at home are often reluctant to see a doctor when health issues arise; and temporary migrant care workers lack professional background information and suffer from deficient communication with both the cared-for and the supervising personnel.

KRISTINA’s overall objective is to research and develop technologies for a human-like, socially competent, communicative agent that serves migrants facing language and cultural barriers in the host country as a trusted information provider and mediator in questions related to basic care and healthcare. Patients, elderly people in need of care, and care personnel are targeted. To develop such an agent, a number of scientific and technological challenges must be tackled. It must be ensured that the agent is capable of understanding and correctly interpreting the concerns of the user, expressed through a combination of facial, gestural and verbal signals, in difficult, culturally influenced circumstances. Once the concerns are understood and interpreted, the agent must be able to decide on the correct reaction to them: to search for and deliver information, to refer the user to a health specialist (providing the necessary instructions), to transmit the user’s urgent needs to the health specialist or care personnel, to assist or coach the user, etc. Under all circumstances, the agent must be able to conduct an informed, flexible spoken-language dialogue with the user.
To be of maximal use to the user and to have maximal impact, the KRISTINA agent is targeted to run on tablets/smartphones and PCs/laptops. The agent is developed in a number of software development cycles. Each cycle is concluded by a prolonged evaluation trial with a representative number of migrants recruited as users from the migrant groups identified as especially in need in two European countries: elderly Turkish migrants and their relatives and short-term Polish caregiving personnel in Germany, and North African migrants in Spain. Given that an intelligent social conversation agent as envisaged in KRISTINA has the potential to considerably improve basic care and healthcare, KRISTINA presents a clear business case.
Substantial work has been performed to date on all relevant topics. A first version of the generic adaptive dialogue manager, with the ability to seamlessly integrate knowledge structures and to use general features of dialogue actions, has been implemented. In the context of speech recognition, a total of 40 hours of manually transcribed data and 34 million words of textual material were used to adapt Vocapia’s speech-to-text models to the KRISTINA topics. The feature analysis, acoustic models and speech recognition have been adapted to reduce latency (below 1 second in 75% of cases). Real-time speech recognition was developed for German, Polish, and Spanish. Acoustic models have been improved and language models built for German, Polish, Spanish, and Turkish. For syntactic analysis of transcriptions, a Stack LSTM-based model has been developed. For semantic analysis, deep-syntactic dependency parses are projected onto FrameNet-based structures, which are then translated into OWL representations. The translation is aligned with the DOLCE Ultralite ontology and with the patterns of the Descriptions and Situations ontology.
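The translation step described above can be sketched in miniature: a frame instance (frame name plus frame elements) is turned into RDF-style triples and typed against a DOLCE Ultralite Situation. All frame, role and namespace names in this sketch are illustrative assumptions, not the project’s actual vocabulary.

```python
# Hedged sketch: mapping a FrameNet-style frame instance to OWL/RDF-like
# triples. Frame, role and namespace names are illustrative only.

DUL = "http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#"  # DOLCE Ultralite

def frame_to_triples(frame, instance_id):
    """Translate a frame instance (frame name + frame elements) into
    subject-predicate-object triples, typing the instance as a
    DOLCE Ultralite Situation."""
    subj = f"ex:{instance_id}"
    triples = [(subj, "rdf:type", f"ex:{frame['frame']}"),
               (subj, "rdf:type", f"<{DUL}Situation>")]
    for role, filler in frame["elements"].items():
        # each frame element becomes a property assertion on the instance
        triples.append((subj, f"ex:{role}", f'"{filler}"'))
    return triples

# Example: "My mother has a headache" analysed as a medical-condition frame
parse = {"frame": "Medical_conditions",
         "elements": {"Patient": "mother", "Ailment": "headache"}}
for t in frame_to_triples(parse, "stmt1"):
    print(t)
```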
For non-verbal communication analysis, about 5 hours of dialogue recordings have been annotated with emotional Valence and Arousal (V-A) labels. To ensure a reliable ground truth, a new consensus methodology has been developed. In addition, the following non-verbal modules have been realized: (i) frame-wise real-time V-A detection from audio signals using neural networks; (ii) emotional state derivation based on real-time facial imaging, which generates emotional output in the V-A space and in terms of prototypical facial expressions; (iii) hand and body posture tracking and heuristic derivation of an individual’s level of arousal based on the tracking outcome; (iv) a body model for the recognition of pointing gestures based on hand positions relative to the body. The fusion of the modalities is performed using a novel asynchronous event-based methodology. As far as knowledge representation is concerned, ontological models have been defined and a knowledge base (KB) has been populated to capture background information, user models, conversation histories, etc. A flexible context extraction algorithm has been developed that employs semantic similarity metrics and graph expansion techniques to match user statements against KB structures. A reasoning module has been implemented that is instrumental in retrieving relevant information from the KB to be communicated to the user. To ensure the availability of the required background information on health and basic care, as well as further information in which a user might be interested, a dedicated search engine has been realized. The engine is capable of retrieving both entire web pages and web-page passages. In the area of multimodal communication generation, work has been carried out on modality selection, language generation, and expressive avatar creation.
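The asynchronous event-based fusion can be illustrated with a minimal sketch: each modality pushes a timestamped V-A estimate whenever it has one (no fixed frame rate, no synchronisation across modalities), and the fused value is a confidence- and recency-weighted mean computed at event time. The decay constant and weighting scheme below are assumptions for illustration, not the project’s published method.

```python
import math
from collections import namedtuple

# Hedged sketch of asynchronous, event-based fusion of valence-arousal
# (V-A) estimates from several modalities. Parameters are illustrative.

Event = namedtuple("Event", "modality t valence arousal confidence")

class VAFusion:
    def __init__(self, half_life=2.0):
        # exponential decay so that older events count less
        self.decay = math.log(2) / half_life
        self.events = []

    def push(self, event):
        """Fuse a new estimate the moment any modality delivers one."""
        self.events.append(event)
        return self.fused(event.t)

    def fused(self, now):
        """Confidence- and recency-weighted mean over all events seen."""
        weights = [e.confidence * math.exp(-self.decay * (now - e.t))
                   for e in self.events]
        total = sum(weights)
        v = sum(w * e.valence for w, e in zip(weights, self.events)) / total
        a = sum(w * e.arousal for w, e in zip(weights, self.events)) / total
        return v, a

fusion = VAFusion()
fusion.push(Event("audio", 0.0, -0.2, 0.6, 0.9))   # slightly negative, aroused
v, a = fusion.push(Event("face", 0.4, 0.1, 0.4, 0.7))  # mildly positive, calmer
```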
Thus, a first version of a cognitively motivated modality selection has been implemented that treats all three modes (voice, facial expressions and gestures) as equal, instead of assuming voice to be the dominant mode from which the other modes are derived. In the context of language generation, this involved: syntactic sentence generation based on multiple SVMs, extension of lexical and grammatical resources for multilingual rule-based graph-transducer-based sentence generation, NN-based punctuation generation, and expressive prosody generation.
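To give an idea of what rule-based graph-transducer-style sentence generation involves, the following toy sketch linearises a dependency tree using per-relation ordering rules. The grammar, relation names and example sentence are invented for illustration and bear no relation to the project’s actual resources.

```python
# Hedged toy sketch of graph-transducer-style linearisation: a dependency
# tree is turned into a word string by per-relation ordering rules.

ORDER = {"subj": "before", "obj": "after", "det": "before"}  # head-relative position

def linearise(node):
    """Recursively place each dependent before or after its head word."""
    before, after = [], []
    for rel, child in node.get("deps", []):
        (before if ORDER.get(rel) == "before" else after).append(linearise(child))
    return " ".join(before + [node["word"]] + after)

tree = {"word": "takes",
        "deps": [("subj", {"word": "patient",
                           "deps": [("det", {"word": "the"})]}),
                 ("obj", {"word": "medication",
                          "deps": [("det", {"word": "the"})]})]}
print(linearise(tree))  # → "the patient takes the medication"
```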
In the context of the work on expressive avatar creation, a web-based pipeline for character generation has been set up, and basic versions of an open web-based BML Realizer and of a BML Planner for non-verbal communication have been developed.
Progress beyond the state of the art has been achieved in nearly all of the research areas tackled. Unlike state-of-the-art dialogue managers (DMs), which tend to rely on predefined dialogue scripts, the KRISTINA DM selects dialogue actions dynamically, using reasoning technologies over ontologies. The ontologies are designed to ingest the outcome of the analysis of the user’s multimodal interaction and to integrate it with the context, background knowledge and user-profile information. To analyze the multimodal interactions of the users, cutting-edge speech recognition, language analysis and non-verbal analysis techniques have been developed or adapted to the KRISTINA topics and languages. Also noteworthy in this context is the novel asynchronous modality-fusion technique. On the side of multimodal communication generation, the multilingual language synthesis, the expressive prosody generation and the flexible character-design platform deserve particular mention.
The figure shows the architecture of the KRISTINA agent after the first year of its lifetime.