
Knowledge-Based Information Agent with Social Competence and Human Interaction Capabilities

Periodic Reporting for period 3 - KRISTINA (Knowledge-Based Information Agent with Social Competence and Human Interaction Capabilities)

Reporting period: 2017-03-01 to 2018-02-28

Migrants in Europe often do not speak the language of their country of residence and are not acquainted with its culture. Furthermore, they are unfamiliar with the care and health administrations. As a result, elderly migrants in care homes suffer from social exclusion, while their relatives struggle to obtain the right information and to interact with the administration; migrants living at home are reluctant to see a doctor when health issues arise; and temporary migrant care workers lack background information on the persons they care for.

KRISTINA's overall objective has been to research and develop technologies for a socially competent and communicative embodied agent that serves migrants facing language and cultural barriers in the host country as a trusted information provider and mediator in questions related to basic care and healthcare. To develop such an agent, a number of scientific and technological challenges were tackled. We ensured that the agent is capable of understanding and correctly interpreting the concerns of the user as expressed by a combination of facial, gestural and verbal signals in culturally influenced circumstances. To this end, advanced multilingual speech and language analysis, multimodal affective state identification and semantic reasoning techniques have been developed. Once the concerns are understood and interpreted, the agent decides in its dialogue management module on the correct reaction: to search for and deliver information, to refer the patient to a health specialist (providing the necessary instructions), to communicate the habits and preferences of the care recipient to care personnel, to assist or coach the user, etc. Under all circumstances, the agent aims to conduct an informed, expressive multimodal interaction with the user. This has been achieved by developing advanced spoken language generation technologies and human-like virtual characters that are adapted to the cultural and conversational contexts. To be of maximal use to the user and to have maximal impact, the KRISTINA agent runs on tablets and PCs/laptops.

The agent has been developed in a number of SW development cycles. Each cycle was concluded by a prolonged evaluation trial with a representative number of migrants recruited as users from the migrant groups identified as especially in need in two European countries: elderly Turkish migrants and their relatives and short-term Polish care providers in Germany, and North African migrants in Spain. Given that an intelligent social conversation agent as developed in KRISTINA has the potential to considerably improve basic care and healthcare, KRISTINA presents a clear business case.
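For illustration, the following minimal sketch captures this analyse-interpret-decide-generate cycle in simplified form. All names, types and the toy decision policy are hypothetical and do not reflect KRISTINA's actual implementation.

```python
# Minimal sketch of one interaction cycle of the agent described above:
# analyse -> interpret -> decide -> generate. All names are hypothetical
# illustrations and do not reflect KRISTINA's actual codebase.

from dataclasses import dataclass

@dataclass
class UserTurn:
    speech_text: str        # transcript from speech recognition
    valence: float          # affective estimate, e.g. in [-1, 1]
    arousal: float
    gestures: list[str]     # e.g. ["pointing"]

def interpret(turn: UserTurn) -> dict:
    # Placeholder for semantic analysis plus non-verbal fusion.
    return {"topic": turn.speech_text, "distress": turn.arousal > 0.5}

def select_action(situation: dict) -> str:
    # Placeholder dialogue-management policy.
    if situation["distress"]:
        return "refer_to_specialist"
    return "deliver_information"

def realize(action: str) -> str:
    # Placeholder multimodal generation (speech, face and gesture).
    return f"<agent performs: {action}>"

if __name__ == "__main__":
    turn = UserTurn("Where is the nearest doctor?", 0.1, 0.7, [])
    print(realize(select_action(interpret(turn))))
```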
The KRISTINA Consortium carried out substantial work on all topics related to its goal of developing a multilingual conversation agent with social and cultural competence. A generic dialogue manager has been implemented with the ability to fuse multimodal knowledge structures and to adapt to the cultural background of the user via dialogue strategies trained on culture-specific dialogues.

In the context of speech recognition, a total of 53 hours of manually transcribed data were used to adapt VR's pre-existing speech-to-text models to real-time contexts. The adaptation led to relative accuracy gains between 24% and 50%, measured on KRISTINA data. Furthermore, the feature analysis, acoustic models and speech recognition have been tuned to reduce latency (below 1 second in 75% of the cases) for all KRISTINA languages. For syntactic analysis of transcriptions, a Stack LSTM-based model has been developed, and substantial treebanks for German, Polish, Spanish and Turkish have been annotated. For semantic analysis, deep-syntactic dependency parses are projected onto FrameNet-based structures, which are then translated into OWL representations. The translation is aligned with the DOLCE Ultralight ontology and with the patterns of the Descriptions and Situations ontology.

For non-verbal communication analysis, over 5 hours of dialogue recordings have been annotated with Valence and Arousal (V-A) labels; to ensure reliable ground truth, a new consensus methodology has been developed. Several non-verbal modules have been realized: (i) frame-wise real-time V-A detection from audio signals with neural networks; (ii) emotional state derivation from real-time facial imaging, which generates output in the V-A space and in terms of prototypical facial expressions; (iii) hand and body posture tracking and heuristic derivation of the user's level of Arousal from the tracking outcome; and (iv) a body model for recognition of pointing gestures based on hand positions relative to the body. To fuse the input from the different modules, a novel asynchronous event-based methodology has been researched (sketched below).

For knowledge representation, ontological models have been defined and a knowledge base (KB) has been populated to capture background information, user models, conversation histories, etc. A flexible context extraction algorithm has been developed that employs semantic similarity metrics and graph expansion techniques to match user statements against KB structures (see the second sketch below). A conversation-context-aware reasoning module has been implemented that is instrumental in retrieving relevant information from the KB to be communicated to the user. To ensure the availability of the required background information in which a user might be interested, dedicated search engines have been developed that retrieve information from the web (entire pages or passages) and from social media.

In the area of multimodal communication generation, work covered mode selection, language generation and expressive avatar creation. A cognitively motivated rule- and classifier-based mode selection has been implemented that treats voice, facial expression and gesture as equal modes (instead of assuming voice to be the dominant mode to which the other modes are added). In the context of language generation, this involved multiple SVM-based syntactic sentence generation, the extension of lexical and grammatical resources for multilingual rule-based graph-transducer-based sentence generation, NN-based punctuation generation, and expressive prosody generation.
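As an illustration of how asynchronous event-based fusion can work, the sketch below merges timestamped V-A estimates that arrive independently from different modality modules, weighting each event by its recency. The class names, the exponential-decay weighting and the half-life constant are assumptions for illustration; the report does not specify the actual fusion algorithm.

```python
# Illustrative sketch of asynchronous event-based fusion of Valence-Arousal
# estimates. Modality modules emit events at their own rates; a fused
# estimate can be queried at any time. The recency weighting is an
# assumption, not KRISTINA's documented method.

import time
from dataclasses import dataclass

@dataclass
class VAEvent:
    timestamp: float     # seconds since epoch
    valence: float
    arousal: float
    source: str          # e.g. "audio", "face", "gesture"

class AsyncVAFusion:
    def __init__(self, half_life: float = 2.0):
        self.half_life = half_life            # seconds until a weight halves
        self.events: list[VAEvent] = []

    def push(self, event: VAEvent) -> None:
        """Called by any modality module whenever it produces an estimate."""
        self.events.append(event)

    def estimate(self, now: float | None = None) -> tuple[float, float]:
        """Recency-weighted average over all events received so far."""
        now = time.time() if now is None else now
        weights = [0.5 ** ((now - e.timestamp) / self.half_life)
                   for e in self.events]
        total = sum(weights) or 1.0
        valence = sum(w * e.valence for w, e in zip(weights, self.events)) / total
        arousal = sum(w * e.arousal for w, e in zip(weights, self.events)) / total
        return valence, arousal

if __name__ == "__main__":
    fusion = AsyncVAFusion()
    t0 = time.time()
    fusion.push(VAEvent(t0 - 3.0, valence=-0.2, arousal=0.8, source="audio"))
    fusion.push(VAEvent(t0 - 0.5, valence=0.1, arousal=0.4, source="face"))
    print(fusion.estimate(t0))   # the more recent face event dominates
```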
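The context extraction step can be pictured as follows: starting from the concepts mentioned in a user statement, the candidate set is expanded along KB relations, and candidate statements are ranked by a semantic similarity score. The toy graph, the sample facts and the word-overlap (Jaccard) metric below are illustrative stand-ins; the actual KRISTINA similarity metrics and expansion strategy are not detailed in this summary.

```python
# Toy sketch of context extraction: expand from mentioned concepts along
# knowledge-base relations, then rank candidate KB statements by a simple
# similarity to the user utterance. Graph, facts and metric are
# illustrative assumptions only.

# A miniature KB: concept -> related concepts (graph expansion edges).
KB_GRAPH = {
    "doctor": ["appointment", "clinic"],
    "clinic": ["opening hours", "address"],
    "appointment": ["insurance card"],
}

# Statements attached to concepts, serving as retrievable context.
KB_FACTS = {
    "opening hours": "The clinic is open from 8:00 to 18:00.",
    "insurance card": "Bring your insurance card to the appointment.",
}

def expand(seeds: set[str], hops: int = 2) -> set[str]:
    """Breadth-first graph expansion from the mentioned concepts."""
    frontier, seen = set(seeds), set(seeds)
    for _ in range(hops):
        frontier = {n for c in frontier for n in KB_GRAPH.get(c, [])} - seen
        seen |= frontier
    return seen

def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def extract_context(utterance: str) -> list[tuple[float, str]]:
    seeds = {c for c in KB_GRAPH if c in utterance.lower()}
    ranked = [(jaccard(utterance, KB_FACTS[c]), KB_FACTS[c])
              for c in expand(seeds) if c in KB_FACTS]
    return sorted(ranked, reverse=True)

if __name__ == "__main__":
    for score, fact in extract_context("When can I see the doctor at the clinic?"):
        print(f"{score:.2f}  {fact}")
```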
For expressive avatar creation, a web-based pipeline for the generation of characters has been set up and a number of human-like characters have been created. To interface with mode selection, a BML Realizer and a BML Planner have been developed.
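BML (the Behavior Markup Language) is the standard interchange format between behavior planners and realizers. Below is a hypothetical example of a request such as a BML Planner might hand to a BML Realizer, with a gesture and a facial expression synchronized to a stretch of speech. It follows general BML 1.0 conventions; the concrete BML dialect and extensions used in KRISTINA are not specified in this summary.

```python
# Hypothetical BML 1.0-style request: speech plus a beat gesture and a
# smile whose timing is anchored to the speech element. Illustrative only;
# not taken from the KRISTINA codebase.

BML_REQUEST = """\
<bml id="turn42" xmlns="http://www.bml-initiative.org/bml/bml-1.0"
     characterId="kristina">
  <speech id="s1">
    <text>The clinic is open from eight until six.</text>
  </speech>
  <!-- Raise the hand as the sentence starts; relax when it ends. -->
  <gesture id="g1" lexeme="BEAT" start="s1:start" end="s1:end"/>
  <faceLexeme id="f1" lexeme="SMILE" amount="0.5" start="s1:start"/>
</bml>
"""

print(BML_REQUEST)
```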
KRISTINA goes beyond the state of the art in nearly all of the research areas tackled. Thus, the KRISTINA dialogue manager selects culture-oriented dialogue actions dynamically, using reasoning technologies over ontologies (instead of using predefined scripts). The ontologies are designed to ingest the outcome of the analysis of the user's multimodal interaction and to integrate it with the context, background knowledge and user profile information. To analyze the multimodal interactions of the users, cutting-edge speech recognition, language analysis and non-verbal analysis techniques have been developed or adapted to the KRISTINA topics and languages. The novel asynchronous non-verbal fusion technique should also be highlighted. On the side of multimodal communication generation, the multilingual language synthesis, expressive prosody generation and the flexible character design platform deserve particular mention.
The figure shows the architecture of the KRISTINA agent after the first year of the project.