
H2020

KRISTINA Report Summary

Project ID: 645012
Funded under: H2020-EU.2.1.1.4.

Periodic Reporting for period 1 - KRISTINA (Knowledge-Based Information Agent with Social Competence and Human Interaction Capabilities)

Reporting period: 2015-03-01 to 2016-02-29

Summary of the context and overall objectives of the project

In Europe, migration has a long tradition, and not only since European legislation moved towards the free movement of European citizens. It is not free of challenges. Especially in the case of care, migrants often face a double challenge: (i) not speaking the language and not being acquainted with the culture of the resident country, and (ii) being unfamiliar with the care and health administrations of the country. As a consequence, elderly migrants in care homes suffer from social exclusion, while their relatives struggle to obtain the right information and to interact with the administration; migrants at home are often reluctant to see a doctor in case of health issues, a tendency that is often further aggravated by cultural factors. Migrant temporary care workers, who in addition often lack adequate professional training, face isolation, a lack of professional background information and deficient communication with both the cared-for and the supervision personnel.
KRISTINA's overall objective is to research and develop technologies for a human-like, socially competent and communicative agent that serves migrants with language and cultural barriers in the host country as a trusted information provider and mediator in questions related to basic care and healthcare. Patients, elderly people in need of care, and care personnel are all concerned: as patients, migrants at risk of cultural and social exclusion are reluctant to see a doctor in case of health issues because they feel ashamed, uneasy and not well understood; as elderly people in need of care (often in mental deterioration) they live in deep isolation, with their relatives often overburdened; and as care personnel they are hampered in communication, self-education, etc.
To develop such an agent, a number of scientific and technological challenges must be tackled. It must be ensured that the agent is capable of understanding and correctly interpreting the concerns of the user, expressed by a combination of facial, gestural and verbal signals in difficult, culturally influenced circumstances. Once the concerns are understood and interpreted, the agent must be able to decide on the correct reaction: to search for and deliver information, to refer the user to a health specialist (providing the necessary instructions), to transmit the user's urgent needs to the health specialist or care personnel, to assist or coach the user, etc. Under all circumstances, the agent must be able to conduct an informed, flexible spoken-language dialogue with the user. To be of maximal use to the user and to have maximal impact, the KRISTINA agent is to run on tablets/smartphones and PCs/laptops.
The technologies will be validated in a series of use cases, in which prolonged trials will be carried out for each prototype that marks the end of a software development cycle, with a representative number of migrants recruited as users from the migration circles identified as especially in need in two European countries: elderly Turkish migrants and their relatives and short-term Polish care-giving personnel in Germany, and North African migrants in Spain. Given that an intelligent social coaching agent as envisaged in KRISTINA has the potential, on the one hand, to considerably improve basic care and healthcare and, on the other hand, to significantly reduce costs for the health systems of the member states, KRISTINA furthermore presents a clear business case.

Work performed from the beginning of the project to the end of the period covered by the report and main results achieved so far

The project has just completed the first year of its lifetime. During this year, substantial work has been carried out and substantial advances (reflected in 20 publications) have been made towards the achievement of the objectives of the project. This work concerned all central topics tackled in KRISTINA's DoA: advanced dialogue management models, analysis of multimodal verbal communication, analysis of non-verbal communication, the building of domain-specific ontologies and reasoning over these ontologies, the search for communication-relevant information on the web, and multimodal communication generation, including the design and realization of virtual characters that are suitable for interaction with the targeted users of KRISTINA. In addition, the technical side has been advanced: a roadmap for the development of the KRISTINA agent has been designed, an operational infrastructure of the agent has been implemented, and a number of activities have been carried out towards the implementation of the first prototype of the agent (cf. the figure with the architecture). In parallel, the pilot use cases that will validate the performance of the KRISTINA agent at different stages of its development have been spelled out in more detail, and the monitoring of ethical issues has been put in place.
The first part of the work on all topics consisted of empirical studies of the sample dialogues recorded in different languages to simulate the interaction of the KRISTINA agent with the users targeted in the project. In the context of each topic, topic-relevant aspects were analyzed.
The work on advanced dialogue management models furthermore included research on culture-dependent characteristics of (potentially emotive) dialogues, the design and implementation of a basic dialogue manager (DM), and research on the advanced dialogue manager as well as on the semantic fusion of the input from the different modalities in the communication of users. In the context of the work on the basic DM, the dialogue manager available at the University of Ulm has been adapted to allow for interaction with all relevant modules of the KRISTINA agent, and the corresponding interfaces have been spelled out. Furthermore, the Visual SceneMaker (VSM) platform developed by the University of Augsburg and the ALMA component developed by DFKI have been integrated. In KRISTINA's basic DM, VSM will handle the idle behavior of the agent, while ALMA will serve as the basis for emotion generation. With respect to the advanced DM, the classification of user utterances in the dialogue context, the dynamic generation of system actions, and the learning of a generic adaptive dialogue strategy have been worked on. In the context of the task on semantic fusion of input from different modalities, a qualitative analysis of the recorded dialogues was performed. Based on the outcome of this analysis, a conceptual framework capable of handling semantic fusion has been specified. This framework consists, on the one hand, of a uniform semantic representation format for multimodal input and categorizations of gestures, facial expressions, and speech acts, and, on the other hand, of approaches for the classification of multimodal expressions and a procedural description of the fusion algorithm.
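For illustration, the following Python sketch shows one possible shape of such a uniform representation together with a naive temporal fusion step. All class names, the category sets and the overlap heuristic are illustrative assumptions, not the project's actual format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative taxonomies; the project's categorizations are richer.
SPEECH_ACTS = {"question", "statement", "request"}
GESTURES = {"pointing", "nodding", "shrugging"}
FACIAL = {"smile", "frown", "neutral"}

@dataclass
class ModalityInput:
    modality: str      # "speech", "gesture" or "face"
    label: str         # a category from the taxonomies above
    start: float       # timestamps in seconds
    end: float
    confidence: float

@dataclass
class FusedUtterance:
    speech_act: Optional[str] = None
    gestures: List[ModalityInput] = field(default_factory=list)
    facial_cues: List[ModalityInput] = field(default_factory=list)

def fuse(inputs: List[ModalityInput], window: float = 1.0) -> FusedUtterance:
    """Naive late fusion: attach nonverbal cues that temporally overlap
    (within `window` seconds) with the most confident spoken input."""
    fused = FusedUtterance()
    speech = [i for i in inputs if i.modality == "speech"]
    if not speech:
        return fused
    anchor = max(speech, key=lambda i: i.confidence)
    fused.speech_act = anchor.label
    for i in inputs:
        if i.modality == "speech":
            continue
        if i.start <= anchor.end + window and i.end >= anchor.start - window:
            (fused.gestures if i.modality == "gesture"
             else fused.facial_cues).append(i)
    return fused
```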
After the empirical study, the work on the analysis of multimodal communication targeted multilingual automatic speech recognition and semantic multilingual language analysis. In the scope of the task on speech recognition, research has been done on Deep Neural Network (DNN) acoustic models using bottleneck features that draw upon TRAP-DCT features, and on testing different configurations of DNN training based on bottleneck features with HMM and GMM adaptation and fully connected or p-norm DNNs. Furthermore, work has been carried out on the individual languages covered in KRISTINA (Arabic, German, Polish, Spanish, and Turkish), targeting such aspects as pronunciation modeling, the building of new acoustic models, the revision of the phone sets, etc. At the end of the reporting period, new systems containing improved acoustic models have been made available for Arabic, German, Polish, and Spanish. For Spanish, a second system has been deployed, which also includes specific language models tailored to KRISTINA.
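As a rough illustration of the bottleneck idea, the following PyTorch sketch defines a feedforward acoustic model whose narrow middle layer yields compact features for a downstream GMM/HMM stage. The layer sizes are invented, and the actual systems build on TRAP-DCT-derived inputs and p-norm nonlinearities rather than the plain ReLUs used here.

```python
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    """Feedforward acoustic model with a narrow bottleneck layer whose
    activations serve as compact features for a GMM/HMM system."""
    def __init__(self, n_in=440, n_hidden=1024, n_bottleneck=40, n_states=3000):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_bottleneck),   # bottleneck layer
        )
        self.back = nn.Sequential(
            nn.ReLU(),
            nn.Linear(n_bottleneck, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_states),       # HMM state posteriors
        )

    def forward(self, x):
        return self.back(self.front(x))

    def bottleneck_features(self, x):
        with torch.no_grad():
            return self.front(x)

# Train with cross-entropy against frame-level HMM state alignments,
# then feed model.bottleneck_features(frames) to the GMM/HMM stage.
```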
In the scope of the task on multilingual language analysis, three subtasks have been tackled: (i) annotation of multilayer (prosody, surface-syntax, deep-syntax and communicative structure) treebanks for selected languages covered in KRISTINA; (ii) parsing and analysis technologies; (iii) communicative analysis technologies. The research on parsing and analysis technologies covered syntactic and semantic parsing and the derivation of ontological structures from parser output structures. Several cutting-edge syntactic parsers have been developed; one of them is based on a novel machine learning model that consists of a new control structure for sequence-to-sequence neural networks, the Stack-LSTM, for a transition-based parsing model (or, more generally, an abstract-state machine). With respect to semantic analysis technologies, research focused on deep-syntactic parsing, which produces shallow semantic (lexeme-argument) structures. Investigation has also been carried out within UPF's graph transduction framework on mapping deep-syntactic structures onto semantic structures using the available resources VerbNet, NomBank, PropBank, and FrameNet. The first version of a module that outputs FrameNet-based structures starting from dependency structures is available at https://github.com/talnsoftware/FrameSemantics_Parser. In addition, a preliminary framework for translating the extracted FrameNet-based annotated structures into respective OWL representations has been designed. The translation is aligned with the base ontology DOLCE Ultralite (DUL), and in particular with the pattern of the Descriptions and Situations (DnS) ontology. The latter has been specialised so as to capture the extracted structures as relational contexts (FrameSituation instantiations). Further specializations with respect to the investigated frame categories have been introduced in order to harmonise the intended FrameNet semantics with DUL’s conceptualization and ensure the coherency of the resulting representations.
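To give an idea of the OWL translation just described, the rdflib sketch below encodes a FrameNet frame instance as a DnS-style situation that satisfies a frame description, with frame elements rendered as roles classifying entities in the setting. The example.org namespace and all local names are hypothetical stand-ins for the actual KRISTINA IRIs, and the real FrameSituation modelling is considerably richer.

```python
from rdflib import Graph, Namespace, RDF

DUL = Namespace("http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#")
KR = Namespace("http://example.org/kristina#")   # hypothetical project IRIs

def frame_to_situation(frame_name, frame_elements):
    """Encode a FrameNet frame instance as a DnS-style situation: the
    situation satisfies a description (the frame), and each filled
    frame element becomes a role classifying an entity in the setting."""
    g = Graph()
    g.bind("dul", DUL)
    g.bind("kr", KR)
    situation = KR[frame_name + "Situation_1"]
    g.add((situation, RDF.type, DUL.Situation))
    g.add((situation, DUL.satisfies, KR[frame_name]))
    for element, filler in frame_elements.items():
        entity = KR[filler.replace(" ", "_")]
        role = KR[element]
        g.add((role, RDF.type, DUL.Role))
        g.add((role, DUL.classifies, entity))
        g.add((situation, DUL.isSettingFor, entity))
    return g

# e.g. an utterance like "my back hurts" analyzed as a body-perception frame
g = frame_to_situation("Perception_body",
                       {"Experiencer": "user", "Body_part": "back"})
print(g.serialize(format="turtle"))
```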
In the context of the communicative analysis technologies subtask, the annotation of “communicative dimensions” such as Thematicity has been worked on.
The work on non-verbal (mimics and gestures) communication analysis addressed a series of more specific tasks, in particular facial expression and gesture recognition, the recognition of social and emotional cues in facial expressions, gestures and speech, and the fusion of multiple social and emotional cues. In the context of facial expression and gesture recognition, full processing chains have been implemented. The pipeline for facial expression recognition consists of the following four stages: (1) face detection, using standard cascade detection; (2) localization of characteristic points of the facial geometry (landmarks), based on a combination of shape priors and regression models; (3) extraction of SIFT descriptors from the characteristic points; and (4) classification based on a hidden-task learning approach targeting Action Units (AUs).
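A minimal OpenCV sketch of the four stages follows. The landmark stage is a stub standing in for the shape-prior/regression localizer, and the final hidden-task classifier is left as a downstream consumer of the extracted descriptors.

```python
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
sift = cv2.SIFT_create()

def landmarks(face_roi):
    """Stub for stage (2): in KRISTINA this combines shape priors with
    regression models; here we just place a coarse 5-point grid."""
    h, w = face_roi.shape[:2]
    return [(int(w * x), int(h * y))
            for x, y in [(0.3, 0.4), (0.7, 0.4), (0.5, 0.55),
                         (0.35, 0.75), (0.65, 0.75)]]

def au_features(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)    # stage (1)
    for (x, y, w, h) in faces:
        roi = gray[y:y + h, x:x + w]
        pts = landmarks(roi)                               # stage (2)
        kps = [cv2.KeyPoint(float(px), float(py), 16) for px, py in pts]
        _, desc = sift.compute(roi, kps)                   # stage (3)
        if desc is not None:
            yield desc.flatten()  # stage (4): feed to the AU classifier
```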
The pipeline for gesture recognition starts with the detection of the face, from which the skin mask is iteratively obtained. The skin mask is used to estimate the position of the hands and to segment the regions of interest from the background. Lucas-Kanade optical flow markers as well as first- and second-order velocity estimates are attached to the regions of interest in order to track them over the frames. Based on these traces, movements are classified. The positions of the hands are mapped onto a body estimation of the user to determine where on the body the user is pointing, if the linguistic analysis suggests that this information may be available in the gestural context.
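The following OpenCV sketch illustrates the skin-mask estimation and the Lucas-Kanade tracking step; the YCrCb percentile thresholds and the tracker parameters are illustrative assumptions.

```python
import cv2
import numpy as np

def skin_mask(frame, face_box):
    """Estimate a skin-colour range from the detected face region and
    threshold the whole frame with it (illustrative YCrCb bounds)."""
    x, y, w, h = face_box
    ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
    face = ycrcb[y:y + h, x:x + w].reshape(-1, 3)
    lo = np.percentile(face, 5, axis=0).astype(np.uint8)
    hi = np.percentile(face, 95, axis=0).astype(np.uint8)
    return cv2.inRange(ycrcb, lo, hi)

def track_hands(prev_gray, gray, points):
    """Lucas-Kanade tracking of hand points across frames; first-order
    velocity estimates come from the point displacements.
    `points` is a float32 array of shape (N, 1, 2)."""
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, points, None, winSize=(21, 21), maxLevel=3)
    good = status.ravel() == 1
    velocity = nxt[good] - points[good]
    return nxt[good], velocity
```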
For the classification of nonverbal emotional cues, the feasibility of various machine learning techniques has been explored. To ensure the availability of training material, a task on the annotation of the recordings with respect to emotional cues has been initiated. For the analysis of paralinguistic features of emotional cues, first, the annotation of acoustic features (pitch, loudness and speech rate) with emotion class labels (negative, neutral, positive) has been worked on, with the goal of obtaining training material for a machine learning model. The obtained annotation has then been used for experiments with k-Nearest-Neighbour, Naïve Bayes and Support Vector Machine classifiers. To integrate the verbal and nonverbal inputs for their analysis with respect to social and emotional cues, UAU’s SSI framework has been integrated. SSI is particularly suited for this task due to its patch-based design, which allows pipelines to be set up from autonomic components and to process sensor data from multiple input devices in a parallel and synchronized manner in real time.
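A minimal scikit-learn comparison of the three classifiers might look as follows, assuming a feature matrix with one row of acoustic statistics per utterance; the feature set named in the comment is illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# X: one row per utterance, e.g. [mean pitch, pitch range,
# mean loudness, speech rate]; y: "negative" / "neutral" / "positive".
def compare_classifiers(X, y):
    for name, clf in [("kNN", KNeighborsClassifier(n_neighbors=5)),
                      ("Naive Bayes", GaussianNB()),
                      ("SVM", SVC(kernel="rbf", C=1.0))]:
        scores = cross_val_score(clf, X, y, cv=5)
        print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```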
In the context of the task on the fusion of multiple social and emotional cues, the annotation of the material obtained during the recordings in terms of valence and arousal has been worked on.
The work on ontologies in KRISTINA was preceded by a survey and analysis of relevant state-of-the-art ontologies that can be used in KRISTINA for representing the web-based information pertinent to health-related topics, as well as the information derived from the communication between the user and the KRISTINA agent (lexical semantics of the user utterances, user-specific information, behaviour aspects, etc.). Key ontology modeling requirements have been identified, focusing on the ontologies' purpose, scope, intended end-users, uses and requirements. These requirements have driven the development of the first version of the KRISTINA ontologies, which encode in a structured way the vocabulary and the precise semantics of information relevant to FrameNet conceptualisations, user profiles, behaviour aspects and medical information. They also account for the design of rules for mapping information at different levels of granularity and abstraction, supporting the derivation of higher-level interpretations through semantically enriched and interlinked knowledge graphs. Based upon the first version of the ontologies, a preliminary knowledge integration, reasoning and interpretation layer has been developed.
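As a toy illustration of such layered interpretation, the rdflib sketch below derives a higher-level statement from low-level observations via a SPARQL CONSTRUCT rule; all IRIs and class names are hypothetical stand-ins for the KRISTINA ontologies.

```python
from rdflib import Graph, Namespace, RDF

KR = Namespace("http://example.org/kristina#")   # hypothetical IRIs

g = Graph()
g.bind("kr", KR)
# Low-level observations, e.g. populated from language analysis:
g.add((KR.obs1, RDF.type, KR.PainReport))
g.add((KR.obs1, KR.bodyPart, KR.Back))
g.add((KR.obs1, KR.reportedBy, KR.user42))

# A rule at a higher abstraction layer: any pain report is
# interpreted as a health concern of the reporting user.
rule = """
PREFIX kr: <http://example.org/kristina#>
CONSTRUCT { ?u kr:hasHealthConcern ?o . }
WHERE     { ?o a kr:PainReport ; kr:reportedBy ?u . }
"""
for triple in list(g.query(rule)):
    g.add(triple)

print(g.serialize(format="turtle"))
```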
In the scope of the work on web-based search for user-relevant information, an initial implementation of a document search engine has been realized, and the first version of content extraction from the retrieved material has been implemented. A first version of dynamic search query formulation, based on the interaction of the user with the KRISTINA agent and on the user needs derived from the content encoded in the ontologies, has also been realized.
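A minimal sketch of what such dynamic query formulation could look like is given below, assuming the ontology content is exposed as a simple term-to-related-terms mapping; the expansion and capping heuristics are invented for illustration.

```python
def formulate_query(dialogue_terms, ontology, max_terms=8):
    """Hypothetical dynamic query formulation: expand the terms of the
    current dialogue turn with related concepts from the ontologies."""
    expanded = list(dialogue_terms)
    for term in dialogue_terms:
        expanded.extend(ontology.get(term, []))
    # Deduplicate while preserving order, then cap the query length.
    seen, query = set(), []
    for t in expanded:
        if t.lower() not in seen:
            seen.add(t.lower())
            query.append(t)
    return " ".join(query[:max_terms])

print(formulate_query(["back pain", "sleep"],
                      {"back pain": ["lumbago", "analgesic"]}))
```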
The focus of the research on expressive multimodal communication generation in the first year was on expressive spoken language production and expressive virtual character creation. In the context of expressive spoken language production, a novel stochastic deep-syntactic generator and tree linearization have been worked on. For this purpose, annotation tasks for German, Polish, Spanish and Turkish have also been performed. Furthermore, a theoretical model of prosody in terms of acoustic parameters has been developed, and the first version of the content-to-speech (CTS) module has been designed and realized; it accepts as input the semantic, syntactic and morphological structures produced by the discourse generator and outputs the speech waveform together with a stream of commands that control lip synchronization. For advanced visual speech generation, a novel lip synchronization strategy has been implemented that is based on a Reduced Set of Commands for Expression (RSCE).
The first task in the context of the research on expressive virtual character creation has been to design a pipeline that allows for the rapid creation of virtual characters with facial expression and lip-sync support. Explorations towards the creation of realistic virtual characters, using, among other instruments, a human skin shader, have been carried out. Then, the problem of the development of automated flexible facial animation has been addressed. An approach has been developed in which the emotional state is represented in a 2D emotional space with the two axes ‘valence’ and ‘arousal’. In contrast to most facial animation approaches, which generate expressions from predefined emotions, we generate facial expressions procedurally from the 2D emotional space. Our system is very easy to implement across different virtual characters since the facial expressions are automatically generated from predefined blend shapes. It controls the jaw bone to open and close the mouth and can also easily integrate the lip movements, given that the lip-sync is also done with a set of blend shapes. The facial animation and lip-sync currently need a set of nine blend shapes: smile, sad, kiss, lips closed, mouth full with air, eyebrows down, eyebrows up, eyebrows curved and eyelids. Complementing this work, UPF's web-based 3D scene editor, WebGL Studio, has been extended to play animated scenes, as required by KRISTINA. The 3D scene editor provides several features necessary for testing and developing prototype characters of the KRISTINA agent, such as scripting inside the application, mixing of multiple blend shapes on the GPU or CPU, GUI design, and the import of COLLADA files with animations and custom shaders. Finally, research has been carried out on efficient compression/transmission algorithms for 3D data over the web. Our current algorithms are capable of compressing the reference Happy Buddha 3D model from 85 MB to 3.7 MB and decompressing it ready for visualisation in the browser in just 350 ms.
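To illustrate the procedural mapping from the 2D emotional space to the nine blend shapes, consider the following Python sketch; the hand-tuned weight formulas are an assumption for illustration, not the project's actual mapping.

```python
# The nine blend shapes listed above; weights lie in [0, 1].
SHAPES = ["smile", "sad", "kiss", "lips_closed", "mouth_air",
          "brows_down", "brows_up", "brows_curved", "eyelids"]

def expression_weights(valence, arousal):
    """Map a point in the 2D valence-arousal space (both in [-1, 1])
    to blend-shape weights; the scheme below is illustrative only."""
    w = dict.fromkeys(SHAPES, 0.0)
    w["smile"] = max(0.0, valence)
    w["sad"] = max(0.0, -valence)
    w["brows_up"] = max(0.0, arousal)                         # alertness
    w["brows_down"] = max(0.0, -valence) * max(0.0, arousal)  # anger-like
    w["eyelids"] = max(0.0, -arousal)                         # drowsiness
    return w

print(expression_weights(valence=-0.7, arousal=0.8))
```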
During the second year of the project, the achievements of the first year in the different fields will be further broadened and complemented with new advances.

Progress beyond the state of the art and expected potential impact (including the socio-economic impact and the wider societal implications of the project so far)

Within the first year of its lifetime, KRISTINA made substantial progress beyond the state of the art in a number of fields.
In the context of dialogue management, the conception of the KRISTINA dialogue manager (DM) is to be mentioned: unlike state-of-the-art DMs, it does not rely on predefined system and user actions but rather handles dynamically created actions using reasoning technologies.
In the context of language analysis, it is to be highlighted that since March 2015 the performance of Vocapia's speech-to-text systems has been improved by between 14% (Turkish) and 32% (Spanish) on the available KRISTINA data, and a novel machine learning model for syntactic parsing has been developed. The model consists of a new control structure for sequence-to-sequence neural networks, the Stack-LSTM, for an abstract-state machine. This control structure allows us to formulate efficient transition-based parsing models that capture three facets of a parse state: (i) unbounded look-ahead into the buffer of incoming words, (ii) the complete history of transition actions taken, and (iii) the complete contents of the stack of partially built fragments, including their internal structures. It is to be underlined that this kind of model is not limited to syntactic dependency parsing: we have also applied Stack-LSTMs to phrase-structure parsing, language modeling, and named entity recognition, with outstanding outcomes. Furthermore, considerable advances beyond the state of the art have been made with respect to semantic analysis in that an approach has been developed that projects deep-syntactic dependency parses to FrameNet-based structures (https://github.com/talnsoftware/FrameSemantics_Parser) and a framework has been defined for translating the extracted FrameNet-based annotated structures into respective OWL representations. The translation is aligned with the foundational ontology DOLCE Ultralite (DUL), and in particular with the pattern of the Descriptions and Situations (DnS) ontology. In this context, two further novel contributions have been made. Firstly, an ontology model has been defined to formally capture the FrameNet-based encoded semantics of user speech acts. The model serves as a conceptual interface between language analysis and knowledge integration, mediating the population of the knowledge base with incoming user information and its coupling with domain and background knowledge. To ensure the preservation of the FrameNet semantics required for subsequent reasoning tasks, the DnS pattern has been adopted and extended. Secondly, a well-defined description of behavioural aspects has been proposed to foster knowledge sharing, reuse and interoperability. Following a pattern-based approach for capturing behavioural aspects, we have defined specialized instantiations of the DnS pattern that is part of the DOLCE+DnS Ultralite ontology. The developed patterns treat domain classes as instances so as to allow property assertions among meta-patterns and thus enable the representation of contextualised views on behaviours, affording reusable pieces of knowledge that cannot otherwise be directly expressed by the standard ontology semantics.
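A much-simplified PyTorch skeleton of the Stack-LSTM control structure described above is given below: each of the three structures (stack, buffer, action history) is summarized by an LSTM whose states are kept on a stack, so that pop() restores the summary of the previous contents. The subtree composition function and the full transition system of the published model are omitted.

```python
import torch
import torch.nn as nn

class StackLSTM(nn.Module):
    """Minimal stack LSTM: hidden states are kept on a stack so that
    pop() restores the summary of the previous stack contents."""
    def __init__(self, dim):
        super().__init__()
        self.cell = nn.LSTMCell(dim, dim)
        self.states = [(torch.zeros(1, dim), torch.zeros(1, dim))]  # empty

    def push(self, x):                      # x: tensor of shape (dim,)
        self.states.append(self.cell(x.unsqueeze(0), self.states[-1]))

    def pop(self):
        self.states.pop()                   # never pop the empty base

    def summary(self):
        return self.states[-1][0]           # hidden state of current top

class TransitionParser(nn.Module):
    """The parser state is the concatenation of three stack-LSTM
    summaries: the stack, the buffer, and the action history."""
    def __init__(self, dim, n_actions):
        super().__init__()
        self.stack = StackLSTM(dim)
        self.buffer = StackLSTM(dim)
        self.history = StackLSTM(dim)
        self.score = nn.Linear(3 * dim, n_actions)

    def next_action_logits(self):
        state = torch.cat([self.stack.summary(), self.buffer.summary(),
                           self.history.summary()], dim=-1)
        return self.score(state)

p = TransitionParser(dim=64, n_actions=3)
p.buffer.push(torch.randn(64))          # a word entering the buffer
print(p.next_action_logits().shape)     # torch.Size([1, 3])
```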
From the perspective of nonverbal communication analysis, a significant contribution has been made on Action Unit recognition for facial expression analysis from limited training data. The developed hidden-task learning framework allows us to train Action Unit classifiers using additional data from larger datasets that contain only prototypical facial expression annotations (happiness, anger, sadness, etc.). The use of this additional data has been shown to improve the performance of the classifiers by increasing their ability to generalize to new conditions such as head pose, illumination or subject identity.
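As a rough sketch of exploiting auxiliary prototypical-expression labels, the following PyTorch model shares a trunk between an Action Unit head and an emotion head; this plain multi-task setup is a simplification of the hidden-task learning framework, and all dimensions are illustrative.

```python
import torch.nn as nn

class SharedTrunkAUModel(nn.Module):
    """Multi-task simplification: data carrying only prototypical
    expression labels still trains the trunk used by the AU head."""
    def __init__(self, n_features, n_aus=12, n_emotions=6):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_features, 256), nn.ReLU())
        self.au_head = nn.Linear(256, n_aus)        # multi-label AUs
        self.emo_head = nn.Linear(256, n_emotions)  # happiness, anger, ...

    def forward(self, x):
        h = self.trunk(x)
        return self.au_head(h), self.emo_head(h)
```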
For paralinguistic analysis, state-of-the-art feature extraction and classification techniques that are common within the field of emotion recognition from audio data are applied. However, in the literature, these techniques are most often used to analyse pre-recorded data and spoken sentences as a whole. Furthermore, the examined data tend to feature very strong emotional content at high recording quality. In contrast, within KRISTINA, we are working towards (close to) real-time recognition at the frame level and try to detect subtle emotional tones in very natural dialogues. Similarly, the fusion algorithm that is used in KRISTINA for multimodal affect recognition is state of the art, designed and successfully evaluated in the rather constrained field of audiovisual enjoyment recognition. We are currently adapting this approach to cover the whole affective valence-arousal space and to introduce additional modalities, such as body language and gestures. By expanding the recognition capabilities and broadening the set of observed affective channels, we aim at developing an accurate and flexible solution for detecting natural emotions in real time while dealing with the sensory restrictions of mobile devices.
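The following librosa sketch illustrates frame-level (rather than utterance-level) extraction of pitch and loudness, as required for close-to-real-time processing; the sampling rate, hop size and pitch range are illustrative choices.

```python
import numpy as np
import librosa

def frame_level_features(wav_path, hop_s=0.010):
    """Frame-level pitch and loudness: features are emitted every hop
    rather than once per utterance."""
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(hop_s * sr)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    n = min(len(f0), len(rms))
    return np.stack([f0[:n], rms[:n]], axis=1)   # one row per frame
```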
The fusion framework that has been developed for unification of multimodal data at the semantic (ontological) level is novel. It targets high-level, integrated interpretation of the conversational context for solving disambiguation issues, such as reference resolution, coupling and fusing nonverbal (i.e., facial, gestural and emotional) and verbal features. It capitalizes on the use of OWL ontologies to describe in a formal manner domain entities and their relations, while contextual models and rules are defined at a higher level, capturing situations of interest along with their dependencies on the domain models.
As far as content extraction is concerned, a named entity and concept relation extraction model that combines the accuracy of rule-based NLP templates with the robustness of an SVM classifier has been developed. Leveraging pattern-based methods and machine learning methods based on a variety of lexical, semantic and morpho-syntactic features, the framework goes beyond the state of the art. An enhanced number of compatible yet domain-independent rules makes it easy to reuse the tool’s core when moving to broader domains, while the refined feature selection employed by the machine learning module improves the chances of resolving ambiguities in the relation extraction task.
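A minimal sketch of the hybrid rule/SVM idea follows; the two regular-expression templates, the feature function and the fallback behaviour are invented placeholders for the project's much richer rule set and feature model.

```python
import re

# Illustrative high-precision templates; the project's rule set is richer.
RULES = [
    (re.compile(r"(\w+) is a symptom of (\w+)", re.I), "symptom_of"),
    (re.compile(r"(\w+) is treated with (\w+)", re.I), "treated_with"),
]

def extract_relation(sentence, svm=None, featurize=None):
    """Hybrid extraction: rule templates fire first; an SVM trained on
    lexical/semantic/morpho-syntactic features handles the remainder."""
    for pattern, relation in RULES:
        m = pattern.search(sentence)
        if m:
            return m.group(1), relation, m.group(2)
    if svm is not None and featurize is not None:
        return None, svm.predict([featurize(sentence)])[0], None
    return None

print(extract_relation("Fever is a symptom of influenza"))
```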
In the context of ontology-based querying of user profile details (such as opinions, habits, biographical data, etc.) in semantic dialogue systems, we developed a framework that combines advanced Semantic Web modelling and reasoning techniques. The framework goes beyond traditional question answering approaches, which provide answers only to simple (factoid) queries, by translating questions into SPARQL query patterns. Through multiple abstraction layers and knowledge-driven context interpretation algorithms, the framework's responses encapsulate the context (triples) relevant to a specific question, enabling its further semantic processing and integration at the dialogue management level.
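As a toy example of translating a question into a SPARQL pattern whose CONSTRUCT result preserves the surrounding context triples, consider the following sketch; the template, the namespace and the slot-matching logic are all hypothetical.

```python
# Hypothetical mapping from a question pattern to a SPARQL template;
# the CONSTRUCT keeps the context around the answer as triples.
TEMPLATES = {
    "what does {person} like": """
        PREFIX kr: <http://example.org/kristina#>
        CONSTRUCT {{ ?p kr:likes ?thing . ?thing ?prop ?val . }}
        WHERE     {{ ?p kr:name "{person}" ; kr:likes ?thing .
                     OPTIONAL {{ ?thing ?prop ?val . }} }}
    """,
}

def question_to_sparql(question, slots):
    q = question.rstrip("?").lower()
    for pattern, template in TEMPLATES.items():
        if pattern.format(**slots).lower() == q:
            return template.format(**slots)
    return None

print(question_to_sparql("What does Anna like?", {"person": "Anna"}))
```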
Finally, several advances beyond the state of the art have been made in the area of multimodal communication generation. Particularly worth mentioning are the work on multiple-SVM-based syntactic sentence generation, which achieves higher accuracy scores than state-of-the-art realizers, and the work on expressive character creation. In the context of the latter, the simple web-based interactive expressive facial animation through valence-arousal values and the web-oriented mesh compression are to be highlighted.

Related information

Record Number: 190022 / Last updated on: 2016-11-03