
Holistic Deep Modelling for User Recognition and Affective Social Behaviour Sensing

Periodic Reporting for period 2 - HOL-DEEP-SENSE (Holistic Deep Modelling for User Recognition and Affective Social Behaviour Sensing)

Reporting period: 2020-04-01 to 2021-03-31

The HOL-DEEP-SENSE project aims at holistic machine perception of human characteristics such as demographic traits (age, gender), emotion, personality, and physical and mental conditions. The machine learning methods developed in this project help personalise AI technologies for natural human-computer interaction. Using state-of-the-art deep learning algorithms, the project addresses a major shortcoming of today's recognition systems: affective states (e.g. emotion, sleep deprivation, depression) and other user characteristics are recognised separately. The project therefore seeks to understand the interrelationships between various human phenomena in order to enable human-like machine perception for emotionally and socially intelligent AI. In particular, the overarching objective of HOL-DEEP-SENSE is end-to-end multi-input multi-output learning, i.e. from multi-modal raw signals (audio, visual, physiological) on the front end, through hidden feature computation in deep neural networks, to joint prediction of multiple targets (multi-task learning).
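As a rough illustration of this multi-input multi-output setup, the following sketch (PyTorch assumed; all module names, dimensions and tasks are illustrative, not taken from the project code) encodes each modality separately, fuses the encodings in shared hidden layers, and predicts several targets jointly:

```python
# Minimal sketch of multi-input multi-output learning (PyTorch assumed).
# All names, dimensions and tasks are illustrative, not from the project.
import torch
import torch.nn as nn

class MultiModalMultiTaskNet(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=256, physio_dim=16, hidden=64):
        super().__init__()
        # One encoder per input modality (multi-input front end).
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
        self.physio_enc = nn.Sequential(nn.Linear(physio_dim, hidden), nn.ReLU())
        # Shared hidden layers fuse the modalities into one representation.
        self.shared = nn.Sequential(nn.Linear(3 * hidden, hidden), nn.ReLU())
        # One head per prediction target (multi-output, multi-task learning).
        self.emotion_head = nn.Linear(hidden, 4)   # e.g. 4 emotion classes
        self.age_head = nn.Linear(hidden, 1)       # regression target
        self.gender_head = nn.Linear(hidden, 2)    # binary classification

    def forward(self, audio, visual, physio):
        h = torch.cat([self.audio_enc(audio),
                       self.visual_enc(visual),
                       self.physio_enc(physio)], dim=-1)
        z = self.shared(h)
        return self.emotion_head(z), self.age_head(z), self.gender_head(z)
```

Because all task heads share the same hidden representation, supervision for one attribute (e.g. age) can regularise and inform the others (e.g. emotion), which is the core idea behind the holistic approach.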
In fulfilment of the DoA, the experienced researcher (ER) has carried out foundational research towards holistic machine perception of human phenomena as a whole, using deep learning models to jointly recognise emotion, age, illness, and many other user attributes. In her work on holistic affect recognition, the ER has successfully brought together the fields of Computational Paralinguistics (her PhD background) and Affective Computing (the expertise of the host institution).
The major scientific achievement of the outgoing period is holistic affect recognition using deep learning techniques, which implements two main aspects of the HOL-DEEP-SENSE project. The research outputs are described below.
In Affective Computing, research has aimed at endowing machines with emotional intelligence to support collaboration and interaction with human beings. Despite their performance achievements, today's systems are mostly designed to recognise emotion alone. However, findings in both neuroscience and psychology identify contextual cues as playing a central role in how humans perceive other people's emotions: we attend to individual differences in emotion expression that arise from personal factors and social influence. Hence, analysing the contextual information conveyed in a speaker's voice helps personalise emotion recognition technologies, enabling emotionally and socially intelligent conversational AI.
In the published paper "PaNDA: Paralinguistic Non-metric Dimensional Analysis for Holistic Affect Recognition", a total of 18 speaker attributes are learned together: negative emotion, age, interest, intoxication, sleepiness, the Big Five personality traits (openness, conscientiousness, extroversion, agreeableness, neuroticism), conflict, arousal, valence, cognitive load, physical load, sincerity, cold, and stress. A detailed explanation of the approach and a discussion of the results can be found in the open-access publication (https://dspace.mit.edu/handle/1721.1/123806) as well as on the project webpage (https://www.media.mit.edu/projects/hol-deep-sense/overview/).
Another major achievement of the project phase is the development of machine learning methods for efficient data annotation, leading to one journal article published in the IEEE Transactions on Cybernetics ("A Generic Human-Machine Annotation Framework Based on Dynamic Cooperative Learning") and another submitted to the Journal of Machine Learning Research open-source track. The new software will be made publicly available to the research community upon publication and is broadly applicable to machine learning and data mining tasks that require multi-target training sets. It also provides the first open-source implementation of a multi-task shared-hidden-layer DNN capable of handling missing labels, based on the work presented above.
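Pending the software release, the following sketch illustrates one common way a shared-hidden-layer multi-task DNN can handle missing labels; it is a minimal example assuming PyTorch, with NaN encoding and mean-squared-error losses chosen for illustration rather than taken from the published framework:

```python
# Illustrative sketch (PyTorch assumed): multi-task training where each
# sample may lack labels for some targets. Missing labels are encoded as
# NaN and masked out of the loss, so gradients flow only from observed ones.
import torch
import torch.nn as nn

def masked_multitask_loss(predictions, targets):
    """predictions, targets: tensors of shape (batch, n_tasks)."""
    mask = ~torch.isnan(targets)                    # True where a label exists
    safe_targets = torch.where(mask, targets, torch.zeros_like(targets))
    per_element = nn.functional.mse_loss(predictions, safe_targets,
                                         reduction="none")
    # Average only over the observed labels.
    return (per_element * mask).sum() / mask.sum().clamp(min=1)
```

Regression losses are used here for simplicity; classification targets would contribute per-task masked cross-entropy terms in the same way.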
In today's speech analysis systems, emotion recognition has been tackled with single-task learning algorithms that model emotion as an isolated phenomenon. The ER's background is Computational Paralinguistics, i.e. the analysis of vocal cues to speaker characteristics. In the work described above, the ER has successfully united the field of Computational Paralinguistics with the field of Affective Computing, founded by the ER's supervisor at the MIT Media Lab, Prof. Rosalind Picard. The key idea is to recognise emotion along with personal attributes including age, gender, personality, health condition and cultural background. The machine learning models developed in the ER's work can be leveraged to endow AI with human-like perceptual, reasoning and learning abilities that support collaboration and communication with human beings. They will also help personalise AI technologies for natural human-computer interaction. Auditory speech analysis is of particularly high practical relevance, since speech is among the most natural human communication mechanisms; applications include computer-assisted pronunciation training, safety and security monitoring (detection of alcohol intoxication and sleepiness), digital healthcare, speaker diarisation (assessing "who speaks when"), and voice-operated personal assistants (Amazon Alexa, Google Home) in IoT (Internet of Things)-inspired smart home environments.
In ongoing research, the ER has been working towards multi-modal end-to-end learning. To this end, she has developed an end-to-end speech processing system that takes raw audio as input, thereby dispensing with the need for traditional feature extraction. Since the multi-task output and the deep analysis part have already been implemented, the next step is to conjoin the multi-modal front end with the multi-task DNN.
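As a sketch of what such a raw-audio front end can look like (PyTorch assumed; all hyperparameters are illustrative, not taken from the project system), 1-D convolutions learn filter banks directly from the waveform instead of relying on hand-crafted features such as MFCCs:

```python
# Minimal raw-audio front end (PyTorch assumed; hyperparameters illustrative).
# 1-D convolutions learn filter banks directly from the waveform, replacing
# hand-crafted acoustic feature extraction.
import torch
import torch.nn as nn

raw_audio_frontend = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=400, stride=160),  # ~25 ms windows, 10 ms hop at 16 kHz
    nn.ReLU(),
    nn.Conv1d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),                        # pool over time -> utterance embedding
)

waveform = torch.randn(8, 1, 16000)                   # batch of 1-second 16 kHz clips
embedding = raw_audio_frontend(waveform).squeeze(-1)  # shape: (8, 64)
```

The resulting utterance embedding would then feed the shared hidden layers and multi-task output heads described earlier, completing the end-to-end pipeline from raw signal to joint predictions.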
Figure: Overview of TAC journal