Periodic Reporting for period 2 - HOL-DEEP-SENSE (Holistic Deep Modelling for User Recognition and Affective Social Behaviour Sensing)
Reporting period: 2020-04-01 to 2021-03-31
The major scientific achievement of the outgoing phase is holistic affect recognition using deep learning techniques, which addresses two main aspects of the HOL-DEEP-SENSE project. The research outputs are described below.
In Affective Computing, research has aimed at endowing machines with emotional intelligence to support collaboration and interaction with human beings. Despite the performance of today's systems, most are designed to recognize emotion in isolation. However, findings in both neuroscience and psychology identify contextual cues as playing a central role in how humans perceive other people's emotions: we attend to individual differences in emotion expression, which arise from personal factors and social influence. Hence, analysing the contextual information conveyed in a speaker's voice helps personalize emotion recognition technologies, which in turn will enable emotionally and socially intelligent conversational AI.
In the published paper "PaNDA: Paralinguistic Non-metric Dimensional Analysis for Holistic Affect Recognition", a total of 18 speaker attributes are learned jointly: negative emotion, age, interest, intoxication, sleepiness, the Big Five personality traits (openness, conscientiousness, extroversion, agreeableness, neuroticism), conflict, arousal, valence, cognitive load, physical load, sincerity, cold, and stress. A detailed explanation of the approach and a discussion of the results can be found in the open-access publication of the paper (https://dspace.mit.edu/handle/1721.1/123806) as well as on the project webpage (https://www.media.mit.edu/projects/hol-deep-sense/overview/).
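As an illustration of how such attributes can be learned jointly, the following minimal sketch (in PyTorch) shows a multi-task network with shared hidden layers and one output head per speaker attribute. The layer sizes, feature dimensionality and overall configuration are illustrative assumptions and do not reproduce the published PaNDA model.

import torch
import torch.nn as nn

NUM_TASKS = 18     # negative emotion, age, interest, ..., stress
FEATURE_DIM = 512  # illustrative size of a per-utterance acoustic feature vector

class SharedHiddenMultiTaskDNN(nn.Module):
    """Shared hidden layers with one scalar output head per speaker attribute."""
    def __init__(self, in_dim=FEATURE_DIM, hidden=256, num_tasks=NUM_TASKS):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # one prediction per attribute, all heads fed by the same representation
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(num_tasks)])

    def forward(self, x):
        h = self.shared(x)
        # stack per-task predictions into a (batch, num_tasks) tensor
        return torch.cat([head(h) for head in self.heads], dim=1)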
Another major achievement during the project phase is the development of machine learning methods for efficient data annotation, leading to one journal article published in IEEE Transactions on Cybernetics ("A Generic Human-Machine Annotation Framework Based on Dynamic Cooperative Learning") and another article submitted to the open-source track of the Journal of Machine Learning Research. The new software will be made publicly available to the research community upon publication and is broadly applicable to machine learning and data mining tasks that require multi-target training sets. It also provides the first open-source implementation of a multi-task shared-hidden-layer DNN capable of handling missing labels, based on the work presented above.
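To make the handling of missing labels concrete, the following minimal sketch shows one common way to train such a multi-task network when not every example carries all targets: missing labels are encoded as NaN and masked out of the loss. This illustrates the general technique only and is not taken from the released framework.

import torch
import torch.nn.functional as F

def masked_multitask_loss(predictions, targets):
    """Mean squared error over observed targets only.

    predictions, targets: tensors of shape (batch, num_tasks);
    missing labels in `targets` are encoded as NaN and excluded.
    """
    mask = ~torch.isnan(targets)
    # replace NaNs so the elementwise loss is well defined, then mask them out
    safe_targets = torch.where(mask, targets, torch.zeros_like(targets))
    per_element = F.mse_loss(predictions, safe_targets, reduction="none")
    return (per_element * mask).sum() / mask.sum().clamp(min=1)

A single training step would then compute masked_multitask_loss(model(features), targets) and back-propagate as usual, so examples contribute gradients only for the attributes they are actually labelled with.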
In ongoing research, the ER has been working towards multi-modal end-to-end learning. To this end, she has developed an end-to-end speech processing system that takes raw audio as input, thereby dispensing with the need for traditional hand-crafted feature extraction. Since the multi-task output and the deep analysis part have already been implemented, the next step is to conjoin the multi-modal frontend with the multi-task DNN.
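A minimal sketch of how such a conjoined system could look is given below (in PyTorch): a 1-D convolutional frontend operates directly on the raw waveform and feeds shared layers with one output head per attribute. The kernel sizes, strides and layer widths are illustrative assumptions, not the configuration developed in the project.

import torch
import torch.nn as nn

class RawAudioMultiTaskNet(nn.Module):
    """Convolutional frontend on raw audio feeding shared layers and task heads."""
    def __init__(self, num_tasks=18, hidden=256):
        super().__init__()
        # learn filterbank-like representations directly from the waveform
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=400, stride=160), nn.ReLU(),  # ~25 ms window / 10 ms hop at 16 kHz
            nn.Conv1d(64, 128, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time to a fixed-size utterance embedding
        )
        self.shared = nn.Sequential(nn.Linear(128, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(num_tasks)])

    def forward(self, waveform):           # waveform: (batch, samples)
        x = waveform.unsqueeze(1)          # -> (batch, 1, samples)
        h = self.frontend(x).squeeze(-1)   # -> (batch, 128)
        h = self.shared(h)
        return torch.cat([head(h) for head in self.heads], dim=1)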