A significant portion of this year's work focused on implementing processing pipelines that identify a user's emotional state from sensor data. We primarily used four modalities:
• Video streams of facial expressions, captured by a camera.
• Prosody, based on audio streaming captured from a microphone.
• Text data, which can come from manual input (for example, in the case of chatbots), online data, or a speech-to-text algorithm.
• Data from biometric sensors.
Following a study of the existing literature on each modality, we implemented a distinct approach for each one:
• For video, we adopted a Deep Learning approach, notably using Convolutional Neural Networks (CNNs) on the video stream.
• Prosody and biometric sensors also rely on a machine learning approach, but one based on metrics derived from a pre-processing phase that uses 'classic' signal processing algorithms.
• Lastly, text relies on classical Natural Language Processing (NLP) algorithms combined with models that associate the semantic meaning of words or expressions with emotional intensities. We initially attempted a Deep Learning method for consistency with the other modalities but, contrary to our expectations, the NLP approach proved more fruitful.
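As an illustration of the text approach described above, the following is a minimal sketch of a lexicon-based scorer, not our production model. The lexicon entries, the intensity values, the negation and intensifier word lists, and the names `LEXICON` and `score_text` are all hypothetical and chosen only for the example.

```python
import re

# Hypothetical lexicon: word -> {emotion: intensity in [0, 1]}.
# A real lexicon would be far larger and built from annotated data.
LEXICON = {
    "furious": {"anger": 0.9},
    "annoyed": {"anger": 0.5},
    "delighted": {"joy": 0.9},
    "happy": {"joy": 0.7},
    "worried": {"fear": 0.6},
    "sad": {"sadness": 0.7},
}
# Hypothetical modifier lists (assumptions for this sketch).
NEGATIONS = {"not", "never", "no"}
INTENSIFIERS = {"very": 1.3, "extremely": 1.5, "slightly": 0.6}


def score_text(text):
    """Return {emotion: accumulated intensity} for the words found in the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = {}
    for i, token in enumerate(tokens):
        entry = LEXICON.get(token)
        if entry is None:
            continue
        # Ignore an emotional word negated within the two preceding tokens.
        if NEGATIONS & set(tokens[max(0, i - 2):i]):
            continue
        # Scale the intensity if the previous token is an intensifier.
        previous = tokens[i - 1] if i > 0 else ""
        weight = INTENSIFIERS.get(previous, 1.0)
        for emotion, intensity in entry.items():
            scores[emotion] = scores.get(emotion, 0.0) + min(1.0, weight * intensity)
    return scores


if __name__ == "__main__":
    result = score_text("I am very happy but slightly worried")
    print(result)
    print(max(result, key=result.get))
```

The sketch only shows the principle of mapping words to emotional intensities; handling of multi-word expressions and of languages other than English is deliberately left out.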
We have also developed a multi-modal fusion algorithm. It analyzes the results obtained from each modality and, assuming that all modalities observe the same phenomenon (synchronized data streams), computes the most probable emotional state across them. The algorithm stems from our internal research, and we have not found it in the academic literature. It takes into account belief functions describing each modality's ability to capture each emotion, as well as the reliability of the emotional measurements, and it can accommodate any combination of sensors. This represents a strong competitive differentiator compared to existing approaches.
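To make the notions of belief functions and reliability more concrete, the sketch below uses two textbook tools from Dempster-Shafer theory: Shafer discounting and Dempster's rule of combination. This is explicitly not our internal fusion algorithm, only a standard stand-in that shares the same ingredients. The emotion labels, the mass values, the reliability figures, and the function names are assumptions made for the example.

```python
from itertools import product

# Hypothetical frame of discernment (set of possible emotions).
EMOTIONS = frozenset({"joy", "anger", "sadness", "neutral"})


def discount(mass, reliability):
    """Shafer discounting: keep a share `reliability` of each mass and
    transfer the remainder to the full frame (total ignorance)."""
    out = {s: reliability * m for s, m in mass.items() if s != EMOTIONS}
    out[EMOTIONS] = 1.0 - reliability + reliability * mass.get(EMOTIONS, 0.0)
    return out


def combine(m1, m2):
    """Dempster's rule: multiply masses, keep non-empty intersections,
    and renormalize by the non-conflicting mass."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb
    if 1.0 - conflict < 1e-12:
        raise ValueError("sources are in total conflict")
    return {s: m / (1.0 - conflict) for s, m in combined.items()}


def fuse(readings):
    """Fuse any number of (mass, reliability) pairs, one per available sensor."""
    fused = {EMOTIONS: 1.0}  # vacuous belief: nothing is known yet
    for mass, reliability in readings:
        fused = combine(fused, discount(mass, reliability))
    return fused


def most_probable(mass):
    """Pignistic decision: share each set's mass equally among its emotions."""
    scores = {e: 0.0 for e in EMOTIONS}
    for focal, m in mass.items():
        for e in focal:
            scores[e] += m / len(focal)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical outputs of two modalities for the same time window.
    video = {frozenset({"joy"}): 0.7, frozenset({"joy", "neutral"}): 0.2, EMOTIONS: 0.1}
    audio = {frozenset({"anger"}): 0.4, frozenset({"joy"}): 0.4, EMOTIONS: 0.2}
    fused = fuse([(video, 0.9), (audio, 0.6)])
    print(most_probable(fused))
```

Because `fuse` starts from a vacuous belief and accepts a list of any length, the same code runs with one, two, or more modalities, which mirrors the requirement that fusion work with any combination of sensors.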
We also created several validation datasets, both general and tied to specific use cases. Notable datasets include:
A 'natural' audio dataset dedicated to prosody. We tested our model on these recordings, which allowed us to verify that the performance of our system is independent of the cultural background of the observed individual.
The videos represent various potential use cases of our technology:
• Clips from television programs, such as political debates.
• Streams from the eSports domain.
• For audio-only content, we annotated emergency call recordings (such as 911 calls in the United States).
We also had the opportunity to work with a major Japanese banking institution, whose challenge was the analysis of transcriptions of recorded telephone conversations with its clients. This work allowed us to test our text-to-emotion algorithms on a language other than English, which had been the only language supported by our system until then (the provided texts were in Japanese).
For this work, we attempted three different approaches:
• Translation of the provided text into English. This approach proved unsuccessful as many nuances specific to Japanese were lost.
• Training on data translated into Japanese.
• Use of a native Japanese dataset, augmented by the available data. We ultimately opted for this last approach, which proved very effective, with performance rates approaching 95%.