Final Report Summary - EMOTAG (Emotionally-based Tagging of Multimedia Content)
In the first part of this project, the researcher analyzed the spontaneous behavioral responses of users to tags that mismatch the images they accompany. For example, an image depicting a standing person was shown with a tag that reads "sitting". The analysis of these spontaneous responses, i.e., agreement or disagreement with the displayed tags, provided insight into how users respond to mismatching labels. First, it was found that by combining the brain responses of multiple users, it is possible to detect the event-related response associated with a mismatch, which cannot be detected in a single user's response due to the high level of noise. Second, the eye gaze pattern provides the most reliable signal for detecting mismatches: users, on average, spend more time looking at mismatching labels than at matching labels.
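As an illustration of the first point, the following minimal sketch (in Python with NumPy; the array shapes, user count, and data are hypothetical placeholders, not the project's actual recordings) shows how averaging time-locked epochs across users can reveal an event-related response that remains hidden in any single user's noisy signal:

import numpy as np

# Hypothetical layout: one epoch per user, time-locked to the tag onset.
# epochs[u] is a (n_channels, n_samples) array for user u; in a single
# epoch the event-related response is buried in noise.
n_users, n_channels, n_samples = 20, 32, 512
rng = np.random.default_rng(0)
epochs = rng.normal(size=(n_users, n_channels, n_samples))  # placeholder EEG

# Averaging the time-locked epochs across users attenuates the uncorrelated
# noise (roughly by a factor of sqrt(n_users)), while the mismatch-related
# response, which is phase-locked to the stimulus, survives the average.
grand_average = epochs.mean(axis=0)   # shape: (n_channels, n_samples)

# Computing the same grand average separately for mismatching and matching
# trials and comparing the two (e.g. their difference wave) isolates the
# component associated with the mismatch.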
In the first year of the fellowship, the researcher, Dr. Soleymani, prepared, annotated and analyzed spontaneous responses to emotional videos. In the second year of his fellowship, he focused on detecting continuous emotions from brain waves, or electroencephalogram (EEG) signals, and facial expressions. He also studied the interaction between these two modalities and found that most of the emotionally informative content of the EEG signals is caused by interference from facial muscle activity during expressions. He developed and tested different methods for continuous affect detection, including continuous conditional random fields (CCRF) and deep long short-term memory recurrent neural networks (deep LSTM-RNN). The results showed that the deep LSTM-RNN outperforms the other existing methods for continuous detection of affect. Advances in parallel processing technology, specifically graphics processing units (GPUs) and the Compute Unified Device Architecture (CUDA) libraries developed by NVIDIA, have facilitated the use of multi-layer neural networks, or deep architectures. The project achieved state-of-the-art performance in valence detection from facial expressions and, for the first time, reported continuous affect detection, both in time and space, using EEG signals.
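As an illustration of the kind of sequence model referred to above, the following minimal sketch implements an LSTM recurrent network for frame-wise continuous valence regression in Python with PyTorch; the feature dimension, layer sizes, and training step are illustrative assumptions rather than the configuration used in the project:

import torch
import torch.nn as nn

class AffectLSTM(nn.Module):
    # Two-layer LSTM that maps a sequence of per-frame features (e.g. facial
    # descriptors or EEG band powers) to a continuous valence value per frame.
    def __init__(self, n_features=64, hidden_size=128, n_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, n_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)      # one valence value per frame

    def forward(self, x):                          # x: (batch, time, n_features)
        out, _ = self.lstm(x)                      # out: (batch, time, hidden_size)
        return self.head(out).squeeze(-1)          # (batch, time)

# Training minimises a frame-wise regression loss against the continuous
# annotations, e.g. mean squared error on placeholder data:
model = AffectLSTM()
features = torch.randn(8, 200, 64)                 # hypothetical feature sequences
targets = torch.randn(8, 200)                      # hypothetical valence traces
loss = nn.MSELoss()(model(features), targets)
loss.backward()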
Music, as an art form, is intentionally composed to be emotionally expressive, and the emotional features of music are invaluable for music indexing and recommendation. In collaboration with researchers at Drexel University, USA, and Academia Sinica, Taiwan, the researcher also developed a new dataset for the continuous emotional characterization of music. He created, for the first time, a public dataset of Creative Commons-licensed songs that are continuously annotated along emotional dimensions. Each song received 10 annotations on Amazon Mechanical Turk, and the annotations were averaged to form the ground truth. He also led a comparative study of four systems for automatic music emotion recognition, which employed different feature sets and training schemes. The comparative study found that deep recurrent neural networks are also effective in capturing the temporal dynamics of music.
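As an illustration of how such per-worker annotations could be combined, the following minimal Python sketch averages 10 per-worker valence-arousal traces into a per-song ground truth; the data layout, song identifier, trace length, and annotation scale are hypothetical:

import numpy as np

# Hypothetical layout: annotations[song] holds the 10 per-worker traces for
# that song, each a (n_timesteps, 2) array of (valence, arousal) ratings.
rng = np.random.default_rng(0)
annotations = {"song_001": rng.uniform(-1.0, 1.0, size=(10, 60, 2))}

# The ground truth for each song is the mean over workers at every time step.
ground_truth = {song: traces.mean(axis=0) for song, traces in annotations.items()}
print(ground_truth["song_001"].shape)   # (60, 2): one (valence, arousal) pair per step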
The work conducted during this fellowship has been disseminated at leading international venues, including the IEEE Conference on Multimedia and Expo 2014 and the ACM Conference on Multimedia 2013. The conducted research has advanced the state of the art in the automatic detection of human behavioral responses, with the goal of improving multimedia retrieval and recommendation.