Community Research and Development Information Service - CORDIS

Final Report Summary - NOVICOM (Automatic Analysis of Group Conversations via Visual Cues in Non-Verbal Communication)

In the NOVICOM project, conducted in the Social Computing group at Idiap, we explored models that estimate social behavior from both audio and visual nonverbal cues, with a specific focus on visual cues. We concentrated on a selected number of key research tasks in social interaction analysis: the automatic estimation of dominance in groups, of emergent leadership, and of personality. In these situations, people unconsciously display visual cues, in the form of gestures and body postures, that partly reveal their social attributes. For each task, our objectives were twofold: first, to automatically detect the visual nonverbal cues displayed during interaction; second, to investigate multimodal approaches that integrate audio and visual nonverbal cues to infer social concepts.

For our research, we used publicly available data from previous EC-funded projects at Idiap, as well as newly collected data. For the new data, we were mainly interested in recording natural conversations. For this purpose, we designed a portable audio-visual recording system, which includes two webcams and a microphone array and can record four people sitting around a table. This work involved an international collaboration with a startup company that is developing new audio capture devices. Our system allows group interaction to be captured outside the laboratory, with people who volunteer to participate in a number of group discussion scenarios.

As a first task, we concentrated on modeling dominance. We investigated the use of visual nonverbal cues to estimate the most dominant and least dominant person in a group conversation. For each meeting, we process the audio and visual recordings obtained from microphones and cameras. For the audio, one crucial task is speaker segmentation (i.e. determining who speaks when) from the microphone data. For the video, we locate the participants and track their visual activity. Once we have this information, we extract several audio-visual nonverbal features for each participant, including audio features such as speaker turns, interruptions, and speech energy, and visual features such as total body activity and the use of head and hand gestures. For inference, we developed multimodal fusion techniques that exploit audio and visual nonverbal information jointly. Using fused audio and visual nonverbal cues, we achieved accuracies of around 90% in estimating the most dominant person in a meeting, measured against human judgments.

Another task we worked on is the identification of emergent leadership in small groups. For this purpose, we collected a new dataset with the portable recording system, consisting of meeting recordings of newly formed groups trying to solve a task. We applied a multimodal fusion approach to combine different audio and visual nonverbal cues and observed around 80% accuracy in estimating the emergent leader of the group. Both studies showed that visual information is necessary and should be used together with audio to achieve better performance in dominance and emergent leadership estimation.

As another dimension of social behaviour analysis, we investigated automatically extracted audio and visual nonverbal cues as descriptors of personality. We showed that the combination of audio and visual cues can predict 34% of the variance of the Extraversion trait, the personality trait most reliably judged by human observers.
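The fusion idea described above can be illustrated with a minimal sketch. The feature names, weights, and toy values below are hypothetical and chosen only for illustration; the actual NOVICOM features and fusion method are not reproduced here. The sketch min-max normalises each audio and visual cue across participants, combines them with a weighted sum (score-level fusion), and picks the participant with the highest fused score as the most dominant:

```python
# Hypothetical sketch of score-level fusion for dominance estimation.
# Feature names, weights, and values are illustrative, not the
# project's actual cues or parameters.

def most_dominant(features, weights):
    """Return the participant whose fused nonverbal score is highest.

    features: dict mapping participant -> dict of feature_name -> raw value
    weights:  dict mapping feature_name -> fusion weight
    """
    # Min-max normalise each feature across participants so that
    # audio and visual cues are on a comparable scale before fusion.
    names = list(weights)
    normed = {}
    for name in names:
        vals = [f[name] for f in features.values()]
        lo, hi = min(vals), max(vals)
        span = (hi - lo) or 1.0
        for p, f in features.items():
            normed.setdefault(p, {})[name] = (f[name] - lo) / span

    # Weighted-sum (score-level) fusion of the normalised cues.
    scores = {p: sum(weights[n] * normed[p][n] for n in names)
              for p in features}
    return max(scores, key=scores.get)


# Toy four-person meeting: speaking time and interruptions stand in
# for audio cues; body activity stands in for a visual cue.
meeting = {
    "A": {"speaking_time": 310.0, "interruptions": 7, "body_activity": 0.8},
    "B": {"speaking_time": 120.0, "interruptions": 2, "body_activity": 0.4},
    "C": {"speaking_time": 95.0,  "interruptions": 1, "body_activity": 0.9},
    "D": {"speaking_time": 60.0,  "interruptions": 0, "body_activity": 0.2},
}
w = {"speaking_time": 0.5, "interruptions": 0.3, "body_activity": 0.2}
print(most_dominant(meeting, w))  # prints "A"
```

In practice the fusion weights would be learned or validated against human dominance annotations rather than set by hand, but the structure of the decision stays the same.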
