Automatic Detection of Conflict Escalation and Resolution in Social Interactions

Final Report Summary - CONFER (Automatic Detection of Conflict Escalation and Resolution in Social Interactions)

The goal of this project was the automatic analysis of conflict in dyadic interactions from real-world, naturalistic audiovisual data. This goal was challenging mainly because 1) existing work has largely neglected the fact that observed behaviours may be influenced by those of an interlocutor, so that both interactants need to be analysed at the same time, and 2) there was an overall lack of suitable annotated data that could be used to train machine learning detectors for the recognition of conflict.

In the first part of this project, the researcher, Dr Yannis Panagakis, prepared and coordinated the annotation of data. In particular, videos were extracted from more than 60 hours of live political debates televised between 2011 and 2012. In contrast with other benchmarks, political debates are real-world competitive multi-party conversations where participants do not act in a simulated context, but take part in an event that has a major impact on their real life (for example, in terms of election results). Consequently, even if some constraints are imposed by the debate format, the participants have real motivations leading to real conflicts. From the entire dataset, 160 video excerpts, with a total duration of 2 h 50 min, have been extracted. For each episode of conflict, the database also contains an episode of conflict-free interaction between the same two people. Each video in the dataset is an audiovisual TV recording in which both people involved in the dyadic episode are in view.
The data have been annotated in terms of continuous conflict intensity, as well as continuous valence and arousal, by 10 expert annotators. The annotators assigned conflict, valence and arousal intensity levels, in the range [0, 1], to each video frame using a joystick-based annotation tool while watching each video excerpt in real time. They were advised to annotate the videos by considering both the physical layer (related to the behaviour being observed) and the inferential layer (related to the interpretation of the discussion) of the conversation. The physical layer comprises the behavioural cues observed during conflicts, including interruptions, overlapping speech and cues related to turn organisation in conversations, as well as head nodding, fidgeting and frowning. The inferential layer is based on the perception of competitive processes, where conflict is considered as a “mode of interaction” where “the attainment of the goal by one party precludes its attainment by the others”. For instance, conflicting goals often lead to attempts to limit, if not eliminate, the speaking opportunities of others in conversations. To combine the subjective judgements of multiple annotators, Dynamic Probabilistic CCA with time warping has been employed, yielding an average annotation for each video excerpt.

To model the dynamics of conflict-related behaviour, audio and visual features have been extracted. In particular, the audio content of each episode in the dataset is parameterised in terms of prosodic and spectral features, namely pitch-related features, the mean and root-mean-square (RMS) energy, the Mel-frequency cepstral coefficients (MFCCs) and the delta (differential) MFCCs. Facial behavioural cues related to conflict include head nodding, blinks, fidgeting and frowning. Consequently, conflict can be captured visually by tracking the head pose, lips, eyebrows, eyelids and related facial characteristics of the interactants in video sequences by means of facial landmark points.
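As an illustration of two of the audio descriptors named above, the following is a minimal NumPy sketch of frame-wise RMS energy and of delta (differential) coefficients. It is not the project's feature extractor: the function names, the 25 ms frame and 10 ms hop used in the usage note, and the regression width of 2 are assumptions chosen for the example. The delta uses the common regression formula d_t = sum_n n (c_{t+n} - c_{t-n}) / (2 sum_n n^2), with edge padding at the sequence boundaries.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames, one frame per row."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row i holds samples [i * hop, i * hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def rms_energy(frames):
    """Root-mean-square energy of each frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))


def delta(features, width=2):
    """Regression-based delta of a (time, dim) feature matrix.

    The sequence is padded by repeating its first and last rows, so the
    first and last `width` output rows are only approximate.
    """
    features = np.asarray(features, dtype=float)
    n_steps = features.shape[0]
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features)
    for n in range(1, width + 1):
        ahead = padded[width + n: width + n + n_steps]
        behind = padded[width - n: width - n + n_steps]
        out += n * (ahead - behind)
    return out / denom
```

For a signal sampled at 16 kHz, `frame_signal(x, 400, 160)` gives 25 ms frames every 10 ms, and `delta` can be applied to any per-frame feature matrix (for example MFCCs computed by a separate tool) to obtain its differential counterpart.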

Since the collected data are real-world, they are contaminated by gross errors and are also temporally misaligned, i.e. temporal discrepancies manifest amongst the observation sequences. In practice, gross errors arise from device artifacts (e.g. pixel corruptions, sonic artifacts), missing and incomplete data (e.g. partial image texture occlusions), or feature extraction failure (e.g. incorrect object localisation, tracking errors). These errors rarely follow a Gaussian distribution. Furthermore, asynchronous sensor measurements (e.g. lag between audio and visual sensors) and viewpoint changes result in temporally misaligned sets of data. To handle such noisy and temporally misaligned data adequately, several novel methods have been developed within this project for 1) denoising and correction of audiovisual features (e.g. facial landmark point correction), 2) temporal alignment of audiovisual data, and 3) robust feature fusion in the presence of gross noise. The extensive experimental results in conflict recognition indicate that 1) the robust machine learning methods are able to provide accurate audiovisual features for conflict characterisation and 2) combined audio and visual features detect conflict more accurately than features that rely on a single modality (i.e. either audio or video). It is worth mentioning that the developed methods are general purpose and can be employed for various audio, visual and behaviour computing tasks where real-world noisy data need to be analysed.
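To make the notion of temporal alignment concrete, the sketch below implements classic dynamic time warping (DTW) between two feature sequences. DTW is used here only as a familiar stand-in: it is not one of the novel robust methods developed in the project, and it does not handle gross errors. The function name `dtw_path` and the use of Euclidean frame distances are assumptions made for the example.

```python
import numpy as np


def dtw_path(a, b):
    """Align sequences a (n frames) and b (m frames) with classic DTW.

    Returns the total alignment cost and the warping path as a list of
    (index_in_a, index_in_b) pairs, from the first to the last frame.
    """
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    n, m = len(a), len(b)

    # Pairwise Euclidean distances between frames of a and frames of b.
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    # acc[i, j] = minimal cost of aligning a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = dist[i - 1, j - 1] + best_prev

    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return float(acc[n, m]), path
```

Calling `dtw_path(audio_feats, visual_feats)` on two per-frame feature matrices of equal dimensionality returns the frame pairs that the warping matches, which can then be used to resample one stream onto the time axis of the other. Because every frame distance contributes equally to the cost, a few grossly corrupted frames can distort the path, which is the kind of failure the robust methods described above are meant to avoid.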

All the data and the majority of the software developed within this project will be available through and the researchers’ personal website