Final Report Summary - ANASID (Analysing social interactions at a distance)
Data collection
The original data for the project was recorded at Idiap Research Institute and is called the Idiap Poster Data. This data was processed to remove visible faces from the videos and to mask out areas of the image frame that were not within the recorded area. Subsequently, 82 clips of 10 s each were selected and annotated for conversing groups, head pose, body orientation, and position.
To collect data of people conversing in groups, with audio information, we collaborated with other researchers in the group on a themed Master's thesis project that allowed many Master's students to work together on different but related themes. This became a speed dating event, organised on campus, which involved both speed dating phases and a party at the end. Two overhead cameras were used to record two speed dates at a time. We borrowed some Android mobile phones from other researchers in the faculty and also installed our recording app on the phones of the participants themselves, so that we could record the audio of people while they were speaking. Unfortunately, despite pre-experiment tests, most of the audio recordings failed.
Due to the problems of checking sound quality and ensuring that everything was still recording, we purchased and used a wireless microphone recording system. In late 2012, we took advantage of research activities at a neighbouring research group on distributed wireless systems at the VU University of Amsterdam. In collaboration with them, we were able to access much larger crowds of socialising people than were available to us as a single institution. In addition, the collaboration enabled us to pool our hardware resources so that we could record social data combining wearable sensors, video, and audio. A number of test runs were carried out to ensure that the hardware was working correctly, and we also used these as opportunities to collect diverse data. First, when testing the wearable sensors (specifically the accelerometer in the device), we recorded the inaugural lecture of Prof Krose. Sensors were distributed among other professors at the event, as well as friends and family.
Then, we organised a more controlled experiment at the Intertain lab at the VU University of Amsterdam. This involved organising an event for 32 unacquainted people to meet, learn more about each other, form teams and take part in a quiz. From this event, we were able to record audio, video, accelerometer readings, internal positioning, and proximity information from each person. The microphones had to be switched to different participants part way through the recording to maximise the diversity of the data. There were also problems with the quality of the video recordings: the fisheye cameras did not record the frames smoothly because of buffering effects, which may have been due to the dimmer lighting conditions straining the central processing unit (CPU) during compression of the recorded videos. Subsequently, we used the data to test the hypothesis that body motion alone could be used to model the behaviour of speakers and listeners. This data set has been synchronised, and parts of it have been annotated for when someone is speaking, laughing, stepping / walking, gesturing (head or hand), or drinking.
Research achievements summary
Research was carried out on the novel problem of detecting conversing groups. We proposed treating groups as maximal cliques in an edge-weighted graph, where the people in the scene are the nodes and the distance between each pair of people is represented by a measure of affinity on the edge between them. Using a method called dominant sets, we were able to identify the conversing groups well. We extended the method by adding a stopping criterion to the recursive procedure.
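The idea can be illustrated with a minimal sketch, under stated assumptions rather than as the project's implementation. Dominant sets are commonly extracted with replicator dynamics, which is what is used here. The exponential affinity, the parameter `sigma`, the `support` threshold, and the `min_cohesion` stopping rule are all assumptions made for illustration; in particular, the stopping rule is a simple stand-in and not the criterion developed in the project.

```python
import math

def affinity_matrix(positions, sigma=1.0):
    """Pairwise affinities that decay with distance (assumed form), zero diagonal."""
    n = len(positions)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d = math.dist(positions[i], positions[j])
                A[i][j] = math.exp(-d / (2.0 * sigma ** 2))
    return A

def replicator_dynamics(A, iters=1000, tol=1e-8):
    """Iterate x_i <- x_i * (Ax)_i / (x^T A x), starting from uniform weights."""
    n = len(A)
    x = [1.0 / n] * n
    cohesion = 0.0
    for _ in range(iters):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        cohesion = sum(x[i] * Ax[i] for i in range(n))
        if cohesion <= 0.0:
            break
        new_x = [x[i] * Ax[i] / cohesion for i in range(n)]
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    return x, cohesion

def detect_groups(positions, sigma=1.0, support=1e-4, min_cohesion=0.3):
    """Peel off one dominant set at a time until the assumed stopping rule fires."""
    A = affinity_matrix(positions, sigma)
    remaining = list(range(len(positions)))
    groups = []
    while len(remaining) > 1:
        sub = [[A[i][j] for j in remaining] for i in remaining]
        x, cohesion = replicator_dynamics(sub)
        members = [p for p, w in zip(remaining, x) if w > support]
        # Assumed stopping rule: stop when the extracted set is too small or too loose.
        if len(members) < 2 or cohesion < min_cohesion:
            break
        groups.append(members)
        remaining = [p for p in remaining if p not in members]
    return groups

# Hypothetical floor positions in metres: three people close together, two further away.
people = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (10.0, 10.0), (10.5, 10.0)]
groups = detect_groups(people)
```

With these illustrative positions, the sketch separates the three nearby people from the distant pair. Suitable values for `sigma` and `min_cohesion` depend on the scale of the scene and would need tuning on real data.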
We achieved state-of-the-art accuracy of 86.83 % F-measure per group, compared to 76.57 % F-measure for an existing method (Yu et al., CVPR 09), which uses modularity cut clustering, when using only the manually labelled positions of people. When using both position and body orientation, the performance was better, achieving 92.24 % F-measure for our method, compared to 92.02 % F-measure for the method of Yu et al. (Yu et al., CVPR 09). Through this exercise, we also derived a method of estimating a person's direction of orientation based on just their position relative to others. This achieved a mean error of 12.9 degrees with a standard error of 14.3 degrees on our dataset. A noise analysis also showed that our method was significantly more robust than the method of Yu et al. (Yu et al., CVPR 09) when the positions of people were jittered by increasing amounts of noise.
Later, our method was compared to the conversing group detection method of Cristani et al. (Cristani et al., BMVC 2011). This work was carried out in collaboration with Cristani's group and has been submitted to WIAMIS 2013. Comparing our methods on each other's respective data sets highlighted some interesting deficiencies in both methods. Cristani's method relies on a stochastic voting procedure in which the estimated direction of orientation of each person generates a number of noisy estimates of where the centre of the conversing group that they are involved in could be. By additionally applying a criterion to ensure that the centre of a conversing group does not contain any people, they are able to identify people who are in a group. Our comparative results showed that our method performed better when using just position information (87 % F-measure per frame for our method, versus 71 % for Cristani's method). Using position and head orientation information, their method outperformed ours (94 % versus 92 %). The findings suggest that we should consider a method that combines the strengths of both and is robust to errors in both position and head pose estimates.
The next task we addressed was estimating speakers. The project originally proposed to harness motion features for estimating speakers in the groups, but the video data was ultimately too noisy for any feature extraction, so we harnessed motion from the single body-worn accelerometer that was hung around the neck of each participant. This led to a number of fruitful experimental outcomes. The first was in detecting speaking status from acceleration only. Spectral features were extracted and then Hidden Markov Models were trained on both the positive and negative class for each activity. Experiments showed very good results for speaker-independent estimation of speaking (64, 82 and 72 % precision, recall and F-measure respectively). High precision (100 %) but low recall (21, 21 and 38 %) was achieved when estimating stepping, drinking, and laughter respectively. The results from these experiments have been submitted to UBICOMP 2013. A follow-up study to analyse the feasibility of using these cues to detect social groups shows great potential for further research: people in the same interacting group tend to speak at the same time less often, and also to speak and laugh at the same time less often. Speaking by one person co-occurs highly with stepping by others in the same group, compared to people who are not in the same group. Therefore, even though the recall for some of the activities, such as stepping, drinking, and laughter, was low, the precision was very good, so these cues can probably be used as a reliable indicator of being in the same conversing group.
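As a quick consistency check on the figures above, the F-measure is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall for speaking; the function name `f_measure` is chosen here for illustration only.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Reported precision and recall for speaking detection.
speaking_f = f_measure(0.64, 0.82)
speaking_f_percent = round(100 * speaking_f)
```

For a precision of 64 % and a recall of 82 %, this gives roughly 72 %, which agrees with the reported F-measure for speaking.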
In terms of estimating social aspects of people's behaviour, the speed date data was used to conduct experiments on estimating attraction. For this data, the video was of sufficient quality to extract motion features related to how each person moved. By tracking each person's centroid over time, features derived from their motion were extracted and used to estimate attraction levels. Our results showed performance that was significantly above the baseline, achieving a classification accuracy of 70 % (59 % baseline) when predicting whether the man was attracted to the woman, and 69 % (56 % baseline) when predicting whether the woman was attracted to the man. In both cases, the variance in motion of the woman during the date was the feature that best predicted the attraction. This work was published and presented at the ICCV workshop on 'Social surveillance'.
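One plausible way to compute a variance-in-motion feature from a tracked centroid is sketched below. The report does not specify the exact definition, so taking the variance of frame-to-frame displacement magnitudes is an assumption, and the function name `motion_variance` and the example track are hypothetical.

```python
import math
import statistics

def motion_variance(centroids):
    """Variance of frame-to-frame displacement magnitudes along a centroid track.

    `centroids` is a sequence of (x, y) image positions, one per frame.
    This definition of "variance in motion" is an assumption for illustration.
    """
    steps = [math.dist(a, b) for a, b in zip(centroids, centroids[1:])]
    if not steps:
        return 0.0
    return statistics.pvariance(steps)

# Hypothetical track: the person moves, pauses, then moves again.
track = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
feature = motion_variance(track)
```

A person who stays still, or who moves at a perfectly constant speed, yields a value of zero under this definition, while irregular movement yields larger values.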