Suspicious and Abnormal Behaviour Monitoring Using a Network of Cameras & Sensors for Situation Awareness Enhancement

Final Report Summary - SAMURAI (Suspicious and Abnormal Behaviour Monitoring Using a Network of Cameras & Sensors for Situation Awareness Enhancement)

Executive Summary:
SAMURAI has developed and integrated an innovative intelligent surveillance system for robust monitoring of both the inside and the surrounding areas of a critical public infrastructure site (an airport terminal). SAMURAI has four significant novelties that make it distinctive from other recent and ongoing relevant activities both in the EU and elsewhere. They are: (1) SAMURAI has developed a system that employs networked heterogeneous sensors rather than CCTV cameras alone, so that multiple complementary sources of information are fused to create a visualisation of a more complete ‘big picture’ of a crowded public space. (2) Existing systems focus on analysing recorded video using pre-defined hard rules and suffer from unacceptable false alarm rates. SAMURAI has developed an adaptive behaviour profiling and abnormality detection system for alarm event alert and prediction with a much reduced false alarm rate. (3) SAMURAI has developed novel algorithms for distributed non-overlapping multi-camera people re-identification in a large public space, critical for enabling global situational awareness by ‘connecting the dots’ from different camera views. (4) SAMURAI has developed active learning mechanisms for man-in-the-loop user feedback input from control room operators for hybrid context-aware abnormal behaviour recognition. This is in contrast to current video behaviour recognition systems that rely purely on information extracted from the video data, which is often too ambiguous to be effective.

The technologies developed by SAMURAI have greatly extended the capabilities of the existing video analytics and intelligent surveillance systems currently on the market. Specifically, the SAMURAI project has (a) developed innovative tools and systems for people, vehicle and luggage detection, tracking, re-identification, and type categorisation across a network of cameras under real world conditions; (b) developed an abnormal behaviour detection system based on a heterogeneous sensor network consisting of both fix-positioned CCTV cameras and mobile wearable cameras with Wi-Fi positioning sensors, in which the networked heterogeneous sensors function cooperatively to provide enhanced situation awareness; (c) developed innovative tools using multi-modal data fusion and visualisation of heterogeneous sensor input to enable more effective control room operator queries.

Project Context and Objectives:
European citizens and society are facing immediate and immense security challenges, in particular following the 2004 Madrid train bombing and the 2005 London transport bombing. Governments and taxpayers across the EU are being called on to provide unprecedented resources for maintaining public safety and security. It has been realised that innovative technologies, particularly computer vision and video-based surveillance, have the potential to assist in addressing this challenge.

The aim of SAMURAI is to develop and integrate an innovative surveillance system for robust monitoring of both inside and surrounding areas of a critical public infrastructure site. SAMURAI has three significant novelties that make it distinctive from other recent and ongoing relevant activities both in the EU and elsewhere:
(1) SAMURAI is to employ networked heterogeneous sensors rather than CCTV cameras alone so that multiple complementary sources of information can be merged to create a visualisation of a more complete 'big picture' of a crowded public space.
(2) Existing systems focus on analysing recorded video using pre-defined hard rules and suffer from unacceptable false alarm rates. SAMURAI is to develop a real-time adaptive behaviour profiling and abnormality detection system for alarm event alert and prediction with a much reduced false alarm rate.
(3) In addition to fix-positioned CCTV cameras, the SAMURAI system will also take command input from control room operators and mobile sensory input from patrolling security staff for hybrid context-aware abnormal behaviour recognition. This is in contrast to current video behaviour recognition systems that rely purely on information extracted from the video data, which is often too ambiguous to be effective. The main concept of the SAMURAI system is illustrated in the SAMURAI concept diagram (see Figure 1 in the attachment).

SAMURAI aims to develop underpinning capabilities and tools for an innovative intelligent surveillance system to be used for the monitoring of both inside and surrounding areas of a critical public site (e.g. an airport concourse, an underground platform). The novelties of SAMURAI, which distinguish this project from other relevant activities both in the EU and elsewhere, are:
1. SAMURAI employs networked heterogeneous sensors rather than isolated visual sensors (e.g. standalone CCTV cameras) so that multiple complementary sources of information are fused to visualise a more complete ‘big picture’ of the area under surveillance.
2. SAMURAI advances intelligent video analytics beyond current post-incident analysis of a criminal/terrorist activity in recorded data. SAMURAI develops an online adaptive behaviour monitoring system for real-time abnormal behaviour detection and triggering of context-aware alerts in assisting the prevention of crime.
3. SAMURAI integrates fix-positioned CCTV video input with control room operator queries and mobile sensory input from patrolling staff for more effective man-in-the-loop decision-making. This is in contrast to current systems relying purely on processing sensory data.

Specifically, SAMURAI makes the following innovative and critical contributions:

• Robust detection using a distributed camera network: SAMURAI will develop and integrate underpinning capabilities to enable robust detection, categorisation and tagging continuously across a distributed network of cameras over space and time under different weather/lighting conditions. The targets of these processes include people (body appearance and faces), vehicle (appearance and type) and luggage. The measurable objectives are:
a. To develop a background subtraction algorithm which will be robust and reliable under all weather and lighting conditions;
b. To develop algorithms for robust detection and categorisation of people, vehicles, and luggage which will cope with occlusion and large variation of appearance through changes in viewpoint and scale.

• Real-time distributed detection of abnormal behaviour: SAMURAI will develop a behaviour monitoring model for abnormal behaviour detection in different camera views given by a distributed heterogeneous sensor network consisting of fixed (standard, PTZ, possibly infrared) and mobile cameras, and wearable audio/video sensors. The heterogeneous sensors will provide selective abnormal behaviour analysis and reduce significantly false alarms in the detection and prediction of alert situations. The measurable objectives are:
a. To develop a model for establishing temporal and spatial correspondence of object whereabouts across different camera views using multi-modal sensors;
b. To develop a behaviour model that can cope with changes in context/viewpoint and definitions of abnormality, and with online adaptive abnormal behaviour detection;
c. To develop tools and techniques for machine-assisted focus of attention using PTZ cameras, wearable audio and positioning sensors with cameras.

• Acquisition, analysis, and fusion of multi-source information: SAMURAI will develop a framework to facilitate the acquisition, analysis and fusion of information from multiple and distributed data sources. Its primary objective is to transform a large volume of information into useful knowledge that can be presented in a timely fashion to an operator or control system. The measurable objectives are:
a. To determine the level of spatial and temporal registration and alignment of multi-modal data required to provide enhanced situation awareness of the SAMURAI system;
b. To develop a coherent and robust methodology for 3D visualisation and interpretation that enables a user to understand, and respond to, an evolving situational awareness picture;
c. To develop and integrate a data fusion architecture/algorithm that can provide effective and efficient data selection, and is capable of identifying and correcting data conflicts and system errors;
d. To develop and integrate a set of tools and techniques for the design and implementation of effective system user interfaces.

• Rigorous assessment against user requirements and existing systems: Assess and evaluate the technical advances from the above SAMURAI capabilities in the context of external user requirements through consultations with the independent User Advisory Group. The measurable objectives are:
a. To organise User Consultation Workshops and Final Project Workshop;
b. To assess working laboratory-based systems against user requirements, on both recorded data collected from consortium user partner sites and from live input;
c. To evaluate the developed SAMURAI system against existing state-of-the-art intelligent surveillance systems to demonstrate the advantages achieved in this project in terms of higher detection rate of abnormal behaviour, lower false alarm rate, and enhanced real-time performance.

Project Results:
The project has achieved the following scientific and technological results/foregrounds.

(A) Object detection and categorisation

The objective of Work Package 2 is to develop and integrate tools and techniques for performing robust and accurate object detection and categorization from real-world surveillance videos. The work package consists of three tasks:
1. Task 2.1: reliable background subtraction under real conditions
2. Task 2.2: robust detection and categorization of people from a wide-area distance by appearance and action
3. Task 2.3: robust detection and categorization of luggage and vehicles under occlusions, variable lighting and weather conditions

Background subtraction (Task 2.1) is a standard operation for the automated video-surveillance scientific community. Its aim is to highlight every moving object (the foreground, FG) which could represent an interesting entity (person, luggage, and vehicle) that needs to be further investigated. In practice, this operation is carried out by a software filter, i.e. an algorithm, operating on the video-sequences coming from each camera; for the SAMURAI goals, the algorithm needs to be developed for both outdoor and indoor environments, under variable lighting conditions with possible sudden changes. Different weather conditions and related effects are to be modeled as background. More specifically, a BG model is developed based on total adaptation: the features or parameters of a scene and moving objects are selected on-line, resulting in only effective features used for background/foreground discrimination.

In order to approach SAMURAI’s goals, in the first two years we developed two different strategies named ASTNA 2.4 and IMBS 2.5. The first is a hybrid per-pixel and per-region approach. It is known from the literature that background subtraction methods based on joint pixel and region analysis are effective in discovering foreground objects in cluttered scenes. Typically, per-pixel foreground detection is contextualized in a local neighborhood region in order to limit false alarms. However, these methods have a heavy computational cost, which depends on the size of the surrounding region for each pixel. The method we developed is an original and efficient joint pixel-region analysis technique that automatically selects the sampling rate with which pixels in different areas are checked, while adapting the size of the neighborhood region considered.

The second strategy is a real-time non-recursive pixel-based technique based on a statistical analysis of a buffer of frames. Background is modeled as the set of clusters maximally stable throughout this buffer. For each background pixel, a set of possible color values is learnt. A pixel p belongs to the foreground if its value differs from those admissible for the background. Such a solution allows for managing short-term background variations, namely noise in sensor data, changes in illumination conditions and movement of small background elements. The background long-term variations are dealt with by re-computing the model (i.e. analyzing a new subsequence) with a certain frame frequency.
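
As an illustration of the buffer-based idea described above, the following is a minimal sketch and not the IMBS implementation itself. It assumes greyscale frames, a fixed tolerance `tol`, a cap of `max_modes` clusters per pixel, and a `min_support` count standing in for cluster stability; all of these names and values are hypothetical.

```python
import numpy as np


def build_background_model(frames, tol=10.0, min_support=2, max_modes=3):
    """Cluster the values seen at each pixel over a buffer of frames.

    frames: array of shape (N, H, W), greyscale (an assumption of this sketch).
    Returns (centers, valid): cluster centres per pixel and a mask marking
    the clusters that were observed at least `min_support` times.
    """
    frames = np.asarray(frames, dtype=np.float64)
    _, h, w = frames.shape
    centers = np.zeros((max_modes, h, w))
    counts = np.zeros((max_modes, h, w), dtype=np.int64)

    for frame in frames:
        assigned = np.zeros((h, w), dtype=bool)
        # First try to match each pixel value to an existing cluster.
        for k in range(max_modes):
            match = (~assigned) & (counts[k] > 0) & (np.abs(frame - centers[k]) <= tol)
            new_count = counts[k] + 1
            # Running-mean update of the matched cluster centres.
            centers[k] = np.where(match, centers[k] + (frame - centers[k]) / new_count, centers[k])
            counts[k] = np.where(match, new_count, counts[k])
            assigned |= match
        # Unmatched values open a new cluster in the first free slot, if any.
        for k in range(max_modes):
            free = (~assigned) & (counts[k] == 0)
            centers[k] = np.where(free, frame, centers[k])
            counts[k] = np.where(free, 1, counts[k])
            assigned |= free

    valid = counts >= min_support
    return centers, valid


def foreground_mask(frame, centers, valid, tol=10.0):
    """A pixel is foreground if it is not close to any admissible background value."""
    frame = np.asarray(frame, dtype=np.float64)
    close = valid & (np.abs(frame[None] - centers) <= tol)
    return ~close.any(axis=0)
```

In such a sketch, re-running `build_background_model` on a fresh buffer at a chosen interval would play the role of the long-term model update mentioned above.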

We tested the capabilities of the two approaches on a standard benchmark dataset, concentrating on those test videos that present the kind of problems SAMURAI aims to deal with. The results show that our techniques outperform the state of the art in terms of both accuracy and efficiency, and that IMBS shows the highest performance. Therefore, we developed a real-time version of IMBS. This part is discussed in Deliverable 2.1. A review of the most important methods for background subtraction has been published in:
"M. Cristani, M. Farenzena, D. Bloisi, V. Murino, Background Subtraction for Automated Multisensor Surveillance: a Comprehensive Review, EURASIP Journal on Advances in Signal Processing, vol. 2010, Article ID 343057, 2010."

IMBS cannot handle reflections of moving objects on the floor. To address this, we concentrated on improving its effectiveness by focusing on reflection elimination, and developed two strategies. The first is the implementation of a symmetry detection method based on SIFT descriptors. The idea is that the interest points in the original image should be replicated in the reflected part, creating a symmetry relation that is discovered by a graph-based algorithm. The second strategy is based on the structural properties of the reflection of objects. Depending on the floor material, the reflection will be blurry, preserving to some extent an altered version of the color components, yet it has little structure. We therefore analyze the structure of the region containing both the object and its reflection in order to separate them from each other. The results are good, but much work is needed before the technique can be applied in real crowded cases. This part of the work is described in Deliverable 2.2.

Regarding people detection and categorization (Task 2.2), a people detector was developed for wide-area, long-distance applications that is robust to lighting conditions and occlusions. The goal is to reveal the presence of a pedestrian by drawing a bounding box around each person. We approach this goal by proposing a human detection module that operates on a set of reliable features for people representation. Among the different features considered for feeding a discriminative classifier (i.e. boosting), covariance matrix features have been exploited as powerful descriptors of pedestrians, and their effectiveness has been shown by several comparative papers in the state-of-the-art literature. This part is discussed in Deliverable 2.1 and produced the following publication:
"D. Tosato, M. Farenzena, M. Spera, V. Murino, M. Cristani. Multi-class Classification on Riemannian Manifolds for Video Surveillance, European Conference on Computer Vision, 2010."

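To make the covariance features mentioned above concrete, here is a hedged sketch of a region covariance descriptor and of a distance between two such descriptors. The feature set (pixel coordinates, intensity, absolute gradients), the regulariser `eps`, and the choice of a log-Euclidean distance are assumptions made for illustration; the cited paper should be consulted for the features and classifier actually used.

```python
import numpy as np


def region_covariance(patch, eps=1e-6):
    """Covariance descriptor of a greyscale patch.

    Each pixel contributes the feature vector (x, y, I, |dI/dx|, |dI/dy|);
    this particular feature set is an illustrative assumption.
    """
    patch = np.asarray(patch, dtype=np.float64)
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    gy, gx = np.gradient(patch)  # derivatives along rows, then columns
    feats = np.stack([xs, ys, patch, np.abs(gx), np.abs(gy)], axis=0).reshape(5, -1)
    cov = np.cov(feats)  # rows are variables, columns are pixels
    # Small ridge keeps the matrix strictly positive definite.
    return cov + eps * np.eye(5)


def spd_log(m):
    """Matrix logarithm of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(m)
    return (vecs * np.log(vals)) @ vecs.T


def log_euclidean_distance(a, b):
    """Distance between two covariance descriptors in the log domain."""
    return float(np.linalg.norm(spd_log(a) - spd_log(b), ord="fro"))
```

Covariance matrices do not form a vector space, which is why the sketch compares them after a matrix logarithm rather than by plain subtraction.
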
In the subsequent period, we focused on enhancing the robustness and the efficiency of the detector. To this end, a new part-based human layout was introduced, which is based on the concept of human body parts. At the same time, we generalized the previous methods into a general architecture for detection and classification in video surveillance that we have called ArCO (ARray of COvariances), employing it to detect pedestrians and classify their head orientation. This version has been employed for the detection of people in the SAMURAI test cases; it is described in Deliverable 2.2 and led to the following publications:
"D. Tosato, M. Farenzena, M. Cristani and V. Murino. Part-based human detection on Riemannian manifolds. IEEE International Conference on Image Processing, 2010."

"D. Tosato, M. Farenzena, V. Murino, and M. Cristani. A re-evaluation of pedestrian detection on Riemannian manifolds, International Conference on Pattern Recognition, 2010."

The capability of classifying small objects with high precision allowed us to perform very subtle analysis on the detected pedestrians, for example classifying their head orientation. This, connected to a tracking framework published in:
"L. Bazzani, M. Cristani, M. Bicego, V. Murino. Subjective online feature selection for occlusion management in tracking applications, International Conference on Image Processing, 2009."

"L. Bazzani, M. Cristani and V. Murino. Collaborative particle filters for group tracking. IEEE International Conference on Image Processing, Hong Kong, September 2010."

"L. Bazzani, D. Bloisi, V. Murino. A Comparison of Multi Hypothesis Kalman Filter and Particle Filter for Multi-target Tracking, CVPR Workshop on Performance Evaluation of Tracking and Surveillance, Miami, 2010."
allows us to determine, at each frame, the position of people and their subjective view frustum (SVF). It is well known that deliberately directing the SVF to a certain object for a certain time may reflect an attentional mechanism, aimed at continuously updating the object’s properties in human memory. Following this principle, we exploit the SVF as a means to detect the areas of the scene at which observers look the most, creating so-called interest maps. The research has been published in

"M. Farenzena, L. Bazzani, and V. Murino, M. Cristani. Towards a Subject-Centered Analysis for Automated Video Surveillance, International Conference on Image Analysis and Processing, 2009."

The Subjective View Frustum can be employed as a tool to uncover the visual dynamics of interactions among two or more people. Such an analysis relies on a few assumptions with respect to social cues. First, the entities involved in social interactions stand closer than 2 meters. Second, it is generally accepted that initiators of conversations often wait for visual cues of attention, in particular the establishment of eye contact, before launching into their conversation during unplanned face-to-face encounters; in this sense, the SVF may be employed to infer whether eye contact occurs between close subjects. This happens with high probability when the following conditions are satisfied: 1) the subjects are closer than 2 meters; 2) their SVFs overlap; and 3) their heads are positioned inside the reciprocal SVFs. The development of this algorithm, which produced the so-called Inter-Relation Pattern Matrix data structure, made it possible to infer whether two or more people are interacting. The results of this research have been reported in Deliverable 2.1 and have been published in

"M. Farenzena, A. Tavano, L. Bazzani, D. Tosato, G. Paggetti, G. Menegaz, V. Murino, M. Cristani. Social Interactions by Visual Focus of Attention in a Three-Dimensional Environment, Workshop on Pattern Recognition and Artificial Intelligence for Human Behaviour Analysis, 2009."

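The three eye-contact conditions described above can be sketched on the ground plane as follows. This is a simplified 2D illustration, not the project's Inter-Relation Pattern Matrix code: the frustum is approximated by a cone with a hypothetical half-angle and range, and in this simplification condition 2 (overlapping SVFs) is implied once each head lies inside the other's cone.

```python
import math


def in_view_frustum(observer_xy, heading_rad, target_xy,
                    half_angle_rad=math.radians(30), max_range=5.0):
    """True if the target lies inside the observer's 2D view cone.

    The 30 degree half-angle and 5 m range are illustrative assumptions.
    """
    dx = target_xy[0] - observer_xy[0]
    dy = target_xy[1] - observer_xy[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0 or dist > max_range:
        return False
    bearing = math.atan2(dy, dx)
    # Wrap the angular difference into [-pi, pi).
    diff = (bearing - heading_rad + math.pi) % (2.0 * math.pi) - math.pi
    return abs(diff) <= half_angle_rad


def likely_eye_contact(pos_a, heading_a, pos_b, heading_b, max_distance=2.0):
    """Check the proximity and mutual-visibility conditions for two people."""
    close_enough = math.dist(pos_a, pos_b) < max_distance
    a_sees_b = in_view_frustum(pos_a, heading_a, pos_b)
    b_sees_a = in_view_frustum(pos_b, heading_b, pos_a)
    return close_enough and a_sees_b and b_sees_a
```

For example, two people one meter apart and facing each other satisfy the check, while the same pair facing the same direction does not.
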
The social interaction discovery method based on the Inter-Relation Pattern Matrix relies on the definition of a heuristic, namely Hall's proxemics. This has proved to be valid in general, but a more sophisticated analysis can be done with learning methods. In particular, we proposed two new methods for social interaction discovery that learn from the data instead of using heuristics. The first method analyzes quasi-stationary people in an unconstrained scenario, identifying those subjects engaged in a face-to-face interaction, using the sociological concept of F-formation as defined by Adam Kendon in the late 1970s. Simply speaking, F-formations are spatial patterns maintained during social interactions by two or more people. Quoting Kendon, “an F-formation arises whenever two or more people sustain a spatial and orientational relationship in which the space between them is one to which they have equal, direct, and exclusive access”. In our research, we design an F-formation recognizer based on a Hough-voting strategy. It lies between an implicit shape model, where weighted local features vote for a location in the image plane, and a plain generalized Hough procedure, where the number of local features need not be fixed as it is in the implicit shape model. This approach provides an estimate of the F-formations as well as of the identity of the persons that form them, thus identifying the people who are socially interacting. This contribution is described in Deliverable 2.2 and has been published in

"M. Cristani, L. Bazzani, G. Paggetti, A. Fossati, A. Del Bue, D. Tosato, G. Menegaz, V. Murino. Social interaction discovery by statistical analysis of F-formations, British Machine Vision Conference, Dundee, UK, 2011."

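A much simplified sketch of the voting idea behind an F-formation recognizer: each person casts a single vote for the centre of the shared space in front of them, and grid cells that collect votes from two or more people are reported as candidate F-formations. The `stride` and `cell` values, the single unweighted vote per person, and the function name are assumptions for illustration only; see the cited paper for the actual voting scheme.

```python
import math
from collections import defaultdict


def detect_f_formations(people, stride=0.8, cell=0.5):
    """Group people whose votes for a shared centre fall in the same grid cell.

    people: dict mapping an identifier to (x, y, theta), with positions in
    meters on the ground plane and theta the body or head orientation in radians.
    stride: assumed distance from a person to the centre of the shared space.
    cell: side length of the accumulator cells.
    """
    accumulator = defaultdict(set)
    for pid, (x, y, theta) in people.items():
        # Project a vote in front of the person along their orientation.
        cx = x + stride * math.cos(theta)
        cy = y + stride * math.sin(theta)
        key = (math.floor(cx / cell), math.floor(cy / cell))
        accumulator[key].add(pid)
    # A cell supported by at least two people is a candidate F-formation.
    return [sorted(members) for members in accumulator.values() if len(members) >= 2]
```

A hard grid like this is sensitive to votes that fall near cell boundaries, which is one reason practical voting schemes smooth or weight the votes.
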
The second method for interaction analysis, which strictly depends on the first one, focuses on one of the most important aspects of proxemics, namely the relationship between physical and social distance. We show that interpersonal distance (measured automatically using computer vision techniques) provides physical evidence of the social distance between two individuals, i.e. of whether they are simply acquainted, friends, or involved in a romantic relationship. This work is described in Deliverable 2.2 and led to the following publication:

"M. Cristani, G. Paggetti, A. Vinciarelli, L. Bazzani, G. Menegaz, V. Murino. Towards Computational Proxemics: Inferring Social Relations from Interpersonal Distances, SOCIALCOM, Boston, USA, 2011."

For luggage and vehicle detection and categorization (Task 2.3), it is crucial to devise a context model to automatically quantify and select the most effective contextual information for assisting in detecting the target object. To this end, a novel context modeling framework is proposed without the need for any prior scene segmentation or context annotation. We formulate a polar geometric context descriptor for representing multiple types of contextual information. In order to quantify context, we propose a new maximum margin context (MMC) model to evaluate and measure the usefulness of contextual information directly and explicitly through a discriminant context inference method. Furthermore, to address the problem of context learning with limited data, we exploit the idea of transfer learning, based on the observation that although two categories of objects can have very different visual appearance, there can be similarity in their context and/or the way contextual information helps to distinguish target objects from non-target objects. To that end, two novel context transfer learning models are proposed which utilize training samples from source object classes to improve the learning of the context model for a target object class based on a joint maximum margin learning framework. The work is described in

"W. Zheng, S. Gong, T. Xiang. Quantifying contextual information for object detection, IEEE International Conference on Computer Vision, Kyoto, Japan, 2009."

"W. Zheng, S. Gong, T. Xiang. Quantifying and transferring contextual information in object detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011."

In summary, the project partners have tackled all three tasks as described in the Description of Work.

(B) Behaviour monitoring

Science and technology achievements within WP3 can be broken down into contributions under each of the three main tasks: Task 3.1) multi-camera object registration; Task 3.2) multi-camera object tagging or re-identification; and Task 3.3) behavior profiling and abnormality detection.

Multi-Camera Registration

In a system such as SAMURAI, public spaces are monitored by multiple cameras with different fields of view. It is therefore necessary to calibrate and register the cameras to a common coordinate frame in order to reason about global positions of objects. By calibration we mean evaluating the geometry of the cameras, i.e. the geometrical properties of the imaging process of each camera (intrinsic parameters); by registration we mean estimating the rotation R and the translation vector T between pairs of cameras (extrinsic parameters). Optical calibration allows the determination of camera geometry by viewing patterns in the scene whose geometry and size are known. Calibration normally requires specialist competence. However, we have developed a tool, Caliabra, which makes the task easier: it reduces the specialist competence required, standardizes the process by defining the steps the user must perform, and allows sharing of calibration data between relevant parties or software modules. Caliabra was tested in various real world surveillance scenarios, notably in the calibration of the ELSAG and BAA T3 sites used for the production of SAMURAI Demos 1 and 2.
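
For illustration, once the extrinsic parameters between two cameras are known, a 3D point expressed in one camera's frame can be mapped into the other's. The sketch below assumes R is stored as a 3×3 rotation matrix and T as a 3-vector, with the convention X_b = R X_a + T; the function names are hypothetical and are not part of Caliabra.

```python
import numpy as np


def rotation_about_z(angle_rad):
    """Rotation matrix for a rotation of angle_rad about the z axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])


def to_other_camera(points_a, R, T):
    """Map points of shape (N, 3) from camera A's frame to camera B's frame.

    Uses the assumed convention X_b = R @ X_a + T, applied row by row.
    """
    points_a = np.asarray(points_a, dtype=np.float64)
    return points_a @ R.T + T


def invert_registration(R, T):
    """Extrinsics of the reverse mapping, from camera B back to camera A."""
    return R.T, -R.T @ T
```

Chaining such pairwise transforms is one way to bring observations from several cameras into a single common coordinate frame.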


Person re-identification involves recognizing an individual in diverse locations as seen in different non-overlapping camera views by considering a large set of candidates. It is assumed that individuals do not change their clothing within the observation period, and that finer biometric cues (face, fingerprint, gait) are unavailable. Re-identification represents a challenging but valuable task in video surveillance scenarios, where long-term activities have to be modeled within a large and structured environment (e.g. airport, metro station). See also the Final System Integration Report (SAMURAI D6.2) and associated Demos to appreciate the critical role that tagging plays in these scenarios. The following subsections give an overview of the re-identification contributions made by SAMURAI.

Symmetry-Driven Accumulation of Local Features (SDALF)

We addressed the appearance-based re-identification problem by proposing a feature extraction and matching strategy based on the localization of perceptually relevant human parts, driven by asymmetry/symmetry principles. We extract three complementary aspects of human appearance: the overall chromatic content, the spatial arrangement of colors into stable regions, and the presence of recurrent local motifs with high entropy. All this information is derived from different body parts, and weighted appropriately by exploiting symmetry and asymmetry perceptual principles. In this way, robustness against very low resolution, occlusions, and pose, viewpoint and illumination changes is achieved. Moreover, the approach applies both to the case where a single image for each candidate is present and to the case where multiple images for each individual (not necessarily consecutive) are available. It is appropriate for situations where the number of candidates varies continuously, and it has been tested on several public benchmark re-identification datasets (VIPeR, iLIDS, ETHZ), obtaining new state-of-the-art performance. Our signature also maintained high performance at resolutions as low as an 11 × 22 pixel window.
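
As a rough illustration of weighting chromatic content by position relative to a symmetry axis, the sketch below builds a histogram in which pixels near a vertical axis count more, and compares two such histograms. The Gaussian weighting, the single-channel input in [0, 1], the Bhattacharyya-based distance, and all names and parameter values are assumptions made for this example; it covers only one of the three appearance cues listed above and is not the SDALF implementation.

```python
import numpy as np


def weighted_histogram(values, axis_x, sigma, bins=16, value_range=(0.0, 1.0)):
    """Histogram of a single-channel body-part image with column weights.

    Pixels close to the vertical axis at column `axis_x` receive larger
    weights through a Gaussian of width `sigma` (an illustrative choice).
    """
    values = np.asarray(values, dtype=np.float64)
    h, w = values.shape
    cols = np.arange(w)
    col_weights = np.exp(-0.5 * ((cols - axis_x) / sigma) ** 2)
    weights = np.broadcast_to(col_weights, (h, w))
    hist, _ = np.histogram(values.ravel(), bins=bins, range=value_range,
                           weights=weights.ravel())
    total = hist.sum()
    return hist / total if total > 0 else hist


def bhattacharyya_distance(p, q):
    """Distance between two normalised histograms (0 means identical)."""
    coefficient = np.sum(np.sqrt(p * q))
    return float(np.sqrt(max(0.0, 1.0 - coefficient)))
```

Down-weighting pixels far from the axis is intended to reduce the influence of background clutter at the edges of a pedestrian bounding box.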

Secondly, we demonstrated the effectiveness of our descriptor as an appearance model in a standard multi-target tracking algorithm. In a tracking task, consecutive shots of the same individual are available, and the goal is to use all the information to enrich the object model, and to use the object model to enhance tracking performance. SDALF takes this situation into account by collecting features from all the available shots, thus augmenting its discriminative properties and serving as an effective human descriptor for tracking. In particular, we consider the well-known CAVIAR sequence dataset and a Bayesian multi-object tracker. We showed that SDALF outperforms classical object descriptors in terms of different tracking quality metrics, such as accuracy (tracking success rate) and false positive and false negative rates.

In conclusion, it is worth noting that our technique is well suited to real scenarios because it operates independently on each individual and does not adopt learning approaches that require a training dataset in advance. We simply extract a set of reliable, robust, and descriptive spatially-localized features which are combined and matched to uniquely identify a person. This may open up a wide range of possible future developments and customizations, including feature selection by boosting, more efficient matching strategies, the management of multiple scales, and the extraction of features from more accurately localized body parts.

For further details, see publications: SAMURAI Deliverable D3.1; Farenzena et al, CVPR, 2010; and forthcoming CVIU journal article.

Asymmetry-based Histogram Plus Epitome (AHPE)

Another research effort has been designing a novel descriptor, especially suited for cases in which multiple images of a single individual are available. We propose a novel appearance-based method for person re-identification, that condenses a set of frames of an individual into a highly informative signature, called the Histogram Plus Epitome, HPE. It incorporates complementary global and local statistical descriptions of the human appearance, focusing on the overall chromatic content via histogram representation, and on the presence of recurrent local patches via epitomic analysis. The re-identification performance of HPE is then augmented by applying it as human part descriptor, defining a structured feature called Asymmetry-based HPE (AHPE). Matching between (A)HPEs provides optimal performance against low resolution, occlusion, pose and illumination variations, achieving state-of-the-art results on all the considered datasets. Our descriptor operates independently on each individual, not embracing discriminative philosophies that imply strong operating requirements.

For further details, see publications: SAMURAI Deliverable D3.2; Bazzani et al, ICPR, 2010 and forthcoming PRL journal article.

Associating Groups of People

In a crowded public space, people often walk in groups, either with others they know or strangers. Associating a group of people over space and time can assist understanding an individual’s behavior, as it provides vital visual context for matching individuals within the group. Seemingly an ‘easier’ task compared with person matching given more and richer visual content, this problem is in fact very challenging because a group of people can be highly non-rigid with changing relative position of people within the group and severe self-occlusions. In this work, for the first time, the problem of matching/associating groups of people over large distances in space and time, and captured in multiple non-overlapping camera views is addressed. Specifically, a novel center rectangular ring and block based ratio-occurrence group descriptor and a group-matching algorithm are proposed. The former addresses changes in the relative positions of people in a group and the latter deals with variations in illumination and viewpoint across camera views. In addition, we demonstrate a notable enhancement on individual person matching by utilising the group description as visual context. The approach is validated using the 2008 i-LIDS Multiple-Camera Tracking Scenario (MCTS) dataset on multiple camera views from a busy airport arrival hall.

For further details, see publications: SAMURAI Deliverable D3.1 and Zheng et al, BMVC, 2009.

Custom Pictorial Structure (CPS)

The re-identification method used to obtain the results in the final demos is a novel methodology based on Pictorial Structures (PS). Whenever face or other biometric information is missing, humans recognize an individual by selectively focusing on the body parts, looking for part-to-part correspondences. We took inspiration from this strategy in a re-identification context, using PS to achieve this objective. For single image re-identification, we adopt PS to localize the parts, then extract and match their descriptors. When multiple images of a single individual are available, we developed a new algorithm to customize the fit of PS on that specific person, leading to what we called a Custom Pictorial Structure (CPS). CPS learns the appearance of an individual, improving the localization of its parts and thus obtaining more reliable visual characteristics for re-identification. It is based on the statistical learning of pixel attributes collected through spatio-temporal reasoning. The use of PS and CPS led to state-of-the-art results on all the available public benchmarks, and opens a new direction for research on re-identification.

For further details see publications: SAMURAI Deliverable D3.2 and Cheng et al, BMVC, 2011.

Probabilistic Relative Distance Comparison (PRDC)

Re-identification is challenging due to the lack of spatial and temporal constraints and large visual appearance changes caused by variations in view angle, lighting, background clutter and occlusion. To address these challenges, most previous approaches aim to extract visual features that are both distinctive and stable under appearance changes. However, most visual features and their combinations under realistic conditions are neither stable nor distinctive, and thus should not be used indiscriminately. In this contribution, we propose to formulate person re-identification as a distance learning problem, which aims to learn the optimal distance that maximises matching accuracy regardless of the choice of representation. To that end, we introduce a novel Probabilistic Relative Distance Comparison (PRDC) model, which differs from most existing distance learning methods in that, rather than minimising intra-class variation whilst maximising inter-class variation, it aims to maximise the probability that a true match pair has a smaller distance than a wrong match pair. This makes our model more tolerant to appearance changes and less susceptible to model over-fitting. Extensive experiments are carried out to demonstrate that 1) by formulating person re-identification as a distance learning problem, clear improvements in matching performance can be obtained compared to conventional person re-identification techniques, with significant improvements observed when the training sample size is small, and 2) our PRDC outperforms not only existing distance learning methods but also alternative learning methods based on boosting and learning to rank.

For further details, see publications: SAMURAI Deliverable D3.2 and Zheng et al, CVPR, 2011.
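
The relative comparison idea behind PRDC can be illustrated with a minimal sketch. It is not the project's implementation: it assumes, for illustration only, that the learned distance is a squared projection of the absolute feature difference through a matrix `W`, and that the probability of a true match being closer than a wrong match is modelled with a logistic function. The names `prdc_loss` and `prdc_grad` are hypothetical.

```python
import numpy as np


def _distances(W, diffs):
    """Squared length of each difference vector after projection by W (assumed distance form)."""
    return np.sum((diffs @ W) ** 2, axis=1)


def prdc_loss(W, diff_pos, diff_neg):
    """Negative log-probability that each true match is closer than its paired wrong match.

    W        : (d, k) projection matrix defining the distance (illustrative parametrisation)
    diff_pos : (n, d) absolute feature differences of true-match pairs
    diff_neg : (n, d) absolute feature differences of wrong-match pairs
    """
    z = _distances(W, diff_pos) - _distances(W, diff_neg)
    # -log P(d_pos < d_neg) = log(1 + exp(d_pos - d_neg)), computed stably
    return float(np.sum(np.logaddexp(0.0, z)))


def prdc_grad(W, diff_pos, diff_neg):
    """Gradient of prdc_loss with respect to W."""
    z = _distances(W, diff_pos) - _distances(W, diff_neg)
    s = np.exp(-np.logaddexp(0.0, -z))  # logistic sigmoid of z, computed stably
    pos_term = diff_pos.T @ (s[:, None] * (diff_pos @ W))
    neg_term = diff_neg.T @ (s[:, None] * (diff_neg @ W))
    return 2.0 * (pos_term - neg_term)
```

Plain gradient descent (`W -= step * prdc_grad(W, diff_pos, diff_neg)`) is one way to minimise this loss. Note that the loss only asks for a true match to be closer than a wrong match; it does not force all true-match distances towards zero.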

Behaviour Monitoring

We divide the challenges of behaviour monitoring into three broad categories, based on the experience gained in the SAMURAI project and input from our user partners. These are anomaly / abnormality detection, activity-of-interest detection / classification, and usability support. Note the distinction between anomaly detection and activity detection / classification.

Anomaly detection: Detect activities unlike any typical behaviour that occur in the monitored space. These may be of interest to an end user for a wide variety of reasons, e.g. business, safety or security. These cannot be defined in advance, so detection rather than classification is the only meaningful task. Moreover evaluating such systems is difficult because by definition the real-world events they will be tested on are a priori unknown.

Activity Detection/Classification: Detect and/or classify known behaviours of interest in a particular space. Unlike anomalous behaviours, these are pre-specified and training data may be available. However, unlike classic action classification problems, behaviours of interest in surveillance scenarios are often (i) rare and (ii) subtle, in that training examples of illegal or dangerous behaviour are likely to be few and may be purposefully performed subtly by the subject.

Usability support: User partner feedback indicates that to provide practical value in operational environments: (i) interfaces should be as simple as possible, without requiring expert operators with extensive training, (ii) tuning and maintaining the model should require minimal operator interaction, and (iii) false positives should be as few as possible. For these reasons we have focused on developing models that require minimal interaction to train, tune and maintain.

Anomalous Behaviour Detection

Firstly, we address the problem of fully automated (unsupervised) profiling of public space video data. The aim is to learn a flexible model of typical behaviors in a given surveilled public space, based on which suspicious anomalous behaviors may be detected. This task is especially challenging due to the complexity of the object behaviors to be profiled, the difficulty of robust analysis under the visual occlusions and ambiguities common in public space video, and the computational challenge of doing so in real-time. We address these issues by introducing a new dynamic topic model, termed a Markov Clustering Topic Model (MCTM). The MCTM builds on existing dynamic Bayesian network models and Bayesian topic models, and overcomes their drawbacks on sensitivity, robustness and efficiency. Specifically, our model profiles complex dynamic scenes by robustly clustering visual events into activities and these activities into global behaviour with temporal dynamics. A Gibbs sampler is derived for offline learning with unlabeled training data and a new approximation to online Bayesian inference is formulated to enable dynamic scene understanding and behaviour mining in new video data online in real-time. The strength of this model is demonstrated by evaluating unsupervised learning of dynamic scene models, mining of behaviour and detection of salient events in three complex and crowded public scenes. Anomalous behavior detection performance exceeds that of standard approaches such as HMMs and LDA. The MCTM approach is convenient in that it is fully automatic and unsupervised.

For further details see publications: SAMURAI Deliverable D3.1; Hospedales et al, ICCV, 2009; and forthcoming IJCV journal article.
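
The online scoring step can be sketched in a much simplified form. The sketch below is not the MCTM: it drops the intermediate activity (topic) layer and the Gibbs sampler, and assumes that a behaviour transition matrix and per-behaviour visual-word distributions have already been learned offline. Each incoming clip is filtered through the Markov chain and scored by how poorly the model predicts it. The function name and argument layout are hypothetical.

```python
import numpy as np


def online_anomaly_scores(clips, transition, word_dists, prior):
    """Score each clip by its negative predictive log-likelihood per visual word.

    clips      : (T, V) visual-word counts for T consecutive clips
    transition : (K, K) row-stochastic matrix, transition[i, j] = P(next = j | current = i)
    word_dists : (K, V) strictly positive rows, P(word | behaviour)
    prior      : (K,) belief over the behaviour preceding the first clip
    """
    belief = np.asarray(prior, dtype=float)
    log_words = np.log(np.asarray(word_dists, dtype=float))
    scores = []
    for counts in np.asarray(clips):
        predicted = belief @ transition      # predict the behaviour of this clip
        log_lik = log_words @ counts         # log P(clip | behaviour), up to a constant
        shift = log_lik.max()                # subtract the max for numerical stability
        joint = predicted * np.exp(log_lik - shift)
        evidence = joint.sum()
        scores.append(-(np.log(evidence) + shift) / max(int(counts.sum()), 1))
        belief = joint / evidence            # filtered posterior, carried to the next clip
    return np.array(scores)
```

A clip whose score is well above those seen on typical data would be flagged for inspection. How the threshold is chosen is left open here.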

Weakly Supervised Rare Behaviour Classification

For cases where behaviours of interest are known in advance, we address the problem of learning to detect and classify rare behaviour in monitored public spaces despite the presence of overwhelming typical behaviour. From an operational perspective, this adds value by allowing the operator to train the system in a low-labour and relatively foolproof manner. From a technical perspective, this is of great value because dangerous or illegal activities often have few or possibly only one prior example to learn from, and are often subtle. Rare and subtle behaviour learning is challenging for two reasons: (1) contemporary modeling approaches require more data and supervision than may be available and (2) the most interesting and potentially critical rare behaviours are often visually subtle, occurring among more obvious typical behaviour or being defined by only small spatio-temporal deviations from typical behaviour. We introduce a novel weakly supervised joint topic model (WS-JTM) that addresses these issues. Specifically we introduce a multi-class topic model with partially shared latent structure and associated learning and inference algorithms. These contributions permit modeling of behaviour from as few as one example, even without localisation by the user and when occurring in clutter; and subsequent classification and localisation of such behaviour online and in real time. We extensively validate our approach on two standard public-space datasets where it clearly outperforms a range of contemporary alternatives including SVMs, LDA and S-LDA.

For further details, see publications: SAMURAI Deliverable D3.1 and D3.2; Li et al, ACCV 2010 and Hospedales et al, PAMI, 2011.

Weakly Supervised Multiple Action Learning

We next address the logical “multilabel” extension of the weak-supervision usability challenge posed previously: Can we design a model to learn behaviour from a simple, ambiguous list of tags assigned to videos by an operator?

There has been some success in recognition of simple objects and actions in video; however most of this work requires strongly supervised training data. The supervision cost of these approaches therefore renders them economically non-scalable for real world applications. We therefore address the problem of learning to annotate and retrieve semantic tags of human actions in realistic video data with sparsely provided tags of semantically salient activities. This is challenging because (1) the learning problem is multi-label in nature and (2) realistic videos are often dominated by (semantically uninteresting) background activity unsupported by any tags of interest, leading to a strong irrelevant data problem. To address these challenges, we introduce a new topic model based approach to video tag annotation. Our model simultaneously learns a low dimensional representation of the video data, which of its dimensions are semantically relevant (supported by tags), and how to annotate videos with tags. Experimental evaluation on two public video action/activity datasets, and one large new broadcast news dataset collected by SAMURAI, demonstrates the challenge of this problem, and the value of our contribution in outperforming contemporary alternatives such as ML-SVM and CorrLDA.

For further details, see publications: SAMURAI Deliverable D3.2; Hospedales et al, ICDM, 2011.

Active Learning for Reducing Supervision

Finally, we bridge the previously discussed issues of anomalous behaviour detection and rare behaviour classification. These issues are related because a behaviour which is anomalous/abnormal the first time it is seen is likely to be of interest for future detection. We should provide the ability to learn a model from any such observed instances – which would be more accurate than continuing to rely on the same unimproved outlier detector. Moreover, in the general case where the operator starts off with few or no activities of prior interest, we wish to discover and learn the space of interesting activities, again with minimal interaction.
We present a new active learning approach to incorporate human feedback for on-line unusual event detection. In contrast to most existing unsupervised methods that perform passive mining for unusual events, our approach automatically requests supervision for critical points to resolve ambiguities of interest, leading to more robust and accurate detection of subtle unusual events. The active learning strategy is formulated as a stream-based solution, i.e. it decides on the fly whether to query for labels. It adaptively combines multiple active learning criteria to achieve (i) quick discovery of unknown event classes and (ii) refinement of the classification boundary. Experimental results on busy public space videos show that with minimal human supervision, our approach outperforms existing supervised and unsupervised learning strategies in identifying unusual events. In addition, better performance is achieved by using an adaptive multi-criteria approach compared to existing single criterion and multi-criteria active learning strategies.

For further details, see publications: SAMURAI Deliverables D3.1 and D3.2; Hospedales et al, PAKDD, 2011; Loy et al, ACCV 2010 and forthcoming journal article.
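
A stream-based query decision that combines two criteria might look like the following sketch. It is an illustration under stated assumptions, not the published algorithm: class likelihoods are assumed to be normalised to [0, 1], at least two known classes are assumed, and the weight update rule, the threshold and all function names are hypothetical.

```python
import numpy as np


def novelty_score(class_likelihoods):
    """High when no known class explains the sample (favours discovery of unknown classes)."""
    return 1.0 - float(np.max(class_likelihoods))


def uncertainty_score(posteriors):
    """High when the two most probable classes are nearly tied (favours boundary refinement)."""
    top_two = np.sort(np.asarray(posteriors, dtype=float))[-2:]
    return 1.0 - float(top_two[1] - top_two[0])


def should_query(class_likelihoods, posteriors, weights, threshold=0.5):
    """Decide on the fly whether to ask the operator for a label for this sample."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    score = w[0] * novelty_score(class_likelihoods) + w[1] * uncertainty_score(posteriors)
    return bool(score > threshold)


def update_weights(weights, criterion, reward, rate=0.2):
    """Shift weight towards the criterion (0 = novelty, 1 = uncertainty) whose query helped."""
    w = np.array(weights, dtype=float)
    w[criterion] *= np.exp(rate * reward)
    return w / w.sum()
```

In this sketch `reward` stands for any measure of how much the model improved after the operator answered a query triggered by the given criterion. Re-weighting in this way lets the emphasis drift from class discovery towards boundary refinement as fewer new classes appear.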

(C) Automated Focus of Attention

Three central themes of SAMURAI are Intelligent Surveillance, Enhanced Situational Awareness and Focus of Attention (FoA), the last being the title of this Work Package. That FoA is crucial to both of the former is almost self-evident. Essentially this means trying to reduce the information being gathered about inconsequential events and locations while gathering more and better information about events and locations of significance. This and related topics are dealt with in some detail in section 1 of D4.1.
In the SAMURAI context, the FoA brief embraces any means used to achieve:

(a) reduction of resource expenditure on the scrutiny of superfluous information (i.e. information which provides no cause for alarm)

(b) improved capture of significant information

(c) the intelligent use of intelligence to sift, pursue and interpret surveillance information

Item (a), implementing ways of dismissing benign events and features with the least possible effort, means lower manpower consumption without jeopardizing reliability. Item (b) embraces both qualitative and quantitative enhancement of data capture, allowing quicker and more accurate assessment of potential alarms. This may lead to some raised alarms being dismissed as false more quickly, reducing resource expenditure, and to a quicker and more appropriate response to alarms which are confirmed.

Surveillance is in principle inefficient. A flood of data has to be marshalled with very few items of significant content. The raison d'être for FoA is the aspiration to narrow down the data set to be dealt with, focussing on significant content, at the least possible expenditure of time and resources. Also, by gathering more, and more pertinent, information there is a greater likelihood that predictions of how a situation will evolve in time and space will be correct, contributing to the same ends.

Item (c) above calls for efforts towards both improving the performance and practicality of automatic intelligence in the system and fuller utilization of the human intelligence available, the “man-in-the-loop” paradigm.

The ways in which this aspiration has been tackled in SAMURAI fall under two main headings, Focus of Attention using PTZ Cameras and Focus of Attention using Mobile Cameras. These headings align with the initially defined tasks 4.1 and 4.2 from the Description of Work.

Focus of Attention using PTZ Cameras

We note that the initial project proposal intended to investigate live PTZ control for focus of attention. This idea was ultimately infeasible due to security constraints for PTZ control in the operational environments of BAA and ELSAG, as well as ethical constraints in potentially obtaining identifiable personal information (faces) of people in each site. The nature of the work has therefore changed relative to the initial proposal towards a contribution on filtering to improve effectiveness of human operators and higher-level algorithms.
In Phase 1 of the project (see Deliverable 4.1 Task 1.1) we presented an abstract theoretical framework for understanding focus of attention. We have further demonstrated two concrete applications of FoA, for humans and machines respectively.

Phase 1: Automated Active Perception

All surveillance systems operate under two hard resource constraints: data collection, and attentional processing. The data available for a human operator or SAMURAI system to analyse is a small and imperfect window into the true state of the world. It is affected by constraints imposed by the physical mechanism collecting data, notably pan-tilt-zoom (PTZ) parameters as well as camera resolution, frame rate, etc. Analysis of this data, by human or machine, is also resource constrained. For example, typically the number of operators available to monitor cameras is many times less than the number of cameras; and available CPU time is typically insufficient to fully analyse every frame from every camera in real time given the computational intractability of most state of the art video analysis algorithms.
Experienced human operators optimize their performance within these constraints. For example, they control camera parameters to preferentially obtain information about known hot-spots of activity or about a specific suspicious person or unusual activity. Similarly, they may focus their own attentional resources on particular zones, behaviours or times of day which they have learned are likely to be of interest. In both cases they aim to maximize the effectiveness of the resulting surveillance cover under the constraint of available hardware and human attention. This process of intelligently allocating surveillance observations and attentional resources is formally known as Active Perception (AP). We have shown a framework for automated active perception which makes several contributions to the state of the art in order to better achieve SAMURAI goals.

The Active Perception paradigm aspires to make better use of an installation consisting of static cameras, at least some of which have controllable PTZ. In a legacy system, there is staff in a control room observing views from the cameras. When any event or feature is detected which might be a candidate for an alarm the staff makes use of the PTZ to gain more and/or better-quality information from the space-time vicinity of the candidate. Here the aspiration is to employ pre-emptive decision-making about what to focus on. When Active Perception is machine-aided, the aspiration is to shunt some of the human effort and vigilance required over to a computer-based system “watching” and recording the camera streams.
This is an area where the progress made in WP2 (Object Detection and Categorization) and WP3 (Behaviour Monitoring) provides tools which help FoA. Machine support for Active Perception requires computer support which can replace or assist human operator vigilance and response, for which those Work Packages have provided a foundation.

We have introduced a theoretical framework for focus of attention and considered specifically issues related to focus of attention using PTZ cameras (Task 2.1). The formalism for focus of attention computation models how attention should optimally be focused to maximize the effectiveness of surveillance coverage while minimizing the cost required to provide it. We have illustrated these ideas by way of a simple synthetic example, and described how they may be applied to both digital and optical PTZ control.
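
The trade-off between surveillance effectiveness and cost can be illustrated with a toy allocation. The sketch below is not the formalism of D4.1: it assumes that each zone has an estimated benefit and a strictly positive attention cost, both supplied by the caller, and it uses a simple greedy benefit-per-cost heuristic under a fixed budget. The function name and the zone names in the usage note are hypothetical.

```python
def allocate_attention(zones, budget):
    """Greedily pick zones to attend to, best benefit-per-cost first, within a budget.

    zones  : dict mapping zone name -> (expected_benefit, cost), with cost > 0
    budget : total attention available (e.g. operator-minutes or CPU-seconds)
    """
    ranked = sorted(zones.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
    chosen, spent = [], 0.0
    for name, (benefit, cost) in ranked:
        # Skip zones with no expected benefit or that would exceed the budget
        if benefit > 0 and spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen
```

For example, `allocate_attention({"gate": (0.9, 1.0), "corridor": (0.2, 1.0), "kerb": (0.6, 2.0)}, budget=2.0)` selects the gate first and then the corridor, because the kerb no longer fits the remaining budget. A greedy rule is not guaranteed to be optimal, but it is cheap enough to re-run whenever the benefit estimates change.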

Phase 2: Human and Machine Focus-of-Attention

In this phase we developed two concrete applications of FoA, for humans and machines respectively.
Human FoA: A key point highlighted by SAMURAI user partners is that the number of system operators and their cognitive resources are highly limited relative to the amount of data to be monitored. FoA can use human cognitive resources more efficiently by automatically alerting the operator to a zone where the activity may be of interest or by filtering the multitude of recorded video streams to a small number of most interesting streams.
Given the large number of scenes to be monitored, it is very beneficial to have an automated way to focus the attention of the operators on interesting/salient events taking place in crowded public scenes. We have presented a motion saliency detection method based on Fourier spectral analysis on the motion space. Given a scene with the presence of dominant crowd flows, the proposed method aims to discover and localise interesting regions where flow is salient in relation to the rest of the scene.
Note that all the abnormality detection methods developed in WP3 Task 3.3 (see Deliverables D3.1 and D3.2) are directly applicable to the interpretation of the Human FoA task.

We have developed a computationally efficient and fast global motion saliency detection method to facilitate focus-of-attention in motion space. We have shown its effectiveness in discovering and locating interesting regions in SAMURAI scenarios, and demonstrated its potential in counter flow detection and unstable region detection in extremely crowded scenes. In contrast to most existing methods that rely on object-centred tracking and segmentation, the proposed method exploits holistic flow patterns and spectral analysis to detect salient regions. It is thus suitable for motion interpretation for crowded public spaces, where object tracking is intrinsically hard.
Motion saliency

Given the large number of scenes to be monitored, it is critical to have an automated way to focus the attention of CCTV operators on interesting/salient events taking place in crowded public scenes. A motion saliency detection method has been developed based on Fourier spectral analysis on the motion space. Given a scene with the presence of dominant crowd flows, the proposed method aims to discover and localise interesting regions where the flow is inconsistent with the rest of the scene.
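
One common form of Fourier-based saliency is the spectral residual, and the sketch below applies it to a single 2-D map derived from the motion field (for instance one optical flow component). It is shown as an assumed stand-in, since the exact spectral analysis used in SAMURAI is not specified here. The optical flow computation is left to the caller, and the function names are hypothetical.

```python
import numpy as np


def box_blur(img, size=3):
    """Mean filter with edge padding (kept dependency-free on purpose)."""
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (size * size)


def spectral_residual_saliency(flow_map):
    """Return a saliency map in [0, 1] with the same shape as the input map."""
    spectrum = np.fft.fft2(flow_map)
    log_amplitude = np.log(np.abs(spectrum) + 1e-8)
    phase = np.angle(spectrum)
    # The residual is what remains of the log spectrum after removing its smooth trend
    residual = log_amplitude - box_blur(log_amplitude)
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    saliency = box_blur(saliency)
    return saliency / saliency.max()
```

Because the whole map is processed at once, no object tracking or segmentation is needed, which matches the holistic intent described above.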
Machine FoA: Similarly to the cognitive resource constraints imposed by human operators, machine behaviour analysis systems have limited CPU time available. This prohibits full analysis of all of the numerous simultaneously recorded video streams. By leveraging a cascade of models of increasing computational complexity - where early models have low false positive rates - to filter the input, the “attention” of the more computationally expensive visual analysis modules can be allocated efficiently to study useful instances.
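
The cascade idea can be sketched generically as follows. The stage functions and thresholds are placeholders supplied by the caller; nothing here reflects the specific models used in SAMURAI, and the function name is hypothetical.

```python
def cascade_filter(items, stages):
    """Pass items through increasingly expensive stages, keeping only those that survive.

    items  : iterable of candidates (e.g. video clips or detected events)
    stages : list of (score_fn, threshold) pairs, ordered from cheapest to most expensive
    Returns the survivors and the number of items each stage had to evaluate.
    """
    survivors = list(items)
    evaluations = []
    for score_fn, threshold in stages:
        evaluations.append(len(survivors))
        survivors = [item for item in survivors if score_fn(item) >= threshold]
    return survivors, evaluations
```

The `evaluations` list makes the saving visible: each later, costlier stage is run only on the items that earlier stages did not discard.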

Machine FoA for efficient tracking and matching

A crucial element in maintaining focus-of-attention on a moving target with static cameras is cross-camera matching (see Deliverable D3.2 Task 3.2), a very challenging task. This is especially so without manual calibration of the multi-camera network parameters such as camera layout and view geometry. Additionally, even within-camera tracking with multiple targets is difficult without a good predictive model of each target’s dynamics. We have developed a simple unified behaviour model, which can support both of these tasks, making them both more efficient and more accurate.

We have addressed the run-time computational complexity of using such a model and generalised the concept to cover multi-camera spaces. It is thus suitable for supporting real-time monitoring of public spaces with a mixture of overlapped and non-overlapped cameras. Extensive examples of its application with behaviour profiling are presented in D4.2 section 1.3.

Focus of attention using mobile/wearable cameras with audio and position sensors

Mobile surveillance device

It was apparent from the outset of SAMURAI that the inclusion of body-worn surveillance modules, with wireless data communication as well as on-board recording, could greatly enhance the performance of the heterogeneous network of a SAMURAI installation, in terms of both FoA and overall situational awareness, by:
• providing CCTV coverage of areas and from points-of-view inaccessible to static cameras
• allowing closer viewing of scenes of specific interest
• bringing to bear the acuity of a human operative combined with the data capture and recording capabilities of CCTV
• enabling quick adaptation of the vigilance template (i.e. what features are most worthy of attention)
• including close-up audio information
• including spoken information from the operative close to the scene
• automatic location of the operative, and therefore of any event she is witnessing, even if that event is occluded from the static camera system

The Ninja

The need for a mobile surveillance module with on-board processing and wireless communication was clear. It was also established that no off-the-shelf products would fit the bill.
We undertook the design, manufacture and programming of a custom wearable module which became known as the Ninja (stealthy spy). The Ninja consists of a wearable computer, known as the BPS Ngine, with a video-audio capture and recording subsystem, and support for a variety of USB-connected peripherals, including a camera and a wireless modem. The Ngine is very compact, featuring a high ratio of computing power to electrical power consumption and size. Already during Phase 1 a small number of Ninjas were deployed and used in SAMURAI experiments. Manufacture has gone through four cycles of hardware design and six revisions. During Phase 2 a batch of 50 units was manufactured, along with robust pouches for wearing the device, and made available to all participants in SAMURAI.

Some similar products have since become available, but do not compare with the Ninja, particularly in terms of computing power, connectivity and above all flexibility. The Ninja represents an advance beyond the state of the art. We would note that thanks to these features the core module, the Ngine, is applicable for a wide range of other purposes apart from mobile surveillance. As the underlying firmware is based on a standard Linux, it is open for users to develop their own applications, be they for surveillance or any other purpose.

All the units made for SAMURAI use come with a touch-screen display and graphic user interface. In some other uses the display can be omitted, all interaction being effected via USB or a wireless link.

The Ngine can accept a wide variety of third-party peripherals, thanks to five micro-USB sockets (which have since become the standard for most mobile devices). In a Ninja these devices include of course a camera, usually a Wifi modem, optionally a Bluetooth modem, etc. It also features an on-board compass and accelerometer whose output is available to any applications.

Key features of a Ninja include:
• the compact, power-efficient small computer known as the Ngine
• a state-of-the-art wearable USB camera with up to 2 megapixel resolution and microphone
• on-board video compression (H.264 or MJPEG), and audio compression (AAC)
• on-board recording on micro-SD card (up to 32 GB)
• Wifi communication for AV data streaming and download, and remote control (via Wifi USB dongle)
• docking via USB cable to a host computer for data download and programming
• removable battery pack, in two sizes
• longer than 9 h continuous recording or 4 h with continuous wireless streaming
• optional Bluetooth USB dongle and remote controller
• a robust belt-worn pouch

The Wifi network

A vital feature of the heterogeneous SAMURAI system is that the mobile surveillance units are incorporated at all times via a wireless link. As described above the Ninja has been built to make use of such a network. We also dedicated resources to investigating and evaluating different WLAN hardware and configurations. It was found that considerable performance gains could be achieved by using an IEEE 802.11n wireless network and adaptive beam-forming wireless access points.
We have recommended the Ruckus WiFi mesh networking system for use in the SAMURAI project. Proof-of-concept tests were done in Tallinn with a Ruckus network and at the ESA Projekt site in Katowice, Poland using a similar wireless mesh system, Hewlett-Packard ProCurve. The Ruckus system is currently more adaptive and provides a self-healing multi-path network with auto-adjusting directional antennas.

Note meanwhile that it broadcasts the Beacon Frames in circular antenna mode, which is important for the RSSI-based positioning system – see below.

Mobile positioning

Information about the position, and if possible the attitude, of a mobile surveillance device is valuable for the SAMURAI Focus-of-Attention (FoA) paradigm for two main reasons: the location of a scene being examined by the agent can be known automatically, and, if the positions of agents are known, those best positioned to deal with a situation can be assigned to it.

Knowing the position/trajectory of a mobile Ninja and its wearer brings several important advantages to the SAMURAI system.
(a) The view being recorded/transmitted by the unit can be correlated with the known topology of the space.
(b) The camera's position can be shown on a visual 3-D representation of the site to give control-room staff a better awareness of the situation.
(c) For any data-fusion or 3-D modelling from multiple camera views, it is important to know the point of view (PoV) of all the cameras.
(d) If a FoA action (see eq.1.6) requires a close-up view, one or more Ninja wearers need to be alerted to proceed to a certain place at a certain time. If their current positions are known then control-room staff or other roaming personnel can know which Ninjas can be most efficiently deployed. As the staff are, in general, trained to deal with alarms, getting them to the right place more quickly can help contain an alarm situation.
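
Point (d) can be illustrated with a minimal dispatch sketch. It assumes 2-D positions in a common site coordinate frame and uses straight-line distance, which ignores walls and floors; a real deployment would more plausibly use walking distance over the site map. The function name and the wearer labels in the usage note are hypothetical.

```python
import math


def nearest_wearers(event_xy, wearer_positions, count=1):
    """Return the names of the `count` wearers closest to the event location.

    event_xy         : (x, y) of the event in site coordinates (metres)
    wearer_positions : dict mapping wearer name -> (x, y) last known position
    """
    ranked = sorted(
        wearer_positions.items(),
        key=lambda kv: math.dist(kv[1], event_xy),  # straight-line distance (assumption)
    )
    return [name for name, _ in ranked[:count]]
```

For example, with wearers at `{"A": (0, 0), "B": (10, 0), "C": (3, 4)}` and an event at `(4, 4)`, the two nearest wearers are `"C"` followed by `"A"`.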

As explained in D4.1, during Phase 1 we investigated a variety of positioning techniques in some depth. GPS of course springs to mind as a good option, but is unfortunately out of the question as the mobile agents will most often be indoors or even underground. We chose to make use of the WiFi positioning solution offered by the Finland-based company Ekahau. Good relations were established with Ekahau, and positioning experiments and demonstrations have been performed at the BPS site in Tallinn, at EsaProjekt in Katowice and at the Elsag test site in Genoa. More details are to be found in D4.1 section 3.3.

One of the advantages of Wifi positioning is that a SAMURAI installation needs a dedicated WLAN in any case for reliable high-speed data transfer. To utilize the Wifi positioning system, however, a detailed site survey (a walk about) needs to be done to map the strength of the wireless signals in as much detail as possible.

As noted, accuracy depends a lot on how carefully and thoroughly the site survey is done. It also depends on how many transmitters contribute to the fingerprint, and on how convoluted the topology of the site is (the more convoluted the better!). A cautious upper limit on the positioning error is a radius of 10 m, though in our experiments <5 m was almost always obtained. On the ESA Projekt site in Katowice the accuracy was consistently better than 2 m radius.
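
The principle of fingerprint-based positioning can be sketched with a weighted nearest-neighbour lookup against the site survey. This is a generic textbook approach shown for illustration only; it is not the Ekahau algorithm, whose internals are not described here. The function name, array layout and parameter `k` are hypothetical.

```python
import numpy as np


def locate(rssi, survey_rssi, survey_xy, k=3):
    """Estimate a position from one RSSI reading using the k closest surveyed fingerprints.

    rssi        : (A,) signal strengths (dBm) from A access points at the unknown position
    survey_rssi : (N, A) fingerprints recorded during the site survey
    survey_xy   : (N, 2) positions (metres) at which those fingerprints were recorded
    """
    survey_rssi = np.asarray(survey_rssi, dtype=float)
    survey_xy = np.asarray(survey_xy, dtype=float)
    # Distance in signal space between the reading and every surveyed fingerprint
    dist = np.linalg.norm(survey_rssi - np.asarray(rssi, dtype=float), axis=1)
    nearest = np.argsort(dist)[:k]
    weights = 1.0 / (dist[nearest] + 1e-6)  # closer fingerprints count for more
    return (weights[:, None] * survey_xy[nearest]).sum(axis=0) / weights.sum()
```

The sketch also shows why survey density and the number of contributing transmitters matter: the estimate can only interpolate between surveyed points, and more access points make fingerprints easier to tell apart.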

Other positioning methods

The UNIVR team are developing methods for determining the location of a PoV (point-of-view) by reference to features recognized in the video frames taken at that point by comparison with previous mappings. The MJPEG images streamed by the Ninja to the Elsag backend system were judged to be of sufficient quality to enable this method to be applied. At this time it is still a long way from a real-time procedure able to provide the kind of immediate data needed by live focus-of-attention methods, but the performance of the algorithms is promising (see WP 2).

The Ngine's on-board compass and accelerometer open up the possibility of using dead-reckoning to enhance position determination by other methods, and to provide attitude information. The open platform nature of the Ngine lends itself to such development by whichever partner takes up that challenge.
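
A minimal dead-reckoning sketch is given below. It assumes that step detection from the accelerometer and a step-length estimate are provided by the caller, that compass headings are in degrees clockwise from north, and that x points east and y points north. None of these conventions come from the Ngine firmware, and the function name is hypothetical.

```python
import math


def dead_reckon(start_xy, steps):
    """Accumulate a track from a known start position and a sequence of detected steps.

    start_xy : (x, y) starting position in metres, e.g. the last WiFi position fix
    steps    : iterable of (heading_deg, step_length_m) pairs
    """
    x, y = start_xy
    track = [(x, y)]
    for heading_deg, length in steps:
        theta = math.radians(heading_deg)
        x += length * math.sin(theta)  # east component
        y += length * math.cos(theta)  # north component
        track.append((x, y))
    return track
```

Errors accumulate with every step, so in practice such a track would be re-anchored whenever another positioning method provides a fresh fix.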

Audio focussing

We had initially deemed that focussed capture of sounds as well as images could contribute to the situational awareness enhancement which is one of SAMURAI's key themes. The principle is to use some combination of microphone configurations and software so as to be able to selectively “eavesdrop” on a particular location at a distance. Considerable background research was done towards this end and we even initiated collaboration with a university team.

This line of research was eventually abandoned, both because of the ethical objections raised, and the realization that the scope of the work required exceeded the resources of the partners involved. The background research has however been reported in some detail in D4.1.

On-board versus backend processing

We set out to evaluate the relative merits of performing video content analysis inside the Ninja or delegating it to the backend system receiving the audio/video stream from the Ninja. The Ngine does permit some degree of video content analysis on board, as has been shown with an example of face detection. However, thanks to enhancement of the image quality and streaming capabilities achieved later on in the project, it emerged that in most cases such processing could be more appropriately delegated to the recipient system.

(D) Data Fusion and Visualisation

This section of the report summarises the overall achievements that have been made in WP5 over the duration of the project. These achievements are cross-referenced to the original scientific objectives, novelties, and measurable technical outcomes defined in the SAMURAI SoW. It is demonstrated that WP5 has successfully addressed all of the relevant areas of the SoW, and has advanced the state of the art in the associated technical areas.

The key scientific and technical achievements associated with WP5 are:
• The development of a state of the art 3D reconstruction pipeline which is capable of creating immersive 3D representations of scenes from collections of 2D images or video. The pipeline has been demonstrated to provide performance on par with other state of the art approaches, and is significantly more computationally efficient. Furthermore the pipeline is capable of functioning in both calibrated and un-calibrated cases. Figure 2 in the attachment demonstrates the application of the pipeline to the reconstruction of a scene.

• The successful application of the 3D reconstruction pipeline to the final SAMURAI evaluation data sets, allowing a 3D reconstruction of the Elsag Datamat site to be rendered with the main SAMURAI GUI. This demonstrates both the viability of the pipeline in real-world scenarios and the integration between the 3D Reconstruction and Visualisation module and the SAMURAI GUI module. Figure 3 in the attachment shows the reconstruction of the Elsag site.

• The development of state of the art mobile camera localisation algorithms. These algorithms were applied to the final SAMURAI evaluation data sets and have been shown to be capable of localising a mobile operator (who wears a mobile camera) with an average error of less than 5 m. Figure 4 in the attachment demonstrates the use of the mobile camera localisation algorithm on the final SAMURAI evaluation data set.

• The development of a state of the art GUI for the SAMURAI programme which provides a live, interactive 3D representation of the scene, live camera views (from both fixed cameras and mobile cameras), and event and alert warnings. Figures 5, 6 and 7 in the attachment provide relevant screenshots of the GUI.

• The application of the SAMURAI GUI to the final SAMURAI evaluation test sets. This has successfully demonstrated the advanced features of the GUI. Furthermore, the SAMURAI GUI has served as the main interface for demonstrating the overall integration activities associated with the SAMURAI project.

• The development of a novel Data Fusion for Behaviour Analysis algorithm which fuses automatically generated behaviour alerts, a-priori watch-list information, and human input, to produce a refined, prioritised event list.

• The development of a real-time Data Fusion Demonstrator, which embodies the Data Fusion for Behaviour Analysis algorithm. The demonstrator has been applied to the final SAMURAI evaluation data sets and clearly demonstrates the increased level of Situational Awareness that the Data Fusion process provides. A screenshot of the demonstrator is provided below.

• The development of a Data Fusion for Mobile Operator Localisation algorithm. This algorithm fused the outputs from the mobile camera localisation, fixed camera tracking, and WiFi positioning modules to provide a more robust and accurate estimate of the location of mobile operators.

• A review of appropriate Data Fusion architectures for the SAMURAI context, and the proposal of an approach to automatically mitigate the Data Incest problem, in the presence of uncertainty in the reliability of the information sources.
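
The report does not specify how the Data Fusion for Mobile Operator Localisation algorithm combines its inputs. As a purely illustrative sketch, the following shows inverse-variance weighting, a standard way to combine independent position estimates such as those from the mobile camera localisation, fixed camera tracking and WiFi positioning modules. The `Estimate` type, the source names and the uncertainty values are assumptions made for this example and are not taken from the SAMURAI system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Estimate:
    """One 2D position estimate (hypothetical structure, for illustration only)."""
    source: str   # e.g. "mobile_camera", "fixed_camera", "wifi" (assumed names)
    x: float      # metres
    y: float      # metres
    sigma: float  # assumed isotropic standard deviation, in metres


def fuse(estimates: List[Estimate]) -> Tuple[float, float, float]:
    """Fuse independent estimates by inverse-variance weighting.

    Returns (x, y, sigma) of the fused estimate. Sources with a smaller
    sigma (more reliable) contribute more to the result.
    """
    if not estimates:
        raise ValueError("need at least one estimate")
    weights = [1.0 / (e.sigma ** 2) for e in estimates]
    total = sum(weights)
    x = sum(w * e.x for w, e in zip(weights, estimates)) / total
    y = sum(w * e.y for w, e in zip(weights, estimates)) / total
    fused_sigma = (1.0 / total) ** 0.5
    return x, y, fused_sigma


if __name__ == "__main__":
    readings = [
        Estimate("mobile_camera", 10.0, 4.0, sigma=5.0),
        Estimate("fixed_camera", 12.0, 5.0, sigma=2.0),
        Estimate("wifi", 15.0, 8.0, sigma=8.0),
    ]
    print(fuse(readings))
```

This weighting is only valid when the sources are statistically independent. If the same underlying observation reaches the fusion node through more than one path (the Data Incest problem mentioned above), the fused uncertainty will be over-optimistic, which is why a dedicated mitigation approach is needed.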

The original scientific objectives and novelties, and measurable outcomes associated with WP5 are outlined below (as defined in the original SoW). Comments regarding their fulfillment are also provided.

Scientific Objectives

"Develop innovative tools using multi-modal data fusion and visualisation of heterogeneous sensor input to enable more effective control room operator queries."
• This objective has been fulfilled. The Data Fusion algorithms and concepts exploit disparate multi-modal data sources, and the visualisation techniques and GUI present these heterogeneous sensor inputs (and processed results) in an efficient and effective manner.

"Develop a framework to facilitate the acquisition, analysis and fusion of information from multiple and distributed data sources, the primary objective being to transform a large volume of information into useful knowledge that can be presented in a timely fashion to an operator or control system."

• This objective has been fulfilled. A flexible acquisition and analysis framework was constructed by the consortium, and the outputs of WP5 specifically allow this information to be presented to the user in a form which represents "useful" knowledge rather than raw data. This is aided by the 3D Reconstruction and Visualisation module, the main SAMURAI GUI, and the real-time Data Fusion Demonstrator (DFD).

Main SAMURAI Novelties

"SAMURAI will employ networked heterogeneous sensors rather than isolated visual sensors so that multiple complimentary sources of information are fused to visualise a more complete and "big picture" of the area under surveillance."

• WP5 has directly addressed this intended novelty. The main SAMURAI GUI presents data from heterogeneous sources of information, in addition to fused data produced by the Data Fusion algorithms. The GUI, in conjunction with the 3D Reconstruction and Visualisation module, provides increased Situational Awareness (i.e. a "big picture") of the area of interest. The Data Fusion algorithms developed under WP5.2 exploit complementary data sources to improve object localisation accuracy and operator understanding.

"SAMURAI will integrate fix-positioned CCTV video input with control room operator queries and mobile sensory input from patrolling staff for more effective man in the loop decision making. This is in contrast to current system which rely purely on processing sensor data."

• This novelty has been directly addressed by the Data Fusion algorithms developed under WP5.2. The real-time DFD facilitates live, man-in-the-loop Situational Awareness and decision making. The underlying Data Fusion algorithms fuse incoming behaviour events with a-priori information and the end-user's current priority set to ensure that the most pertinent data is highlighted for analysis.

Scientific Objectives and Measurable Technical Outcomes

"Determine the level of spatial and temporal registration and alignment of multi-modal data required to provide enhanced situation awareness of the SAMURAI system."

• The main SAMURAI GUI provides enhanced situational awareness by automatically handling the spatial and temporal alignment of all the SAMURAI data sources. The resulting user-interface is therefore intuitive and efficient to use.

"Develop a coherent and robust methodology for 3D visualisation and interpretation that enables an user to understand, and respond to, an evolving situational awareness picture."

• All of the modules developed under WP5.2 contribute towards providing a coherent and robust representation of the scenario as it unfolds. The main SAMURAI GUI and the 3D Reconstruction model help the user in understanding the spatial and temporal relationship between different data sources, and the Data Fusion modules aid the user in analysing the current situation and any associated threats.

"Develop and integrate a Data Fusion architecture/algorithm that can provide effective and efficient data selection, and is capable of identifying and correcting data conflicts and system errors."

• The Data Fusion modules developed under WP5.2 aid the end-user directly in the "data selection" process. This is achieved by dynamically prioritising events and alerts according to the user's current priorities. In addition, the Data Fusion algorithm that underpins the process automatically identifies and handles data conflicts and uncertainty.
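
The prioritisation idea described above (behaviour alerts re-ranked according to the user's current priorities and a-priori watch-list information) can be sketched as follows. This is a minimal illustration under stated assumptions, not the SAMURAI algorithm: the `Alert` fields, the multiplicative scoring rule and the `watch_list_boost` parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alert:
    """A behaviour alert (hypothetical structure, for illustration only)."""
    alert_id: str
    kind: str                    # e.g. "loitering", "abandoned_luggage" (assumed labels)
    confidence: float            # detector confidence in [0, 1]
    on_watch_list: bool = False  # a-priori watch-list match


def prioritise(alerts: List[Alert],
               priorities: Dict[str, float],
               watch_list_boost: float = 2.0) -> List[Alert]:
    """Return alerts sorted from most to least pertinent.

    score = confidence * operator weight for the alert kind * watch-list boost.
    Alert kinds without an explicit operator weight default to 1.0.
    """
    def score(alert: Alert) -> float:
        weight = priorities.get(alert.kind, 1.0)
        boost = watch_list_boost if alert.on_watch_list else 1.0
        return alert.confidence * weight * boost

    return sorted(alerts, key=score, reverse=True)


if __name__ == "__main__":
    alerts = [
        Alert("A", "loitering", 0.9),
        Alert("B", "abandoned_luggage", 0.6),
        Alert("C", "loitering", 0.5, on_watch_list=True),
    ]
    # Default priorities, then the operator raises the weight of luggage alerts.
    print([a.alert_id for a in prioritise(alerts, {})])
    print([a.alert_id for a in prioritise(alerts, {"abandoned_luggage": 3.0})])
```

Because the ranking is recomputed from the current priority set on every call, a change made by the operator takes effect the next time the list is refreshed.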

Going Beyond the State of the Art

The achievements made under WP5 of the SAMURAI programme contribute significantly towards the SAMURAI programme's advancement of the state of the art. The SAMURAI system represents a highly integrated and advanced Situational Awareness system. Many of the capabilities that the SAMURAI system provides are facilitated by the modules developed in WP5. The 3D Reconstruction and Visualisation module provides an immersive 3D scene representation that is rendered by the main SAMURAI GUI. The combination of these two capabilities allows the user to understand the spatial and temporal relationship between different camera sources, and the objects and people that they observe. This contrasts with existing systems, which often do not provide the user with sufficient contextual references. The use of Data Fusion algorithms ensures that the SAMURAI system displays the most pertinent data and knowledge regarding the current situation. In addition, it allows the end-user to alter the current "priorities" in real-time, which results in almost immediate re-prioritisation of behaviour alerts. A-priori information that is available is also fused with the behaviour alerts to improve the end-user's decision making. Unlike existing systems, the SAMURAI system therefore provides man-in-the-loop decision making.

Potential Impact:

The SAMURAI project has been carefully formulated to have extensive technical, social and economic impacts upon the many security stakeholders in the EU:

• Technical: SAMURAI has developed groundbreaking technology that can be interfaced with existing CCTV systems that are already employed widely within the EU. By concentrating the technology developments on multiple cameras and mobile cameras, and by incorporating strong end users with a widely deployed CCTV system in the Consortium, many of the limitations of the existing state of the art have been overcome. The performance of the system has been quantified in real life scenarios to give a firm base for future technical developments.

• Social: Security in public places is required for the correct functioning of society. However, existing CCTV systems are not effective at preventing many incidents: it is virtually impossible for an operator looking at a bank of monitors to notice unusual behaviour, because the human brain rapidly loses concentration after watching CCTV output for just a few minutes. Rather, existing CCTV systems are frequently used to analyse incidents after the event from recordings made. In contrast, SAMURAI allows prevention of, and rapid response to, events as they unfold. Consequently, the main social impact of SAMURAI is increased public confidence in security systems in public places.

• Economic: The use of CCTV as a security and management aid is widespread in the EU and offers a huge marketplace for European business. One of the consortium partners (Elsag) is a major world player in this market. However, CCTV hardware is increasingly being provided by imports from China and India. In order to compete, EU companies will need a technology advantage. SAMURAI technology can provide higher ‘added value’ to installed CCTV systems and give European producers a substantial advantage in the marketplace.

Main Dissemination Activities


A snapshot of the project website is shown in Figure 8 in the attachment.

In creating a web site for the project we wanted an attractive, professional look, but above all one that communicates the image of a project with challenging scientific and industrial objectives. We also wanted something that could grow and evolve during the project and make visitors curious to explore the new parts and come back for new information. Workshop announcements have been a great opportunity for attracting visits to the web site. The announcements appear on the first page and reference to the site is continuously made in the circulated mails.

The aim was to communicate the objectives of the project effectively to a broad audience (even if limited to people with some technical knowledge, or at least some knowledge of the application context), on the basis that a web site should be more engaging than a brochure. To achieve this, the use of video clips has been favoured. The videos make a direct, intuitive connection between the technology and the day to day scenarios we all experience.


Numerous events have taken place besides the officially planned ones. These meetings deal principally with the relationship between members of the consortium and their respective organizations (typically internal presentations or reviews), but also with dissemination to the sales organizations and, in general, to all the people dealing with innovation. A detailed events list can be found later in this report.


User Group:

The User Advisory Group (UAG) has been used to steer the project and manage the wider audience of users and product developers.

The UAG members, usually not full members of the Consortium, have played important roles in advising on user requirements and specifications, and in providing a variety of scenarios, data capture and system evaluation.


Three major workshops have been held in the course of the project.

1st End User Workshop - London, 5th and 6th November 2009 - This workshop, like the last, has been held at the site of the major end user and inspirer of the application themes addressed in the research. The participants, from a cross section of academia and industry, have highlighted the problems of VA and discussed the possible approaches. This workshop has been particularly useful in highlighting possible pitfalls and focusing on the “true” objectives.

2nd End User Workshop - Genova, 15th and 16th July 2010 - This workshop has been held at the integration and test site. The audience has been larger than at the previous workshop, and we have also seen the presence of a company (BRSLABS, USA) that is actually trying to market a product with learning capabilities. The most notable difference between the first and second workshop has been a greater emphasis on implementation and success stories.

3rd and final Workshop - London, 27th October 2011 - This workshop has been held at the site of the end user BAA, whose surveillance infrastructure serves some of the busiest airports in the world. BAA's interest in “intelligent” systems, with the potential of reducing the workload on operators and at the same time increasing system effectiveness, seems natural. System learning capabilities can also be put to use in fields other than security: many problems more related to operations exist in airports and require an answer.



The project brochure (see Figure 9 of the attachment) has been, together with the web site, the banner of the dissemination activities. The brochure has been distributed at all the events in which SAMURAI project staff have participated. The figure depicted on its front has become a distinguishing feature of the project, and its easy recognisability has made it well known in the EC project community.

Scientific and technical publications

As the considerable number of peer-reviewed publications generated by the consortium Partners indicates, producing a working SAMURAI system is a challenging problem. Existing knowledge and techniques in the literature were, on their own, insufficient to build such a working system.
The SAMURAI idea of an adaptive visual inference system has required not only a study and improvement of available knowledge, but also innovative extensions and investigation of new paths.

In particular, novel advancements are necessary and have been made on the following topics:

Behaviour analysis: this is still a fundamental research issue and the basis for SAMURAI implementation. The articles produced expand the knowledge base either through the application of new techniques or measurement of performance.

Another major theme of the research, and of the future applicability of learning mechanisms, is the investigation and development of supervised/unsupervised approaches. The issue is quite relevant to the security community (both scientific and industrial) for different reasons: on the scientific side the theme is challenging and still represents an open research topic, while companies are seeking the most economical way to deal with practical everyday problems and desire fully unsupervised learning systems. The results from SAMURAI approach this problem both from a pragmatic point of view and by proposing novel methodologies. The abundance of published work is justified here by the many issues that emerged from the application domain of SAMURAI.

Necessary to SAMURAI's success is the modelling of inter-relations between people and groups, and also the study of mechanisms for guaranteeing that people are successfully identified between connected environments covered by different cameras (spatial and temporal behavioural context). This re-identification issue is well recognised by the scientific community and very challenging. Many problems had to be investigated further to reach an approach sufficiently robust to address the project requirements.

New ground has also been covered on ethical issues. Testimony of this is the accepted SAMURAI chapter contribution, written by the participants in the SAMURAI ethical committee, to the forthcoming book “Effective Surveillance for Homeland Security: Balancing Technology and Social Issues”.

Lastly, it is worth pointing out that the proposed SAMURAI methodological approach has required a considerable effort in convincing potential users and the technical community of its validity. This is attested by the number and nature of seminars. The technical discussion and the titles of the different contributions demonstrate the breadth of the methodologies considered and their limitations, in support of the development of more versatile solutions (not yet completely available for commercial use).

The detailed scientific publications list can be found later in this report.

Exploitation of Results

The main exploitation activities of the project partners are detailed below.

Queen Mary – University of London (QMUL)

• A joint development programme through QMUL spin-off Vision Semantics Ltd for building operational demonstrators for a distributed multi-camera tracking system, working with UK government and a major industrial engineering integrator.

• Through QMUL spin-off Vision Semantics Ltd, some of the research developed from SAMURAI is to be applied to a new collaborative project under the EU FP7 Transport Programme.

• Exploitation of SAMURAI research in a new project developing technologies for automated recognition of human body language (funded by the UK government).

• Exploitation of the SAMURAI distributed multi-camera system platform for the development of an adaptive target super-resolution system through QMUL spin-off Vision Semantics Ltd (funded by the UK MOD).

• Through QMUL spin-off Vision Semantics Ltd, exploitation of the SAMURAI multi-camera tracking capability for airport passenger experience management in a joint development programme with BAA utilizing a Queen Mary Innovation Grant to support on-site testing.

• Exploitation by QMUL spin-off Vision Semantics Ltd of a feasibility study on video semantic tagging funded by the UK MOD.

• Feasibility study by QMUL spin-off Vision Semantics Ltd on mobile platform video content analysis through a UK MOD funded project.

• Software development by QMUL spin-off Vision Semantics Ltd for wearable camera based human action and intent recognition through a US DOD funded project.

• Software development by QMUL spin-off Vision Semantics Ltd for human intention profiling in public spaces through an Innovation China UK Seeding Fund project.

Universita` di Verona (UniVr)

UniVr is interested in obtaining the improved results related to:
• Pedestrian detection: Detecting pedestrians in both indoor and outdoor scenes under different lighting conditions, particularly in crowded scenes and under partial occlusion, and separating people who are positioned close to each other, is challenging enough for research purposes in both academia and industry.

• People tracking: The ability to track people and their movements, to handle occlusions, and to recover a lost target so as to obtain as accurate a trajectory as possible plays an important role in many computer vision applications. For controlling the security of a large area monitored by a network of sensors, people tracking is a basis for re-identification algorithms, enabling autonomous surveillance. It is also important for analysing a subject over a period of time.

• Re-identification: This is the capability of identifying a subject across several fields of view. When a tracked object exits the field of view of the current camera and, after a variable period of time, enters the field of view of a given camera, it should be detected automatically by detection algorithms. Re-identification algorithms should then be able to decide automatically that these two objects, i.e. the tracked one and the detected one, are actually the same.

• Implementation of some heavy computations at hardware level: Using FPGA and Embedded systems for implementing some time-consuming algorithms at hardware level to achieve a high-speed running of the approaches
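
The re-identification decision described in the list above (deciding whether a newly detected object is the same as a previously tracked one) can be illustrated with a minimal sketch. It assumes each person is summarised by an appearance descriptor, here a plain list of numbers such as a colour histogram, and matches by cosine similarity against stored tracks. The descriptor format, the `threshold` value and the function names are assumptions for illustration and do not describe the algorithms developed in SAMURAI.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length descriptors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def re_identify(probe: List[float],
                gallery: Dict[str, List[float]],
                threshold: float = 0.9) -> Optional[str]:
    """Return the id of the best-matching stored track, or None.

    `gallery` maps track ids (objects that left an earlier camera view) to
    their descriptors; `probe` is the descriptor of the new detection.
    A match is only declared if the similarity reaches `threshold`.
    """
    best_id: Optional[str] = None
    best_score = threshold
    for track_id, descriptor in gallery.items():
        score = cosine_similarity(probe, descriptor)
        if score >= best_score:
            best_id, best_score = track_id, score
    return best_id


if __name__ == "__main__":
    gallery = {"track_1": [1.0, 0.0, 0.0], "track_2": [0.0, 1.0, 0.0]}
    print(re_identify([0.9, 0.1, 0.0], gallery))  # close to track_1
    print(re_identify([1.0, 1.0, 1.0], gallery))  # ambiguous, below threshold
```

Returning no match when the similarity is below the threshold reflects the fact that a new detection may be a person who has not been seen before; in practice appearance changes between camera views make the choice of descriptor and threshold the hard part of the problem.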

Other exploitation actions

• Developing and strengthening our testbeds and functions related to different aspects of the SAMURAI project in order to improve our research facilities and possibilities necessary for the potential future industrial and research projects, since the above-mentioned items are very common among different applications of computer vision.

• Identification of research streamlines and practical challenges needed for industrial applications.

• Identification of collaborators in universities or industry for possible future joint work.

Elsag Datamat (ED)

The contribution of ED team to the exploitation of SAMURAI project will be performed along the following lines:

• Identification and description of specific application scenarios, corresponding to the most relevant opportunities in the market.

• Definition of system evaluation procedures to compare the new proposed approach against the quickly evolving innovative solutions emerging from the international market and the products proposed by the most advanced competition

• Integration of some specific subsystem and technical components into the ED proprietary video surveillance platform to demonstrate some relevant technology progress over the state of the art (i.e. multi-camera calibration, human re-identification and video search)

• Identification of specific collaborative customers (railway companies or highway operators) with whom to arrange a possible pilot project for testing some of the advanced processing functions that have emerged from the SAMURAI project

The exploitation planning by ED can be classified in two steps, according to the time frame of the planned initiatives. A short-term plan will be implemented, starting already during the last months of project development, to make use of all the available results. A long-term plan is also devised, to envisage a careful coordination with the business department, aiming to achieve a significant improvement of the position of the company in the security market.

Waterfall Solutions Ltd (WS)

WS proposes to exploit the research and development conducted under WP5 of the SAMURAI project using the following approach:
• Demonstrate real-time fusion system prototype to potential end-users, collate feedback and refine system further (short-term)

• Evaluate the integration of SAMURAI-derived fusion algorithms and concepts into the existing WS technology base (short-term)

• Develop advanced situational awareness systems which can be tailored to different applications and use-cases according to end customer's needs (medium term)

• Build on the algorithms and concepts developed under SAMURAI to provide additional capability and performance (long-term)

Borthwick-Pignon (OÜ) (BPS)

BPS's role in the SAMURAI project has enabled the development of a nomadic or PDA-type portable module capable of recording, pre-processing and transmitting audio/video signals captured by wearable cameras and other types of sensors, to facilitate the focus of attention with man in the loop in a semi-automated surveillance environment.
As a result of this development work, BPS is sole owner of the generic product with codename Ninja, consisting of:

Ngine - Mobile Dual Core Processing Platform


Ninja - wearable hands free security audio visual recording system.

Although initially developed for processing of audio and video data with respective metadata, Ninja can be seen as a more generic platform applicable for many areas of use.
Further to video surveillance, Ninja can be used as a data receiving, collecting, organizing and sending platform in monitoring systems of IT equipment, public and private transport vehicles, personal healthcare and many other solutions where optimal low power or mobile data collecting, organizing and sending capabilities are needed.
BPS's strategy towards the second objective is to develop, with new partners, a full-scale infrastructure solution which can serve as a generic platform for deployment of Ninja units.

Deployment of such generic platforms will allow BPS to follow a two-fold strategic path.

• In areas where there are already established markets and deployed systems, BPS will aim to co-operate with current market players by making its generic platform interoperable with the existing deployed systems.

• In areas where there are no well established markets or deployed systems, BPS aims, via cooperation, to market and offer products or services based on its generic platform by developing and deploying new functionalities. We are cooperating with major partners on a new Homeland Security smart container programme to deploy Ngines with sensors detecting hazardous materials in the handling area.

This two-fold strategic path gives BPS wide flexibility in using its generic platform in different fields.

Three potential areas of commercial interest (wireless interoperability, traffic safety and monitoring of IT systems) are highlighted as possible routes for BPS Ngine solutions into new markets.

The following partners for cooperation have been found so far:

• CELTIC Federated Testbed for Public Safety Communications, in the field of wireless interoperability; Ninja is being supplied under agreement.

• Mohlaleng of South Africa in the field of monitoring of IT systems.

In the above fields the core of BPS's value proposition is not to compete with current market players, but to leverage their existing capabilities and revenue streams by providing technologically advanced solutions.

In the field of traffic safety, BPS's intention is to participate in possible large scale trials by providing its Ninjas and, if necessary, middleware. Such participation would help BPS to improve its solution for traffic safety purposes and to demonstrate its usefulness in that field.

Esaprojekt SP. ZOO (ESA)

ESAPROJEKT plans to use the results of SAMURAI in its future activities and business. To do so, ESAPROJEKT is going to:
• Demonstrate 3D model and User Interface developed during the Samurai project (short term)

• Incorporate some of the key functionalities of the Samurai GUI into the products ESAPROJEKT offers to its clients (medium term)

• Cooperate with other partners of the Samurai project to provide advanced solutions to ESAPROJEKT's clients (long term)


BAA hopes to exploit Samurai research results by engaging with the aviation industry, academia, technology partners and airport partners to encourage stakeholders to collaborate in taking this research out of the laboratory and into frontline operations.

To achieve this we are taking a long term and flexible view of developing products that are fit and robust for operational purposes, utilising BAA’s extensive experience in new technology pilot programmes to demonstrate the full potential of research in a controlled and cost effective environment. BAA Heathrow Airport and Heathrow Express have developed a number of business problem scenario briefs which, if addressed in a collaborative and cost effective way, would lead to a profound and positive impact on airport and rail management.

In pursuance of these objectives the following exploitation strategies are recommended:

• Samurai Industrial partners together with BAA develop a prioritised and costed product development road map for key high value project technologies.

• Develop commercial cradle-to-grave models to support and enable technology transfer from the research lab to the frontline operation.

• Develop exploitation strategies to demonstrate productised Samurai technologies in a joint venture between industrial product developers and BAA using a BAA airport as a reference site.

• Explore incentivising end users to purchase productised Samurai technologies and introduce them into their technology procurement programmes.

• Under the auspices of the EU FP7 Transport programme, further develop the Samurai research with BAA Heathrow Express into novel transport safety technologies in collaboration with industrial partners.

• Exploitation of Samurai vehicle/driver detection technologies by developing joint marketing strategies with industrial partners and end users.

• Commission a feasibility study into integrating detection and mobile technologies into various existing end user CCTV networks.

Contact: Professor Shaogang Gong, Queen Mary, University of London
Tel: +44 (0) 20 7882 5249
Fax: +44 (0) 20 8980 6533