CORDIS - EU research results
Content archived on 2024-05-24

Annotated Digital Video for Surveillance and Optimised Retrieval

Achievements

A module that connects to an ADVISOR system through a network connection and a well-defined software interface using CORBA and network socket protocols. Its main data inputs are time-stamped JPEG-encoded video frames and regularly updated images of an estimated background. At the core of this module is a DSP board (Sollatek's STM1300 board) that handles data for two cameras at five frames per second. The DSP module contains software that analyses each image and detects situations of interest (normally indicative of potentially dangerous situations), including overcrowding, congestion, abnormal directions of movement, excessive stationarity and intrusion. The algorithms for detecting these events were first developed in the EU CROMATICA project and refined in ADVISOR. The module has been tested in interior circulation areas of metropolitan railway environments. The DSP software is wrapped by Windows 2000 compatible software (currently handling two DSP boards) that interfaces it to the rest of the system: it sends annotations to higher-level modules (which decide how to use this information), handles the acquisition and decompression of JPEG image streams, and configures working parameters such as alarm thresholds and areas of interest. The module has been integrated into the rest of the ADVISOR system and demonstrated.
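To make the data flow concrete, the following is a minimal sketch of the kind of per-frame event annotation such a wrapper might forward to the higher-level modules. All names, fields and thresholds here are illustrative assumptions, not the actual ADVISOR interface.

```python
# Hypothetical sketch of the per-frame event annotations the DSP wrapper
# could forward; names and fields are assumptions, not the ADVISOR schema.
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class EventType(Enum):
    OVERCROWDING = "overcrowding"
    CONGESTION = "congestion"
    ABNORMAL_DIRECTION = "abnormal_direction"
    EXCESSIVE_STATIONARITY = "excessive_stationarity"
    INTRUSION = "intrusion"

@dataclass
class Annotation:
    camera_id: str
    timestamp: datetime
    event: EventType
    confidence: float   # detector score in [0, 1]
    region: tuple       # (x, y, w, h) area of interest in the frame

def check_thresholds(camera_id, timestamp, scores, thresholds, region):
    """Emit one annotation per event whose score exceeds its alarm threshold
    (alarm thresholds being among the configurable working parameters)."""
    return [
        Annotation(camera_id, timestamp, event, score, region)
        for event, score in scores.items()
        if score >= thresholds[event]
    ]
```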
In the context of the ADVISOR project, multiple cameras with overlapping FOVs (fields of view) are used. To increase robustness, behaviour recognition first relies on low-level motion detection and a frame-to-frame tracker, which generates a graph of moving regions for each camera. These moving regions are classified into several classes, making it possible to determine whether or not a region corresponds to the image of a person or a group of persons. To take advantage of all the cameras observing the same scene, a fusion mechanism combines the graphs computed for each camera into a global one. This global graph is then used for long-term tracking of groups of people and of isolated persons moving through the scene. Finally, the results of the group and individual tracking are used by the behaviour recognition module, which recognises predefined scenarios corresponding to specific group and individual behaviours (such as violence or pickpocketing in cluttered scenes like the ADVISOR metro scenes). Using multiple cameras, it is possible to correct some errors of earlier modules (such as an incorrect 3D position of persons or groups) and to complete partial descriptions (such as moving regions that could not be classified because of insufficient information).
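The sketch below illustrates one simple way such a fusion step could work, assuming each moving region already carries an estimated 3D ground-plane position. The merge rule (group regions whose positions lie within a distance threshold) is an illustrative assumption, not the ADVISOR algorithm itself.

```python
# Sketch: fuse per-camera moving-region graphs into one global set of
# objects by grouping regions with nearby world positions (assumption).
import math

def fuse_regions(per_camera_regions, max_dist=0.5):
    """per_camera_regions: dict camera_id -> list of (region_id, (x, y))
    in world coordinates (metres). Returns groups of regions judged to be
    observations of the same person or group."""
    flat = [(cam, rid, pos)
            for cam, regions in per_camera_regions.items()
            for rid, pos in regions]
    groups = []
    for cam, rid, pos in flat:
        for group in groups:
            # greedy rule: join a group if any member is close enough
            if any(math.dist(pos, p) <= max_dist for _, _, p in group):
                group.append((cam, rid, pos))
                break
        else:
            groups.append([(cam, rid, pos)])
    return groups
```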
The ADVISOR behaviour module is able to recognise predefined scenarios using the output of several modules. A frame-to-frame tracker links the moving regions found at time t with those found at the previous (and following) time instants. This information is used to build a first coarse temporal and spatial connection between moving regions. Two long-term trackers (one for individuals, that is, isolated persons, and the other for groups) reconstruct trajectories over a longer period of time, computing at the same time useful information such as speed or level of agitation. A crowd monitoring module provides information about overcrowding and the direction of crowd motion. Another input to the behaviour recognition module is a priori knowledge of the structure of the filmed scene. This a priori knowledge (the context description) takes the form of a 3D description of the geometric structures of the scene (walls, objects, doors, etc.) plus additional information about object and zone properties. The behaviour recognition module uses all these outputs as input to analyse what is happening in a scene. It is able to recognise a wide set of complex scenarios describing actions and behaviours that can occur in image sequences, such as "people jumping over barriers", "fighting in progress" or "vandalism against equipment". The recognition is performed incrementally: each scenario is split into events (which have to be recognised separately) and temporal constraints (which have to be fulfilled by the events). This is the first time that such recognition of complex spatio-temporal scenarios has been performed in real time and on real scenes. INRIA has shown the efficiency of the behaviour recognition module, which has been demonstrated and evaluated on four long videos (lasting several hours) containing both interesting and normal behaviours. It has also been demonstrated on live video at a Barcelona metro station. The module has a high rate of correct recognition (89%) on the recorded sequences: INRIA managed to recognise 25 blocking behaviours out of 27, 30 fighting behaviours out of 35, 7 jumping-over-barrier behaviours out of 8, 4 vandalism behaviours out of 4 and 2 overcrowding behaviours out of 2. It also has a low rate of false alarms (6.5%) on the recorded sequences: there were no false alarms for the blocking, vandalism and overcrowding behaviours, 1 false alarm for the jumping-over-barrier behaviour and 4 false alarms for the fighting behaviour. There are still limitations due to the ability to model scenarios (e.g. the description of accurate gestures is not yet formalised), the position of cameras (e.g. an action only partly in the field of view of the cameras may not be detected) and tracking errors (e.g. tracking an individual in a crowd is still an open issue).
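The following sketch illustrates the incremental recognition idea described above: a scenario expressed as a sequence of sub-events plus temporal constraints between them, matched step by step against the event stream. The scenario definition shown ("fighting" built from three made-up sub-events) is purely illustrative.

```python
# Sketch of incremental scenario recognition: a scenario is an ordered list
# of (sub_event, max_gap_seconds) pairs; the gap constrains the delay since
# the previously matched sub-event. Sub-event names are invented examples.
def recognise(scenario, observed_events):
    """observed_events: time-ordered list of (event_type, timestamp).
    Returns True once every sub-event has matched under its constraint."""
    step, last_t = 0, None
    for etype, t in observed_events:
        wanted, max_gap = scenario[step]
        if etype != wanted:
            continue
        if last_t is not None and (t - last_t) > max_gap:
            continue  # temporal constraint violated; keep waiting
        last_t = t
        step += 1
        if step == len(scenario):
            return True
    return False

# e.g. a toy "fighting" scenario: a group forms, agitation rises within
# 10 s, then erratic motion follows within 5 s.
fighting = [("group_formed", float("inf")),
            ("high_agitation", 10.0),
            ("erratic_motion", 5.0)]
```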
The ADVISOR platform is physically composed of several computing units, independent of one another. This solution is more and more widely used because of the computing power it offers. In this framework, it is necessary to split the software platform into several independent modules working in series: each receives the results of the previous module(s) and sends its own results to the following ones. Data exchange between modules becomes a main issue, since it is impossible (or at least difficult) for the algorithms to share a memory space and/or data structures. In the ADVISOR project, the XML format was chosen for data exchange because of its flexibility, its standardisation and the ease with which it can be shared between different systems and programming languages. Moreover, XML data, organised as a hierarchy, can be stored directly on disk and browsed with commercial browsers. The segmentation/classification/frame-to-frame tracking module sends its outputs to the long-term tracking/behaviour recognition module in XML format. The crowd monitoring module sends its output to the tracking/behaviour recognition module in XML format. The annotations coming from the behaviour recognition module are in XML form and are recorded into the archive in XML form. Moreover, the context description (a priori knowledge of the scene in the form of a 3D geometric description of walls, objects, doors, etc., plus additional information about object and zone properties), used by different modules of the platform, is written in XML.
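As a concrete illustration of module-to-module XML exchange, the sketch below serialises a behaviour annotation with Python's standard library. The element and attribute names are assumptions chosen for illustration; the actual ADVISOR schema is not documented here.

```python
# Sketch: serialising a module's output as XML for the next module in the
# chain. Element/attribute names are invented, not the ADVISOR schema.
import xml.etree.ElementTree as ET

def annotation_to_xml(camera, frame, event, confidence):
    root = ET.Element("annotation", camera=camera, frame=str(frame))
    ev = ET.SubElement(root, "event", type=event)
    ET.SubElement(ev, "confidence").text = f"{confidence:.2f}"
    return ET.tostring(root, encoding="unicode")

print(annotation_to_xml("C12", 4571, "fighting", 0.91))
# -> <annotation camera="C12" frame="4571"><event type="fighting"><confidence>0.91</confidence></event></annotation>
```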
A software module is under development to implement the missing link between the image processing techniques used to analyse crowd flow and the behaviour analysis of the extracted crowds. The module is based on machine learning techniques that incrementally create a statistical model for each scene (camera), time of day and day of the week. The learning technique makes use of crowd information (speed, direction of motion, knowledge about moving regions and their locations) to create a set of states. A preliminary analysis of a sample of data was carried out to identify the main directions of crowd flow in one of the selected test sites. The model can be used to reproduce the typical dynamics of a scene, identify anomalous situations and classify the type of scene. It can also be used to predict evolving scenes and identify regions of interest in camera views, such as hot spots (cash points, access points and gathering points). The innovative aspect of the work is the incremental way in which information about crowd flow can be learnt, based on simple image processing techniques that extract the velocity of the flow. The information is extracted in the locality of defined blocks (rectangular portions of the image), while the output is a statistical model that captures the activity of a scene view from a global perspective. Expected benefits include:
- the capability of generating simulations of the scene based on the built model;
- use of the model to classify scenes, based on the type of flow, time of day, dynamics, etc.;
- use of the model to predict the evolution of crowd flow given the current dynamics.
The use of statistical models to describe a stochastic process such as the dynamics of a crowd scene, or the typical trajectories of vehicles and pedestrians in public spaces, is very recent. What makes this implementation distinctive is the use of vector field dynamics to produce a global model of the scene dynamics.
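The sketch below shows one plausible form of the incremental, per-block learning described above: each rectangular block accumulates a running mean of the observed flow vector, keyed by camera and time slot. The simple running-average update is an illustrative assumption, not the module's actual statistical model.

```python
# Sketch: incremental per-block crowd-flow statistics, keyed by camera,
# day of week, hour and block index (assumed model structure).
from collections import defaultdict

class FlowModel:
    def __init__(self):
        # (camera, weekday, hour, block) -> [count, mean_vx, mean_vy]
        self.stats = defaultdict(lambda: [0, 0.0, 0.0])

    def update(self, camera, weekday, hour, block, vx, vy):
        s = self.stats[(camera, weekday, hour, block)]
        s[0] += 1
        s[1] += (vx - s[1]) / s[0]   # incremental mean update
        s[2] += (vy - s[2]) / s[0]

    def typical_flow(self, camera, weekday, hour, block):
        """Mean velocity vector learnt so far for this block and time slot;
        large deviations from it could flag anomalous situations."""
        _, mvx, mvy = self.stats[(camera, weekday, hour, block)]
        return mvx, mvy
```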
Along with the operational system, which processes data from the pixel level up to the recognition of behaviours, the Orion team conceived a configuration file for semi-automatic setup. A configuration file is defined for each camera; it contains the 3D scene model specifying the zones of interest and the calibration of the camera. Even though improvements remain to be made to ease the setup process, the ADVISOR system was successfully configured for five cameras in both the Brussels and Barcelona metros.
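Purely as an illustration of what such a per-camera file might contain, the sketch below combines calibration data with zones of interest and scene objects. Every field name and value here is invented; the real ADVISOR configuration format is not documented in this summary.

```python
# Hypothetical per-camera configuration: calibration plus a 3D scene model
# with zones of interest. All names and values are invented for illustration.
camera_config = {
    "camera_id": "BCN-M3-Platform-2",
    "calibration": {
        # 3x4 projection matrix mapping world coordinates (metres) to pixels
        "projection": [[1200.0, 0.0, 640.0, -300.0],
                       [0.0, 1200.0, 360.0, -150.0],
                       [0.0, 0.0, 1.0, -2.5]],
    },
    "zones_of_interest": [
        {"name": "platform_edge", "polygon": [(0, 0), (12, 0), (12, 1), (0, 1)]},
        {"name": "ticket_barrier", "polygon": [(3, 8), (6, 8), (6, 10), (3, 10)]},
    ],
    "scene_objects": [
        {"type": "wall", "from": (0, 0), "to": (0, 15)},
        {"type": "door", "from": (5, 15), "to": (7, 15)},
    ],
}
```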
The value of archiving video sequences coming from one or several video surveillance cameras is greatly increased by the availability of a search tool able to quickly retrieve, from the whole database, the sequences that fulfil given query constraints. Such a search tool relies on the possibility of associating with each image or sequence a description of the activities (events or scenarios) observed by the cameras. The behaviour recognition module developed by INRIA for the ADVISOR project produces annotations describing these activities. The annotations, in XML form, carry an indication of the images they refer to and are sent to the archive module, which stores them together with the images and uses them to answer queries. In this manner, each recognised behaviour becomes part of the archive and can be retrieved, with the corresponding sequence, by a simple query describing its content. This textual information can be seen as an abstract of the stored videos and can be used to compute statistics on events occurring over a long period of time.
In the ADVISOR system, raw images and their associated annotations are stored in an archive server for later retrieval by surveillance operators. In the retrieval process, multiple search criteria are used to select appropriate image sequences. When an operator selects an image sequence for playback, he or she may use VCR-like commands to display the image stream on a monitor (e.g. play forward/backward, fast forward/backward, step forward/backward, pause, stop). The operator's graphical user interface runs on a remote workstation, and a specific client/server protocol supports the dialogue between this workstation and the archive server. This result shows that XML can significantly ease the implementation of client/server protocols where structured data are to be exchanged. However, general-purpose XML tools (e.g. XML parsers and generators) should be carefully tailored to closely match the protocol definition in order to avoid excessive processing overhead.
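To illustrate such an XML-framed dialogue, the sketch below builds a VCR command request as a client might send it to the archive server. The message layout, element names and session field are assumptions for illustration, not the actual ADVISOR protocol.

```python
# Sketch of an XML-framed VCR command in a client/server playback dialogue.
# Message layout is assumed, not the ADVISOR protocol definition.
import xml.etree.ElementTree as ET

def vcr_request(session, command, speed=1):
    """command: one of play, play_backward, fast_forward, fast_backward,
    step_forward, step_backward, pause, stop."""
    req = ET.Element("request", session=str(session))
    ET.SubElement(req, "vcr", command=command, speed=str(speed))
    return ET.tostring(req, encoding="unicode")

print(vcr_request(42, "fast_forward", speed=4))
# -> <request session="42"><vcr command="fast_forward" speed="4" /></request>
```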
The search criteria used for retrieval draw on information available either in the JPEG image header (e.g. spatio-temporal data such as camera name and location, date and time) or in the associated annotations (e.g. event type or alarm type). A further result shows that, to achieve a good level of performance, it is necessary to store the data used for searching and the data used for streaming in separate spaces. The data used for searching are naturally stored in a relational database system, while the raw data (i.e. images and annotations) are stored on separate volumes of the server's file system. Furthermore, an appropriate naming policy for the raw data objects should be carefully defined to maintain their chronological order for streaming purposes.
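One simple naming policy with the required property is to embed a zero-padded UTC timestamp in the file path, so that lexicographic order matches chronological order. The directory layout below is an assumption chosen to illustrate the idea, not the ADVISOR naming scheme.

```python
# Sketch: a naming policy that keeps raw image files in chronological order
# for streaming; zero-padded UTC timestamps sort lexicographically in time
# order. Directory layout is an illustrative assumption.
from datetime import datetime, timezone

def frame_path(camera_id, ts: datetime, frame_no: int) -> str:
    # e.g. video/BCN-M3/2003/06/12/20030612T141503.217_000042.jpg
    ts = ts.astimezone(timezone.utc)
    return (f"video/{camera_id}/{ts:%Y/%m/%d}/"
            f"{ts:%Y%m%dT%H%M%S}.{ts.microsecond // 1000:03d}_{frame_no:06d}.jpg")
```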
Moving from single-camera interpretation platforms (like those developed by the Orion team before ADVISOR) to a multi-camera interpretation platform (like the ADVISOR one) requires a great deal of software engineering effort to modify "classic" algorithms and to develop new ones. First, the Orion team modified the algorithms of the segmentation/classification/frame-to-frame tracking module. For example, the same thread (i.e. the motion detector thread) has to be able to work, in successive time slots, on data coming from different cameras without corrupting the intermediate results of one camera while processing another's data. The developed algorithms are able to cope with several cameras (four in the ADVISOR project). In addition, the team modified the last modules of the platform (long-term tracking and behaviour recognition) to take into account intermediate results coming from several cameras at the same time, merging them into a unique symbolic representation. To do so, the Orion team buffers the input of each algorithm in order to synchronise the data and handle lost data (e.g. when a frame is missing due to network problems), as in the sketch below.
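The following sketch shows one way such input buffering could synchronise per-camera results: results are queued per frame time and released only when every camera has delivered, or when a timeout marks missing data as lost. The timeout mechanism and class interface are illustrative assumptions.

```python
# Sketch: buffering per-camera results so downstream modules receive one
# synchronised batch per frame time; lost frames (e.g. dropped on the
# network) are released as None on timeout. Interface is an assumption.
from collections import defaultdict

class SyncBuffer:
    def __init__(self, cameras):
        self.cameras = set(cameras)
        self.pending = defaultdict(dict)  # frame_time -> {camera: result}

    def push(self, camera, frame_time, result):
        self.pending[frame_time][camera] = result

    def pop_ready(self, frame_time, timed_out=False):
        """Return the synchronised batch once complete, or on timeout with
        the missing cameras' entries set to None."""
        batch = self.pending[frame_time]
        if set(batch) == self.cameras or timed_out:
            del self.pending[frame_time]
            return {cam: batch.get(cam) for cam in self.cameras}
        return None
```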
INRIA has demonstrated the value of Artificial Intelligence techniques in developing an operational real-time system. The knowledge-based formalism defined in ADVISOR eased the description of the interesting behaviours defined by security experts. The learning techniques used in ADVISOR proved to be effective methods for tuning the parameters of complex modules. In particular, these techniques were well suited to the behaviour recognition module because of its large number of parameters and the large variety of behaviour instances.
The HCI provides an operator with a viewport into the ADVISOR system. It allows the operator either to monitor real-time events from cues provided by the behaviour recognition algorithms or to review recorded information using the retrieval facilities of the Archive System. Key features are:
- manual selection and simultaneous real-time display of images from cameras installed in different locations, in single- or multiple-window mode;
- a facility for the operator to annotate the images he or she is watching, with joint storage of these annotations together with the corresponding images in the archive server;
- a set of tools that allows the operator to search for interesting sequences in the archive server and to retrieve the relevant previously recorded images;
- various ways of alerting the operator to interesting situations occurring in the field of view of any monitored camera in the network;
- various ways of assisting the operator in efficiently managing incidents and other alarm situations.
The HCI is configurable in order to comply with customers' security management policies, and it is realised according to up-to-date ergonomic and aesthetic design rules. The ADVISOR HCI can be considered a functional system block applicable to any surveillance installation featuring digital capture and annotation of CCTV information. As such, it is expected to also be exploited outside the ADVISOR context by VIGITEC.
