Multi-Modal Situation Assessment & Analytics Platform

Final Report Summary - MOSAIC (Multi-Modal Situation Assessment & Analytics Platform)

Executive Summary:
The MOSAIC Platform captures and analyses multi-modal data intelligence, including video and text collateral. The distributed intelligence within the platform provides decision support for automated detection, recognition, geo-location and mapping, as well as intelligent decision support at various levels to enhance situation awareness, surveillance targeting and camera handover. These involve level-one fusion, together with situation understanding that enables decision support and impact analysis at levels two and three of situation assessment.
Accordingly MOSAIC has developed and validated:
i) A framework for capturing and interpreting the use-context requirements underpinned by a standard data ontology to facilitate the tagging, search and fusion of data from distributed multi-media sensors, sources and databases,
ii) A systems architecture to support wide area surveillance with edge and central fusion and decision support capabilities,
iii) Algorithms, including hardware-accelerated ones for smart cameras, which enable disparate multimedia information correlation to form a common operating picture, including representation of the temporal information and aspects,
iv) Tools and techniques for the extraction of key information from video, un-controlled text and databases using pattern recognition and behaviour modelling techniques,
v) Algorithms and techniques to represent decisions and actions within a mathematical framework, and how this framework can be used to simulate the effects of disturbances on the system,
vi) An integrated system solution based upon the proposed systems architecture and the above developed enabling technologies including techniques for tagging different multi-media types with descriptive metadata to support multi-level fusion and correlation of surveillance and other data intelligence from distributed heterogeneous sources and networks.
Because events can be pre-processed on the camera itself, unimportant events can be filtered out at source, which improves the efficacy of wide-area surveillance. This is reinforced by the MOSAIC decision support sub-system, which supports a more focused and targeted approach to surveillance: it advises on the required deployment of cameras and instructs already deployed cameras to shift attention or to enter a temporary sleep mode, thus further reducing network traffic. MOSAIC has thereby addressed the following needs, which currently limit the use of Video Analytics:
1) Minimise False Alerts
2) Create more easily maintainable systems
3) Ensure that surveillance systems are not too costly
i) to deploy
ii) to operate and
iii) to maintain
4) Establish a sustainable business case and revenue model for both the surveillance technology suppliers and adopters.
5) Minimise Network (data) Traffic, bandwidth demand, arising from Video surveillance deployment e.g. by intelligently streaming relevant video frames only to endpoints that have subscribed to those events.
6) Minimise the required human attention bandwidth in using VA based security surveillance systems
7) Minimise the storage requirements; particularly as associated with wide-area video surveillance
At the situation assessment and decision support level, MOSAIC integrates advanced criminal and social network analysis with text and data mining techniques to provide actionable intelligence to the police analyst and to support their decision making by highlighting any need for further surveillance and monitoring.
Project Context and Objectives:
The advent of the digital age has produced unparalleled access to a proliferation of multi-media data. Text, video, sound and images are now all easily accessible both at a public and private domain level. Additionally, large databases of information exist, from libraries and telephone directories in the public arena, to police, border and security records in the private domain. In more recent years there has been a rapid increase of video surveillance systems deployed by a multitude of disparate organisations.
Video and other surveillance data (e.g. text, audio) is being collected and recorded at an ever increasing rate – and yet at present, technology only facilitates a simple search of the information based upon a few key strings. To facilitate this search, raw data must be manually ‘tagged’ by the user in a non-standardised way. Context (such as location, time of day, direction of view) of the information is generally ignored along with any other associated information or data sources.
A further challenge to be addressed is how temporal information can be captured and utilised when exploring the correlation and fusion of distributed data.
Video Analytics (VA), one of the main pillars of surveillance technology, originated from motion detection systems which were rooted in the automated inspection market and largely based on blob analysis. VA has a crowded marketplace, with hundreds of providers claiming to offer solutions, some of which may be far from sufficiently robust to allow problem-free, cost-effective, reliable and maintainable deployment. VA aims to detect every user-defined event of interest in video content. The best VA techniques ignore the noise floor (i.e. non-events) such as normal scene changes in a camera view, motion due to weather (e.g. snow, rain) and tree foliage, or other non-events such as a pet moving around. This capability is tied to the architectural framework for the integration of VA, which should aim to optimise where intelligence, fusion and decision making are deployed (at the edge, i.e. at the camera, or at the central processor) to suit the particular objectives of the application domain. This is a very important feature of the architectural design, as the balance between local (edge) and central processing is closely linked to several other desirable performance characteristics relating to efficiency and cost-effectiveness, such as bandwidth economy in network traffic and reductions in operator cognitive load, false-alert processing effort, storage costs, data protection management load, personnel costs and infrastructural resource costs.
VA capabilities range from simple motion detection to sophisticated algorithms for the detection of people, vehicles, objects and their behaviour or interactions, and cover pre- and post-incident data intelligence/forensics, physical security and business intelligence applications, e.g. perimeter violations, crowd detection and monitoring, detection of left objects and lurking/loitering people, dwell time video analytics, detection of directional motion, and footfall analysis for retail marketing.
Infrastructural and architectural challenges for surveillance technologies deployment

The issues limiting the use and growth of surveillance analytics, with Video Analytics as an important modality, include the need to:

1) Minimise False Alerts
2) Create more easily maintainable systems
3) Ensure that surveillance systems are not too costly i) to deploy ii) to operate and iii) to maintain
4) Establish a sustainable business case and revenue model for both the surveillance technology suppliers and adopters.
5) Minimise Network (data) Traffic, bandwidth demand, arising from Video surveillance deployment e.g. by intelligently streaming relevant video frames only to endpoints that have subscribed to those events.
6) Minimise the required human attention bandwidth in using VA based security surveillance systems
7) Minimise the storage requirements; particularly as associated with wide-area video surveillance
MOSAIC supports enhanced control of surveillance systems (e.g. camera-based), thus allowing the system to focus on certain areas of interest as evidenced by the provided information. To that extent it aims to advance the ability of users to search, create and understand information by correlating and fusing data from distributed sources.
Complexity in Situation Awareness/Assessment
Situation Assessment (Awareness) at various levels is an intrinsic process of all intelligence, including surveillance data intelligence. It presents a complex problem space arising from a) inferring relationships, b) inferring the states of elements on the basis of estimates of their relationships, c) recognising or classifying situations on the basis of estimates of constituent elements and their relationships, and d) the fact that relationships are not necessarily observable, logical or semantic ones, in that they may range from physical to functional, conventional and cognitive/abstract relationships. The complexity of situation assessment fundamentally arises from:
- Uncertainty in spatio-temporal evidence,
- Uncertainty in ontological evidence,
- Uncertainty in causality and latent associations,
- Uncertainty in the representation of relationships,
- Uncertainty in hypothesis generation, representation and ranking,
- The need for efficient schemes for the search, evaluation and selection of hypotheses,
- Architectural issues: ontology (including fuzzy memberships and probabilistic dependencies),
- Model management, blackboarding, information partitioning,
- Man-in-the-loop inferencing integration, interactivity, presentation and control.

Recognition/classification inferencing is the fundamental capability exercised in situation assessment and it largely involves machine vision applied to the video patterns space. Recognition/classification-based inferencing also depends on prior models and presumes a process for generating, evaluating and selecting such models.
The problem is made more difficult when the performance of information sources cannot be assumed. This is the case in network-centric operations, in which calibration and registration are not easily performed. Information systems can involve various degrees of noise and bias errors and various levels of performance characterisation. In the context of distributed video data capture, device characterisation, data alignment, calibration, normalisation and data evaluation represent important challenges.
This means normalising and linking information in order to provide the complete picture.
Correlation refers to the following fundamental problem:
- Given ‘n’ reports from one or more sources
- How to determine which reports refer to the same underlying entity

Situation Assessment at various levels includes Correlation and Data Fusion at different levels of sub-symbolic and symbolic (semantic) abstraction. Data fusion is generally defined as the use of techniques that combine data from multiple sources and gather that information in order to achieve inferences that are more efficient and potentially more accurate than those achieved by means of a single source.

To address the above issues, the primary objective of the MOSAIC project was to:
Facilitate the correlation of multi-media data from disparate data sources to form contextual and valuable information (the information whole being greater than the sum of its parts), and thus to enable targeted surveillance.

The MOSAIC Platform captures and analyses multi-modal data intelligence, including video and text collateral. The distributed intelligence within the platform provides decision support for automated detection, recognition, geo-location and mapping, as well as intelligent decision support at various levels to enhance situation awareness, surveillance targeting and camera handover. These involve level-one fusion, together with situation understanding that enables decision support and impact analysis at levels two and three of situation assessment.
Accordingly MOSAIC has developed and validated:
i) A framework for capturing and interpreting the use-context requirements underpinned by a standard data ontology to facilitate the tagging, search and fusion of data from distributed multi-media sensors, sources and databases,
ii) A systems architecture to support wide area surveillance with edge and central fusion and decision support capabilities,
iii) Algorithms, including hardware-accelerated ones for smart cameras, which enable disparate multimedia information correlation to form a common operating picture, including representation of the temporal information and aspects,
iv) Tools and techniques for the extraction of key information from video, un-controlled text and databases using pattern recognition and behaviour modelling techniques,
v) Algorithms and techniques to represent decisions and actions within a mathematical framework, and how this framework can be used to simulate the effects of disturbances on the system,
vi) An integrated system solution based upon the proposed systems architecture and the above developed enabling technologies including techniques for tagging different multi-media types with descriptive metadata to support multi-level fusion and correlation of surveillance and other data intelligence from distributed heterogeneous sources and networks.
Because events can be pre-processed on the camera itself, unimportant events can be filtered out at source, which improves the efficacy of wide-area surveillance. This is reinforced by the MOSAIC decision support sub-system, which supports a more focused and targeted approach to surveillance: it advises on the required deployment of cameras and instructs already deployed cameras to shift attention or to enter a temporary sleep mode, thus further reducing network traffic. MOSAIC thereby addresses the following needs, which currently limit the use of Video Analytics:
1) Minimise False Alerts
2) Create more easily maintainable systems
3) Ensure that surveillance systems are not too costly
i) to deploy
ii) to operate and
iii) to maintain
4) Establish a sustainable business case and revenue model for both the surveillance technology suppliers and adopters.
5) Minimise Network (data) Traffic, bandwidth demand, arising from Video surveillance deployment e.g. by intelligently streaming relevant video frames only to endpoints that have subscribed to those events.
6) Minimise the required human attention bandwidth in using VA based security surveillance systems
7) Minimise the storage requirements; particularly as associated with wide-area video surveillance

At the situation assessment and decision support level, MOSAIC integrates advanced criminal and social network analysis with text and data mining techniques to provide actionable intelligence to the police analyst and to support their decision making by highlighting any need for further surveillance and monitoring.

The following diagram provides an overview of the high-level concept of the system: Diagram 1.

Project Results:
The MOSAIC project has produced a significant number of modules and components that have been integrated into the MOSAIC framework.
Diagram 2
These include:
- Semantic Representation Framework
- Social/Criminal Network Analysis Tools
- Text/Data Mining Component (part of the SCNA tools set)
- Video Analytics Component
- Smart Camera Component (part of VA)
- Decision Support System

1.3.1 Semantic Representation Framework/ Data Model
The aim of the MOSAIC data model is to facilitate a semantic, “meaningful” representation of the data within the scope of the project. The goal is to use semantics to “represent, combine and share” data for use throughout the MOSAIC system. Semantic representation is generally not bound to a specific type of data model representation: data stored in tables or as tuples in relational databases can clearly be considered to have a semantic representation, in that column entries represent values with semantic properties ascribed, for instance, by the respective column label.
For the MOSAIC project, the main interest in selecting and implementing a semantic representation model has been the need to use a flexible representation model that can cater for and integrate quite different data sources (e.g. events extracted from video footage and entities extracted from textual police reports or database fields), and a model that can easily be changed, modified, updated and interlinked with other data models throughout the project as well as potentially after the project for commercialisation and further research and development. These requirements favour a more flexible representation system than for instance representation via a relational database.
Ontological data models have been used for some time in Computer Science to represent data in application scenarios with requirements similar to those outlined above, often in order to codify and exchange data between systems. An ontology can be defined as a “formal, explicit specification of a shared conceptualisation”; it provides a finite vocabulary of all elements that constitute a model and defines the relations between, and properties of, the defined elements. Ontological data models create and use a machine-readable and usually human-understandable model that represents the domain under consideration. They are relatively independent of the underlying technical management systems, usually more so than, for instance, relational databases, which may contain proprietary data formats or functions. Generally, ontologies also tend to be used with a focus on representing the application domain only, and data that is of interest primarily for a specific application is not usually managed as part of an ontology model.
Ontologies have been and are used in many domains, including in biological and medical research. Initiatives towards establishing a Semantic Web infrastructure for the Internet more generally aim to establish mechanisms that allow data providers to provide semantically machine-accessible data on the World Wide Web that can then be automatically processed and used in order to create new services or generate new data based on combining existing data sources in a machine-readable Web of Data. (See Herman for an introduction to the Semantic Web.) Ontologies form an important part of the Semantic Web “stack” of standards and services that has been proposed and for which standards and software implementations have been developed in order to advance towards the vision of a Web of Data. Importantly, Semantic Web initiatives have resulted in standards and generic systems that can be applied in many application domains and are available in implementations that can readily be used as part of state-of-the-art software applications following modern software architecture models.
The Semantic Web stack also provides data representation formats for the representation and exchange of data which can readily be added to suitable ontology-described data models.
All data in the MOSAIC data model is represented in the form of entities and relationships that can form data “triples”. Each of these triples represents a relationship as shown in Figure 1 below.
Diagram 3
Figure 1: Simple example of a subject-predicate-object triple

The figure shows two individuals (indicated above in clear ellipses), “Mike Miller” and “Steve Miller”, who are connected via a directed relation, called predicate, “is parent of”, which is read in the direction of the arrow (i.e. Mike Miller is a parent of Steve Miller).

Figure 2 again shows the relationship introduced in the previous figure between the individuals “Mike Miller” and “Steve Miller”, and it adds another entity, which is the class “Person”. Note the distinction between individuals, i.e. actual persons in the figure, and the class (also referred to as “concept”), which describes the type of the individuals. Classes are part of the definitions made in a domain model and are shown as light grey ellipses here. Relationships/predicates between individuals and classes and between classes are indicated via the dotted directed line and are read in the direction of the arrow as for the other relationship type. Figure 3 below shows another addition to the model:
Diagram 4
Figure 3: Example of a triple connected to a class and to literal

This figure shows two added so-called “literals”, which in this case are characters that indicate a gender (with “m” standing for “male”). Literals are individuals that directly and only contain a data value; no relationships can originate from a primitive individual. In this figure and in the remainder of the document, literals are indicated in rectangular boxes.
Note that relationships/predicates are always directed; there are no bidirectional relations in the model. For the example used in the figure, an additional relationship “is child of” would be used to express the equivalent of “is parent of” in the reverse direction.
The examples above briefly introduce the main elements that are important for understanding the MOSAIC data model: classes, individuals (which need not be persons), literals and relationships/predicates, as well as the graphical notation used in this deliverable. A full description of the above model is beyond the scope of this deliverable and can be found, for instance, in the work of Segaran et al. and Thomas Gruber.
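The triple structure described above can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the `Triple` type, the `objects` helper and the predicate names “is a” and “has gender” are hypothetical and are not taken from the MOSAIC ontology; only the “is parent of” / “is child of” predicates, the class “Person”, the literal “m” and the two individuals come from the examples above.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # an individual (or a class)
    predicate: str  # a directed relationship
    obj: str        # an individual, a class, or a literal value


# Hypothetical mini data set mirroring the figures described in the text.
triples = [
    Triple("Mike Miller", "is parent of", "Steve Miller"),
    # Relations are directed, so the reverse direction needs its own predicate.
    Triple("Steve Miller", "is child of", "Mike Miller"),
    Triple("Mike Miller", "is a", "Person"),      # individual -> class (assumed name)
    Triple("Steve Miller", "is a", "Person"),
    Triple("Mike Miller", "has gender", "m"),     # individual -> literal (assumed name)
    Triple("Steve Miller", "has gender", "m"),
]


def objects(subject: str, predicate: str) -> list:
    """Return every object reached from `subject` via `predicate`."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


print(objects("Mike Miller", "is parent of"))   # ['Steve Miller']
print(objects("Steve Miller", "is parent of"))  # [] (the relation is not bidirectional)
```

A production system would more plausibly hold such triples in an RDF store than in a Python list; the list is used here only to keep the sketch self-contained.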
The MOSAIC domain model is primarily intended for use with MOSAIC, but is also generic enough to facilitate easy adaptation to domains in which a media-independent representation of a wide range of real-world situations and events is needed. Figure 4 shows the top-level domain model used in the domain ontology (n.b. all depicted subclasses of “Thing” are at the same hierarchical level, but arranged in two layers for display purposes).

Diagram 5
Figure 4: Top-level hierarchical domain model

Table 1 describes the top-level elements of the ontology model and indicates for which main purpose the respective hierarchical level is used in MOSAIC.
Table 1: Top-level elements of the MOSAIC hierarchical ontology model
Diagram 6

These top-level classes are hierarchically structured and contain relevant subclass definitions as is shown for instance in Figure 4 with “Actor”, “Person” and “Group”. Main classes are defined as part of the overall MOSAIC ontology, while more specific classes are defined in sub-ontologies.
OWL (the Web Ontology Language) enables the decomposition of ontological models into separate models that can then be imported to compose a single overall ontology model. The MOSAIC domain ontology imports an established, powerful ontology for the representation of time individuals. Furthermore, the MOSAIC ontology is divided into a core ontology that contains the main high-level classes/concepts and relationships and a separate “sub-ontology” for each of the top-level classes identified in Figure 4. This allows users of the ontology to begin with a lightweight core ontology and to add only the ontology concepts they need, and at the same time serves as an example of how third parties may extend the data model themselves, e.g. for specific application domains or legislative backgrounds.
Lower-level entities in the sub-ontologies defined in MOSAIC have been developed in order to address the requirements of the project demonstration and evaluation scenario.
1.3.2 Social/Criminal Network Analysis Tools
Figure 5 depicts the SCNA subsystem. The subsystem is based on the data and text mining components which extract data from repositories and open source data into the MOSAIC data store. From there, networks can be created, visualised and analysed and new intelligence results returned back to the data store.

Diagram 7
Figure 5: The SCNA subsystem

The social and criminal network analysis is based on four methods:
i. Entity Resolution
ii. Automatic Generation and Prioritisation of Networks
iii. Identification of Offender Roles
iv. Identification of Network Themes
1.3.2.1 Entity Resolution
The aim of the Entity Resolution (ER) process is to select and match entities that have a high probability of relating to the same thing or person by attempting to find duplicate or nearly identical data tuples. ER consists of four main sub-modules:

• Data selection – The user chooses what data is to be compared and the criteria to be applied to matching entities;
• Data comparison – The data is compared on a row by row and field by field basis. Each entity will then hold a list, for each field, of other entities that have matching, or nearly matching, values;
• Data matching – Every entity is examined, looking at the lists of matching fields, and calculating whether entities are a match or a near match. Groups of these entities are then formed;
• Post-matching – results are presented to the user, who can fine tune decisions made by the module. These options include adding or removing entities from groups. The results and user changes may be written to a separate file for further analysis.
The real-world problem addressed is the following. When a person’s details are entered into the crime system, the person entering the data should use the available look-up to check whether the person is already in the system, and if so reuse the same reference number. However, sometimes due to time constraints, misunderstanding of the system etc., the look-up is not used, and new reference numbers are generated for persons who are already registered in the system. The process also relies on the data being entered correctly. If the data entered differs slightly between two or more records (for example, a misspelt surname or forename, or transposed digits in the date of birth) that in fact relate to the same person, the records will be treated as relating to different persons.

The aim of ER is to identify data with similar/different surnames, forenames, dates of birth, reference number etc., and to select and match those records that have a high probability of relating to the same person. The way in which this is accomplished is by ER allocating a probability Type of M (matched), P (possible match) or X (no match) to each record when the process is run. ER then allocates an ID to all people with the same or similar name with the same or similar dates of birth, even if the reference number is different.
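
As an illustration of the comparison and matching steps, the following minimal sketch compares records field by field, labels each pair M, P or X, and gives matched records a shared ID even when their reference numbers differ. It is not the MOSAIC implementation: the sample records, the use of Python's `difflib.SequenceMatcher` as the similarity measure, the averaging of field scores and both thresholds are assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records: (reference number, forename, surname, date of birth)
records = [
    ("R001", "Steven", "Miller", "1980-03-12"),
    ("R002", "Stephen", "Miler", "1980-03-12"),  # misspelt duplicate of R001
    ("R003", "Anna", "Kowalski", "1975-11-02"),
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_type(r1, r2, matched=0.85, possible=0.6) -> str:
    """Compare field by field (ignoring the reference number) and return
    'M' (matched), 'P' (possible match) or 'X' (no match).
    Both thresholds are assumed values."""
    fields = list(zip(r1[1:], r2[1:]))
    score = sum(similarity(x, y) for x, y in fields) / len(fields)
    if score >= matched:
        return "M"
    if score >= possible:
        return "P"
    return "X"


def resolve(recs):
    """Label every pair and give records linked by an 'M' a shared person ID."""
    parent = {r[0]: r[0] for r in recs}

    def find(ref):
        while parent[ref] != ref:
            ref = parent[ref]
        return ref

    pair_types = {}
    for r1, r2 in combinations(recs, 2):
        label = match_type(r1, r2)
        pair_types[(r1[0], r2[0])] = label
        if label == "M":
            parent[find(r2[0])] = find(r1[0])
    person_ids = {ref: find(ref) for ref in parent}
    return pair_types, person_ids


pair_types, person_ids = resolve(records)
print(pair_types)
print(person_ids)
```

In this sketch only "M" pairs are merged automatically; "P" pairs would be left for the user to confirm or reject in the post-matching step.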

1.3.2.2 Automatic generation and prioritisation of networks
The algorithms are based on Adderley et al. and are presented as part of the MOSAIC demonstrator.
The development and integration comprises the creation of the algorithms, the creation of a node to be used as part of a data mining process of interconnected nodes, and the adaptation and extension of existing data manipulation nodes to ease data preparation for this specific node. With this development finalised, a data mining process has been created within the demonstrator which covers the gathering and preparation of data up to the creation of networks at different degrees of freedom.
The data mining process has been extensively tested within the Policing arena involving partners WMP and WP. Like the algorithms themselves, the processes are independent of any specific data schema, meaning that a single process can be shared and applied by any Police Force, provided that the data available in that Force is sufficient to meet the general aim of the process.
Data, which could come from crime and offender data sets as well as from the text mining component, has been resolved and is imported into the process. The data is then joined over a specified reference number, e.g. one representing an event such as a crime. In the specific case described here, a data set is retrieved of offenders who have been involved in several crimes, many of which were committed together with other criminals (co-offending). For overview purposes, over 800 crime types have been aggregated into a super set.
In the next step, a new “today” field is created. This step is optional but is needed if a decay function is to be used in the scoring of crimes and the involved offenders within the MOSAIC data set, for example to put emphasis on more recent crimes. As the MOSAIC data does not come directly from a current Police data warehouse but was recorded at a certain point in the past, the “today” field is needed to indicate that “today” is the most recent date in the data set, i.e. the gap between today and the last recorded crime is closed. This is not required when using “live” data.
The decay function (recency) is then created; in the case of MOSAIC it produces a higher score the more recent a crime is (meaning that more recent crimes are more important). Additionally, the scoring for crime types based on force priorities is created. Based on this, networks of 2 up to 6 degrees of freedom can be created. Using recency as the decay function and the scoring inside the network calculations, force-prioritised networks can be delivered, so that a force can target networks more effectively with targeted resources based on its own priorities.

Diagram 8
Figure 6: Data mining process to create criminal networks with recency, crime scoring and different degrees of freedom
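
The process in Figure 6 can be approximated by the following minimal sketch. It rests on several assumptions that are not taken from the MOSAIC demonstrator: the sample crime records, the exponential form of the decay function and its half-life, the priority scores per crime type, and the reading of “degrees of freedom” as the number of co-offending links followed from a starting offender.

```python
from collections import defaultdict, deque
from datetime import date
from itertools import combinations

# Hypothetical records: (crime reference, offender, aggregated crime type, date)
crimes = [
    ("C1", "A", "burglary", date(2012, 1, 10)),
    ("C1", "B", "burglary", date(2012, 1, 10)),
    ("C2", "B", "drugs", date(2012, 6, 1)),
    ("C2", "C", "drugs", date(2012, 6, 1)),
    ("C3", "C", "theft", date(2011, 6, 1)),
    ("C3", "D", "theft", date(2011, 6, 1)),
]
PRIORITY = {"burglary": 3.0, "drugs": 2.0, "theft": 1.0}  # assumed force priorities
HALF_LIFE_DAYS = 365.0                                    # assumed decay rate

# "today" is the most recent date in the (historic) data set.
today = max(d for _, _, _, d in crimes)


def recency(d: date) -> float:
    """Decay function: 1.0 for a crime dated 'today', halving every half-life."""
    return 0.5 ** ((today - d).days / HALF_LIFE_DAYS)


# Join over the crime reference to find co-offenders.
by_crime = defaultdict(list)
for ref, offender, ctype, d in crimes:
    by_crime[ref].append((offender, ctype, d))

# Each co-offending link accumulates priority x recency over shared crimes.
edges = defaultdict(float)
for members in by_crime.values():
    for (o1, ctype, d), (o2, _, _) in combinations(sorted(members), 2):
        edges[(o1, o2)] += PRIORITY[ctype] * recency(d)


def network(seed: str, degrees: int) -> set:
    """Offenders reachable from `seed` within `degrees` co-offending links."""
    adjacent = defaultdict(set)
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth[node] == degrees:
            continue
        for nxt in adjacent[node]:
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return set(depth)


def network_score(members: set) -> float:
    """Sum of link scores inside a network, usable for prioritisation."""
    return sum(s for (a, b), s in edges.items() if a in members and b in members)


print(sorted(network("A", 2)), round(network_score(network("A", 2)), 2))
print(sorted(network("A", 3)), round(network_score(network("A", 3)), 2))
```

With “live” data the `today` value would simply be the current date, as noted above.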

1.3.2.3 Identification of offender roles
Figure 7 shows the enhanced data mining process from data import to the assignment of offender roles and further to identify the “top” offender(s) for targeting in order to disrupt network activity. First data from offender and crime data sets were gathered and linked. Then more than 800 different crime types were aggregated into a superset of violence, burglary, drug crimes, vehicle crimes, and theft.
Diagram 9
Figure 7: Data mining process to assign offender roles

By extracting the total number of crimes per offender on the one side and the number of crimes per crime type from the super set, the ratio of the specific crime types to the overall number of crimes can be retrieved. If a threshold has been reached, a role can be assigned. Depending on the threshold it is possible to assign more than one role to a single offender. At the same time it is possible that no role can be assigned.
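
The ratio-and-threshold rule can be sketched as follows. The offender data and the threshold value of 0.4 are assumptions chosen for illustration; the crime types are those of the superset named above.

```python
from collections import Counter

# Hypothetical offender -> aggregated crime types of their recorded crimes
offences = {
    "A": ["burglary", "burglary", "burglary", "theft"],
    "B": ["drugs", "violence", "drugs", "violence"],
    "C": ["theft", "burglary", "drugs", "violence", "vehicle"],
}
THRESHOLD = 0.4  # assumed value


def assign_roles(crime_types, threshold=THRESHOLD):
    """Return every crime type whose share of the offender's crimes
    reaches the threshold; the result may hold zero, one or several roles."""
    counts = Counter(crime_types)
    total = len(crime_types)
    return sorted(t for t, n in counts.items() if n / total >= threshold)


for offender, types in offences.items():
    print(offender, assign_roles(types))
# A ['burglary']
# B ['drugs', 'violence']
# C []
```

Offender B shows how more than one role can be assigned, and offender C how no role may be assigned at all.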
1.3.2.4 Identification of Network Themes
Figure 8 shows the process of assigning themes to a criminal network and its sub-groups. The upper part of the process recreates the creation of networks as shown previously. The lower part then tries to identify the network themes.
Diagram 10
Figure 8: Data mining process to assign network themes

The initial stage of this process follows the assignment of roles to offenders first using the same approach as outlined above. This means a super set of crime types is created and the number of total crimes per crime type for each offender is found as well as offenders’ total number of crimes both of which are joined. To retrieve the same statistics for the network, data can be aggregated in order to 1) retrieve the total number of crimes; and 2) the total number of crimes per crime type. Following this, a list of crime types is created with each crime type assigned a ratio of the total number of crimes in the network.
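
A minimal sketch of the network-level aggregation is given below. The member data is hypothetical, and how the resulting ratio list is turned into a named theme (for example via a threshold, as for offender roles) is not specified here and is left open.

```python
from collections import Counter

# Hypothetical network: member offender -> aggregated crime types
network_offences = {
    "A": ["burglary", "burglary", "burglary", "theft"],
    "B": ["burglary", "drugs", "drugs", "theft"],
}


def theme_ratios(members):
    """Aggregate over all members: each crime type's share of the total
    number of crimes in the network, largest share first."""
    counts = Counter(t for types in members.values() for t in types)
    total = sum(counts.values())
    ratios = [(t, n / total) for t, n in counts.items()]
    return sorted(ratios, key=lambda item: (-item[1], item[0]))


print(theme_ratios(network_offences))
# [('burglary', 0.5), ('drugs', 0.25), ('theft', 0.25)]
```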
1.3.3 Data/ Text Mining Component
The Text Mining Process is implemented by the following steps:

• segmentation and tokenisation of the sections of the document into individual sentences
• multiword detection (MWT)
• recognition of the entities (NER)
• morpho-syntactic analysis (POS tagging)
• sense disambiguation (WSD)
• functional analysis and semantics

The division of the text into sentences serves the detection of very close relationships arising from the logical and functional analysis. Two entities that are linked by a direct relationship such as AGENT-COMPL or AGENT-WHERE will have a very strong bond. Two entities that are related only by proximity will have a weak link, as the sentences in which they appear are far apart in the text. Once the text is split into sentences, a morphological analysis is performed that aims to identify the part of speech (POS tagging).
The linguistic engine inside the Text Mining node identifies multi-word combinations in the text. This module is closely related to the Named Entity Recognition that identifies, according to a predetermined rule, the entities in the text.
Where there is an ambiguity of meaning, the WSD module assigns the correct meaning to an ambiguous term, depending on the context in which it appears. The logical and functional analysis of the sentences completes the analysis process. The aim is to identify the entities present in the analysed text and the relationships existing between them.
The following entity types are recognised in texts:

date = date
time = time
person = name of a person
address = address
town = name of a town/city
country = name of a country
continent = name of a continent
location = generic place like "park", "hospital", "station"...
licenseplate = vehicle number plate
organisation = name of an organisation
company = name of a company
brand = name of a product
url = internet web site
ip = ip number of a PC
email = email address
bankaccount = bank account details
phonenumber = telephone number
number = generic number
bodypart = part of a body
vehicle = vehicle type
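
As a rough illustration of rule-based entity recognition for a few of the types listed above, the sketch below uses regular expressions. The patterns, the rule table and the function name are hypothetical and deliberately simple; they do not reproduce the rules of the MOSAIC linguistic engine.

```python
import re

# Hypothetical, deliberately loose rules for four of the entity types.
ENTITY_RULES = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "url": re.compile(r"\bhttps?://[^\s,;]+"),
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "time": re.compile(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b"),
}


def tag_entities(text):
    """Return (entity type, matched text) pairs in order of appearance."""
    found = []
    for entity_type, pattern in ENTITY_RULES.items():
        for match in pattern.finditer(text):
            found.append((match.start(), entity_type, match.group()))
    return [(etype, value) for _, etype, value in sorted(found)]


sample = ("At 14:05 a mail from john.doe@example.org linked to "
          "http://example.org/a from 192.168.0.7")
entities = tag_entities(sample)
```

Types such as person or organisation cannot be captured by patterns of this kind and would need the linguistic analysis described in the text.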

Heuristic algorithms have been implemented within MOSAIC in order to extract the following kinds of relationships between the entities mentioned above:
Table 2: Entity relationships extracted from texts
Diagram 11

The output of the text mining process is a KAF-XML file. A level for the annotation of named entities and a level for the annotation of entity relationships have been implemented in KAF. These two KAF levels are imported into the MOSAIC data store so that the Social and Criminal Network subsystem can extract the information as part of the entities and relationships.

1.3.4 Video Analytics Component
The following figure presents the design of the VA component.
Diagram 12
Figure 9: The MOSAIC video analytics sub-system

As can be seen, a number of algorithms have been developed and evaluated within the MOSAIC project.
1.3.4.1 Background Subtraction
Background subtraction is used as input for several video analytics modules, among them left-behind object detection and single-camera tracking. Several background subtraction algorithms had already been tested and compared. For the MOSAIC project, a method based on Gaussian Mixture Models (GMM) developed within the project has been chosen. Specifically, a version with an adaptive variance-controlling value, which is also used to properly initialise newly created modes, has been selected. Furthermore, the adaptive variance-controlling value is used to trigger a splitting rule so as to avoid over-dominating modes.
The selected algorithm has been improved during the MOSAIC project so as to incorporate region analysis as proposed by Heras Evangelio and Sikora. In this way, it is possible to maintain static foreground objects in the foreground and to rapidly incorporate uncovered background areas into the background. This extension not only enhances the quality of the computed foreground masks (which are used as input for other analysis modules for the tracking of persons) but is also a major contribution towards the detection of static objects (left-behind objects). The most critical issues of the resulting algorithm from the background subtraction perspective are the selection of an adequate point in time at which to trigger the region classification, and the region classification itself. These two issues have been thoroughly investigated within the MOSAIC project and the results have been published.
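To make the idea of a per-pixel Gaussian mixture background model concrete, the following is a minimal textbook-style sketch for a single grey-value pixel. It is not the MOSAIC algorithm: the adaptive variance-controlling value, the splitting rule and the region analysis described above are all omitted, and every parameter value (learning rate, initial variance, background ratio, variance floor) is an assumption.

```python
import math


class PixelGMM:
    """Simplified mixture-of-Gaussians background model for one pixel."""

    def __init__(self, max_modes=3, alpha=0.05, init_var=225.0, bg_ratio=0.7):
        self.max_modes = max_modes
        self.alpha = alpha          # learning rate (assumed value)
        self.init_var = init_var    # variance given to newly created modes
        self.bg_ratio = bg_ratio    # weight mass regarded as background
        self.modes = []             # each mode is [weight, mean, variance]

    def update(self, x):
        """Feed one grey value; return True if it is classified as foreground."""
        matched = None
        for mode in self.modes:
            if abs(x - mode[1]) <= 2.5 * math.sqrt(mode[2]):
                matched = mode
                break
        # Raise the weight of the matched mode, decay all the others.
        for mode in self.modes:
            hit = 1.0 if mode is matched else 0.0
            mode[0] += self.alpha * (hit - mode[0])
        if matched is not None:
            diff = x - matched[1]
            matched[1] += self.alpha * diff
            matched[2] += self.alpha * (diff * diff - matched[2])
            matched[2] = max(matched[2], 4.0)   # variance floor
        else:
            if len(self.modes) >= self.max_modes:
                self.modes.remove(min(self.modes, key=lambda m: m[0]))
            matched = [self.alpha, float(x), self.init_var]
            self.modes.append(matched)
        total = sum(m[0] for m in self.modes)
        for mode in self.modes:
            mode[0] /= total
        # Most reliable modes first; they form the background until their
        # cumulative weight exceeds bg_ratio.
        ranked = sorted(self.modes, key=lambda m: m[0] / math.sqrt(m[2]),
                        reverse=True)
        cumulative = 0.0
        background = []
        for mode in ranked:
            background.append(mode)
            cumulative += mode[0]
            if cumulative > self.bg_ratio:
                break
        return not any(mode is matched for mode in background)


pixel = PixelGMM()
history = [pixel.update(100) for _ in range(50)]   # stable background value
```

After the stable period, a sudden value such as 200 falls outside every background mode and is reported as foreground, while a return to 100 is again reported as background.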
1.3.4.2 Optical Flow
Motion information is used as input data for several video analytics in the MOSAIC project. As described in the previous deliverable, the estimation of motion is based on the concept of optical flow. Depending on the video analytics, the motion information of the current scene is needed in a dense or sparse manner. Dense motion information is defined as a 2D vector field of the size of the respective video frame, where each vector corresponds to a pixel. Such data is used for the abnormal crowd motion analysis. The sparse motion representation has been developed as a set of 2D vectors where each vector corresponds to a position in the image. Such data is used to enhance the single-camera tracking and to track corner-like features for the motion-based crowd motion analysis. Depending on the application and on whether a dense or sparse motion representation is needed, different algorithms have been proposed that differ in accuracy and run-time. In general, global optical flow methods are very accurate but have a high run-time and are used to estimate dense motion information, whereas local optical flow methods are less accurate but have a low run-time and are used to estimate sparse motion information.
All motion-based video analytics in the MOSAIC project implement local optical flow methods such as the RLOF and the BERLOF, its extension with respect to computational complexity. As described by Senst et al., a limitation of previous work on local methods lies in the rectangular shape of the support region that is used to estimate the motion at a given position. As a result, the motion field at motion boundaries tends to be falsely estimated, as shown in Figure 10 (left).
Diagram 13
Figure 10: Comparing motion boundaries estimated by the BERLOF (left) and its extension using shape adaption (right) for the optical flow field of the RubberWhale sequence.
1.3.4.3 Group Detection
The aim of the group detection module is to detect groups of people standing at different locations such as the entrance of a bar, etc. One of the main problems of this task is how to cope with occlusions, which make it difficult to detect persons using state-of-the-art detectors. Therefore, a new person detector based on multiple body part detectors, which is robust with regard to occlusions, has been developed within the MOSAIC project. The main contribution of the proposed detector is the statistically driven way of combining the individual body part detectors in order to detect persons even when individual body part detections fail. Thus, persons can be detected with high reliability both when they are completely visible (people in front of the group) and when they are only partially visible (people behind). The proposed approach has been validated and compared with the state-of-the-art on challenging crowded scenes from several public video datasets.
1.3.4.4 Crowd Motion Analysis
The approach proposed for the MOSAIC project introduces a new methodology of extracting crowd features based on long-term motion information in order to achieve independence from the static-camera constraint while still obtaining sufficient information for counting and segmentation of groups. It is based on the assumption that the quantity of image corners in a scene is related to the number of people appearing in the scene. However, image corners can be caused by the background too. Therefore a classification based on the different speed of the long-term motion for each corner is performed to label moving corners as foreground belonging to pedestrians. To account for the global motion of the camera, the long-term motion is normalised by subtracting the previously estimated global motion.
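The normalisation and speed-based labelling described above might be sketched as follows. The function name, the data layout and the threshold value are assumptions for illustration; the actual MOSAIC classifier is not specified in this summary.

```python
import math


def classify_corners(long_term_motion, global_motion, speed_threshold=1.0):
    """Label each tracked corner as 'foreground' or 'background'.

    long_term_motion: list of (dx, dy) displacements, one per corner
    global_motion:    (dx, dy) estimated camera motion over the same interval
    speed_threshold:  hypothetical residual displacement (pixels) separating
                      moving pedestrians from static background
    """
    gx, gy = global_motion
    labels = []
    for dx, dy in long_term_motion:
        # Motion that remains after removing the camera motion.
        residual = math.hypot(dx - gx, dy - gy)
        labels.append("foreground" if residual > speed_threshold else "background")
    return labels


labels = classify_corners(
    [(5.0, 0.2), (2.1, -0.1), (2.0, 0.0)],
    global_motion=(2.0, 0.0),
)
```

Only the first corner moves noticeably relative to the camera, so it alone is labelled as foreground.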
Diagram 14
Figure 11: Processing steps to estimate the Multi-Person density map. From left to right and top to the bottom: Original video frame, extracted feature points and path lines influenced by camera shaking, stabilised path lines, colour-coded Multi-Person density estimate

In contrast to previous methods, the number of people is not derived directly by using an image-feature regression. Instead, a Multi-Person Density motivated by the Probability Hypothesis Density will be used to compute the number of people by integrating over it. The Multi-Person Density is estimated using a new image feature-based human model in which the likelihood of person detection depends on the number of FAST features within a region of interest. This likelihood has previously been trained scene-independently using the CAVIAR dataset. Figure 11 gives an example of the estimation of the Multi-Person Density map for a sequence with a shaking camera. The Multi-Person Density map is the base input for more detailed studies such as the detection of groups of moving people as shown in Figure 12 and the counting of people in that group.
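Counting by integration can be shown with a small sketch: on a discrete density map, integrating reduces to summing the cells of a region. The map values, the region format and the function name below are illustrative assumptions, not MOSAIC outputs.

```python
def count_people(density_map, region=None):
    """Sum a per-cell person density over a region (row0, row1, col0, col1).

    With region=None the whole map is integrated.
    """
    rows = len(density_map)
    cols = len(density_map[0]) if rows else 0
    r0, r1, c0, c1 = region if region else (0, rows, 0, cols)
    return sum(sum(row[c0:c1]) for row in density_map[r0:r1])


# Toy density map: one person spread over four cells, two half-visible persons.
density = [
    [0.0, 0.25, 0.25, 0.0],
    [0.0, 0.25, 0.25, 0.0],
    [0.5, 0.0, 0.0, 0.5],
]
total_people = count_people(density)
group_people = count_people(density, region=(0, 2, 1, 3))
```

Restricting the sum to a segmented group region yields the number of people in that group, which is how a count per group can be obtained from the same map.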
Diagram 15
Figure 12: Visual results of crowd density estimation and group detection. From top to bottom: SideWalk, PETS S1.L1 13-57, PETS S1.L2 14-06. From left to right: Original image for SideWalk, stabilised path lines, crowd segmentation, crowd density maps with segment

1.3.4.5 Single Camera Tracking
Single-Camera Tracking in the MOSAIC project has been implemented using a Gaussian-Mixture implementation of a Probability Hypothesis Density (GM-PHD) Tracker which is enhanced using a Sparse Optical Flow.
The GM-PHD tracking uses the tracking-by-detection paradigm and first uses a human detection step in order to find all persons in the image before composing the individual object tracks.
Accordingly, the main issues for this tracking system (as shown e.g. by Eiselein et al.) occur in cases of a low detection rate, where the tracking algorithm has to anticipate the motion of the object and cannot rely on given detections. For human tracking, this is especially relevant because most humans do not move according to a linear model, which makes the prediction step difficult.
Although an individual object model can be built using image features such as for example Colour Histograms, Region Covariance or SURF points which can be exploited in cases of occlusion and missing detections, these models can be error-prone under bad image quality and are often computationally too expensive for real-time processing. The MOSAIC Tracker uses a model based on colour histograms in order to improve the performance in ambiguous situations such as occlusions or proximity to other objects.
In order to further enhance the tracking performance, ways of improving the human detection rate have been investigated for the MOSAIC Single-Camera Tracker.
In the first prototype, a human detector based on a histogram of oriented gradients (HoG) of a human head and shoulder was used. A drawback to this method is that it requires a considerable amount of processing power and runs only at 3 fps on a standard PC using a CPU implementation. Additionally, due to the use of limited image information (only heads/shoulders are exploited), the detection performance is not optimal.
Therefore, for the final version of the MOSAIC Single-Camera Tracker, a new human detector was integrated which is based on Background Subtraction using Gaussian Mixture Models. Its main advantage is the processing time, which could be reduced by a factor of approximately 2-3. In its basic version, it only uses background subtraction and blob extraction and is thus well suited for static cameras and sparsely crowded scenes, where it has much higher detection accuracy.
1.3.4.6 Multi-Camera Tracking
The objective of this module is to track one object of interest, usually expected to be a human being, across different cameras or camera views. Since path trajectory analysis methods of single-camera tracking algorithms cannot be used to infer the local position of a tracked object from one camera to another one, person re-identification is used in the multi-camera tracking video analytics module. Person re-identification refers to the process of determining whether a given detected or tracked individual has already appeared in a different image or camera view. For the purposes of this project, a re-identification system is applied which is able to recognise a given individual based on visual features only. With this consideration it is mandatory to be able to compute a unique visual signature model for each individual, as well as to compare two models by computing a similarity or distance score or performing object classification with respect to a given model, in order to achieve efficient matching for proper human re-identification.
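The matching step (compare a query signature with stored signatures by a similarity score) can be sketched as below. The use of a colour histogram as signature, histogram intersection as score, and the acceptance threshold are all assumptions chosen for brevity; the summary does not state which features or distance the MOSAIC module uses.

```python
def normalise(hist):
    """Scale a histogram so that its bins sum to 1."""
    total = sum(hist)
    return [h / total for h in hist] if total else list(hist)


def similarity(sig_a, sig_b):
    """Histogram intersection of two normalised signatures, in [0, 1]."""
    return sum(min(a, b) for a, b in zip(normalise(sig_a), normalise(sig_b)))


def re_identify(query, gallery, min_score=0.8):
    """Return (identity, score) for the best match, or (None, score)."""
    best_id, best_score = None, 0.0
    for identity, signature in gallery.items():
        score = similarity(query, signature)
        if score > best_score:
            best_id, best_score = identity, score
    return (best_id if best_score >= min_score else None), best_score


# Hypothetical three-bin colour signatures of two previously seen persons.
gallery = {"person_1": [8, 1, 1], "person_2": [1, 1, 8]}
match, score = re_identify([7, 2, 1], gallery)
unknown, _ = re_identify([1, 8, 1], gallery)
```

A query that resembles no stored signature closely enough is returned as unmatched rather than being forced onto the nearest identity.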
1.3.4.7 Loitering Detection
Loitering is defined as a person entering a defined zone within the scene and remaining within the zone for more than t seconds. Loitering detection is implemented relatively straightforwardly in our system, as a simple extension of trespassing detection. Trespassing detection is performed by checking for intersections between the person’s path and a set of predefined lines (tripwires) that should not be crossed. Loitering is detected by assigning a timer to each person who has crossed a tripwire once. If a trespassing event has been detected and the same object does not re-cross the tripwire within a certain time, the loitering event is fired.
Diagram 16
Figure 13: Loitering detection
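A minimal sketch of the tripwire-and-timer logic follows, assuming a single tripwire and per-object position updates from the tracker. The class, method and event names are hypothetical, and the segment test treats touching a tripwire as crossing it.

```python
def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p)."""
    value = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (value > 0) - (value < 0)


def crosses(a, b, c, d):
    """True if path segment a-b intersects tripwire segment c-d."""
    return (_orient(a, b, c) != _orient(a, b, d)
            and _orient(c, d, a) != _orient(c, d, b))


class LoiteringDetector:
    def __init__(self, tripwire, max_seconds):
        self.tripwire = tripwire        # ((x1, y1), (x2, y2))
        self.max_seconds = max_seconds  # the "t seconds" of the definition
        self.last_pos = {}              # track id -> last known position
        self.entered_at = {}            # track id -> time of first crossing
        self.reported = set()           # tracks already reported as loitering

    def update(self, track_id, position, timestamp):
        """Return 'trespass', 'loitering' or None for this observation."""
        event = None
        previous = self.last_pos.get(track_id)
        self.last_pos[track_id] = position
        if previous is not None and crosses(previous, position, *self.tripwire):
            if track_id in self.entered_at:
                # Re-crossing: the object has left the zone, reset its timer.
                del self.entered_at[track_id]
                self.reported.discard(track_id)
            else:
                self.entered_at[track_id] = timestamp
                event = "trespass"
        elif (track_id in self.entered_at
              and track_id not in self.reported
              and timestamp - self.entered_at[track_id] > self.max_seconds):
            self.reported.add(track_id)
            event = "loitering"
        return event


detector = LoiteringDetector(tripwire=((0, 0), (0, 10)), max_seconds=5)
events = [
    detector.update("p1", (-1, 5), 0),
    detector.update("p1", (1, 5), 1),   # crosses the tripwire
    detector.update("p1", (2, 5), 4),
    detector.update("p1", (2, 6), 7),   # more than 5 s after crossing
    detector.update("p1", (2, 6), 8),
    detector.update("p1", (-1, 6), 9),  # re-crosses and leaves
]
```

The loitering event is reported once per stay; re-crossing the tripwire resets the state for that object.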
1.3.4.8 Mugging Detection
The mugging detection algorithm utilises tracking algorithms (the single-camera tracking algorithm for which preliminary results are provided above) and further processing in order to identify potential muggings. The method implemented in the system is based on the running detector: when two or more objects are detected running over a relatively short distance, the mugging event is fired.
Diagram 17
Figure 14: Mugging Detection
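One possible reading of the running-detector rule is sketched below: a track counts as running when its average speed over a short observation window exceeds a threshold, and the event fires when at least two tracks are running in that window. This interpretation, the speed threshold and all names are assumptions; the summary does not give the actual criteria.

```python
import math


def is_running(track, min_speed):
    """track: list of (t, x, y) samples covering a short window."""
    (t0, x0, y0), (t1, x1, y1) = track[0], track[-1]
    if t1 <= t0:
        return False
    return math.hypot(x1 - x0, y1 - y0) / (t1 - t0) >= min_speed


def mugging_event(tracks, min_speed=3.0, min_runners=2):
    """Return (event fired?, sorted ids of the running tracks)."""
    runners = [
        track_id
        for track_id, track in tracks.items()
        if len(track) >= 2 and is_running(track, min_speed)
    ]
    return len(runners) >= min_runners, sorted(runners)


# Positions in metres, time in seconds (assumed units).
tracks = {
    "a": [(0, 0.0, 0.0), (1, 4.0, 0.0)],   # 4 m/s
    "b": [(0, 1.0, 0.0), (1, 1.0, 5.0)],   # 5 m/s
    "c": [(0, 2.0, 2.0), (1, 2.5, 2.0)],   # 0.5 m/s
}
fired, runners = mugging_event(tracks)
```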
1.3.4.9 People Carrying Large Objects
This video analytics module is based on the Lagrangian Framework and extended Lagrangian concepts towards detecting individual people carrying objects in video sequences. It makes use of forward and backward Lagrangian Coherent Structures (LCS) described by the Finite Time Lyapunov Exponents (FTLE) field as an integrated spatio-temporal feature space to analyse pedestrian motion behaviour. The pedestrian appearance model benefits from the Lagrangian descriptors for video surveillance, which enable temporal appearance modelling without explicit tracking. This approach is based on detection and offers the advantage of automatically incorporating information across a variable number of time steps into one FTLE field.
The method for detecting people carrying large objects, which is based on time-dependent vector field analysis, is integrated into a framework that starts by detecting pedestrians in a sequence of video frames. For this purpose, a combination of the well-known Histogram of Oriented Gradients (HOG) descriptor and a linear SVM is applied to detect the people candidates. The analytic is applied to the Pets2006 dataset that was also used by Damen and Hogg. The classifier is trained using pedestrian samples annotated from the Pets2006 sequence 2 recorded by camera 3. In order to reduce the search space and thus the computational complexity of the classification step, two post-processing steps are integrated. First, foreground blob detection is applied based on the Heras Evangelio and Sikora foreground segmentation method in combination with a connected component analysis. In a thinly populated sequence each blob detection would be associated with a person detection, but Pets2006 contains pedestrian groups, so the HOG detector has to be applied to each foreground blob. Secondly, the HOG detector is used in a calibrated environment to deal with projective distortion. The scale of the HOG descriptor is computed directly from the calibration, assuming a pedestrian is 1.8 m in height. Different HOG scales are implemented by resizing the respective image content. For each bounding box the spatio-temporal analysis is applied. To this end, the optical flow over the integration interval τ is computed and stacked into a 3-dimensional tensor. Thus the sequence of optical flow frames is transferred into the space-time domain, which allows the analytics to estimate the particle trajectories, the flow map and finally the FTLE field of each pedestrian. Finally, a HOG-FTLE descriptor has been developed for each pedestrian and, in combination with a linear SVM classifier, used to label the people carrying objects.
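The idea of deriving the detector scale from calibration can be illustrated with a simple pinhole-camera sketch. The pinhole model, the focal length, the distance and the 128-pixel detector window are assumptions made for this example; only the 1.8 m pedestrian height comes from the text, and the actual MOSAIC calibration procedure is not described in this summary.

```python
def pedestrian_pixel_height(focal_length_px, distance_m, person_height_m=1.8):
    """Expected image height (pixels) of a pedestrian under a pinhole model."""
    return focal_length_px * person_height_m / distance_m


def resize_factor(focal_length_px, distance_m, window_height_px=128):
    """Factor by which to resize the image content so that the pedestrian
    fills a fixed-size detector window (hypothetical window height)."""
    return window_height_px / pedestrian_pixel_height(focal_length_px, distance_m)


# A pedestrian 18 m from a camera with a 1000-pixel focal length.
height_px = pedestrian_pixel_height(1000.0, 18.0)
factor = resize_factor(1000.0, 18.0)
```

Resizing the image content rather than the descriptor keeps a single trained window size, which matches the statement that different HOG scales are implemented by resizing the image content.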
1.3.4.10 Left-behind Objects
The left-behind objects detection module aims at detecting abandoned objects in public spaces, as these could be considered as a potential threat to the public security. This module has been implemented based on the algorithm presented by Heras Evangelio and Sikora (SOD-CBGM), which makes use of two complementary background models in order to obtain information about the temporality of the detected changes by means of background subtraction. Furthermore, the detected stationary foreground regions are classified as new or removed objects (empty scene background).
Thereby, the background initialisation problem is relocated to the right place and time, where it is needed. Furthermore, by defining some simple operations on the models corresponding to the pixels where static objects were removed, the foreground segmentation results have been considerably improved.

1.3.5 Smart Camera Component (part of VA)
DResearch developed the hardware and firmware of multi-board intelligent IP cameras which feature sufficient computational power to facilitate on-board processing of state-of-the-art video analytics – like the detection of static objects (lost luggage).
The modular design of the MOSAIC smart camera comprises three boards which are connected by standard connectors:
• The System Board with internal and external interfaces and application specific auxiliaries like:
o A GPS sensor – to know where the camera is located,
o An electronic compass – to know the camera direction, and
o A motion tracking sensor – to know immediately whether the camera position has changed.
• The Processor Board with the i.MX6 Quad processor, and
• The Sensor Board with the OV5640 system-on-a-chip (SOC) image sensor.
This flexible design facilitates adaptation and exploitation for video analytics and decision support.
Diagram 18
Figure 15: MOSAIC Smart Camera

One decision support tool which is facilitated by the MOSAIC smart camera is the hardware-based detection of camera tampering events using the 3-axis accelerometer features of the MPU-9150 MotionTracking sensor on the System Board.
Detected camera movement events are sent as ONVIF event notifications to the triple store and/or are displayed at the correct 3D position within the VSAR 3D environment.
The hardware-based detection of camera tampering events is designed to be a robust and effective decision support tool that gives the operator greater situational awareness when determining whether the feed from the camera can be relied upon.
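One way such a movement check could work is to compare the current accelerometer reading with a reference reading taken at installation and to flag a change in orientation. The sketch below is an assumption for illustration only: the angle criterion, the threshold and the function names are not taken from the MOSAIC firmware, and no sensor driver calls are shown.

```python
import math


def tilt_angle_deg(reference, sample):
    """Angle in degrees between two 3-axis accelerometer vectors."""
    dot = sum(r * s for r, s in zip(reference, sample))
    norm = (math.sqrt(sum(r * r for r in reference))
            * math.sqrt(sum(s * s for s in sample)))
    if norm == 0:
        return 0.0
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def camera_moved(reference, sample, max_tilt_deg=2.0):
    """True if the gravity direction has changed by more than max_tilt_deg."""
    return tilt_angle_deg(reference, sample) > max_tilt_deg


reference = (0.0, 0.0, 1.0)   # gravity direction recorded at installation
steady = camera_moved(reference, (0.0, 0.01, 1.0))
tampered = camera_moved(reference, (0.0, 1.0, 1.0))
```

A check of this kind only sees tilt; rotation about the gravity axis would need the compass or gyroscope readings instead.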

1.3.6 Decision Support System
1.3.6.1 VIKI
The VIKI (View It, Know It) visualisation has been developed over a number of years by BAE SYSTEMS. It uses the CAST vsar-toolkit to generate an interactive mixed reality environment which can display video data projected into a 3D world representative of an observed area. The vsar-toolkit provides the ability to render a 3D scene containing buildings and terrain. It contains additional methods that allow cameras and other objects to be added to the scene.
To improve the functionality of VIKI as a decision support tool, BAE SYSTEMS have developed additional features to enable its use within MOSAIC as a CCTV Command and Control system that provides situational awareness to the CCTV operator. These include the display of ONVIF objects, the addition of live and playback modes using BAE SYSTEMS Universal Video Management System (UVMS) and the real-time location display of agents with body worn GPS.
The VIKI user interface is shown in Figure 16 below. The main widget, on the right hand side, shows a 3D representation of the area under surveillance. On the left there are four widgets that offer additional control or information to the user:
A. Playback mode selection
B. Time display and selection
C. Object list view
D. Metadata view
Diagram 19
Figure 16: VIKI Main Window

The 3D visualisation can be manipulated by the user using a mouse. The mouse actions control the virtual user ‘camera’ with which the scene is viewed. The mouse wheel allows the user to translate the scene along the forwards-backwards camera axis. Holding the middle button while moving the mouse causes the scene to be moved in the direction of the camera left-right and up-down axes. Holding the right button while moving the mouse rotates the scene around the camera left-right and up-down axes.
The visualisation also shows camera, event and object positions. The user can move directly to camera views using the mouse. To move to a camera, the user must click on it simultaneously with the left and right mouse buttons. This will switch to the first of two camera view modes (shown in Figure 17) and additional clicks will then toggle between the two modes. The first mode is an overview looking down on the camera location and the second is a ‘first-person’ perspective, viewing from the perspective recorded by the camera.
Diagram 20
Figure 17: VIKI Camera Views
1.3.6.2 Template Matching
The Template Matching Decision Support (TMDS) receives real-time events from the NVAs and accesses the DataStore to look for related events. The Advanced Ontology-Based Reasoning System is integrated into the DataStore and operates on existing events. Both of these decision support tools can automatically carry out tasks in response to successful rule matches through a common response server.
Diagram 21
Figure 18: TMDS in MOSAIC Environment
1.3.6.3 Ontology-Based Reasoning
The ontology-based reasoning system developed by the University of Reading focuses on an integrated and flexible rule-based reasoning system that operates directly on the ontological world model defined for the MOSAIC system.
Inference systems have been integrated into the DataStore on two levels:
- Ontology (or Description Logic) reasoning in order to make the full power of the OWL ontology developed available to system users;
- Rule-based data processing using a Production Rule system in order to allow users to match events, situations and templates and effect specific actions when a defined rule is triggered.

Both are integrated closely with the MOSAIC DataStore in order to improve the overall system performance and in order to make Ontology reasoning features available via the general DataStore Web Service interfaces.
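The production-rule idea (match an event against a condition and carry out an action when the rule is triggered) can be sketched in a few lines. The rule, the event fields and the action text are hypothetical examples; they do not reflect the rule language or the response server used in MOSAIC.

```python
class Rule:
    def __init__(self, name, condition, action):
        self.name = name
        self.condition = condition   # callable(event) -> bool
        self.action = action         # callable(event) -> result of the action


def run_rules(rules, events):
    """Apply every rule to every event; return (rule name, result) pairs."""
    fired = []
    for event in events:
        for rule in rules:
            if rule.condition(event):
                fired.append((rule.name, rule.action(event)))
    return fired


rules = [
    Rule(
        "loitering_at_night",
        lambda e: e["type"] == "loitering" and e["hour"] >= 22,
        lambda e: "alert operator: camera " + e["camera"],
    ),
]
events = [
    {"type": "loitering", "hour": 23, "camera": "cam-3"},
    {"type": "loitering", "hour": 14, "camera": "cam-1"},
]
fired = run_rules(rules, events)
```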
Ontology reasoning is used in order to derive additional information from data already stored in the MOSAIC DataStore. In addition to other operations, the basic functionality provided by ontology reasoning is that the reasoning engine “explodes” a data model so that implicit relations and properties that are represented in the data model become explicit and available, for instance when querying the “exploded” data model. A simple example of this is that instances of subclasses of “car”, say instances of “compact car”, would be returned by a query for instances of “car”, without the query explicitly also including “compact car” or referring to subclasses in a more abstract manner.
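The subclass example can be reproduced with a toy class hierarchy. The dictionaries and instance names below are invented for illustration and stand in for the OWL ontology and triple store; a real reasoner derives far more than subclass membership.

```python
# Hypothetical class hierarchy and instance data.
SUBCLASS_OF = {"compact car": "car", "car": "vehicle", "van": "vehicle"}
INSTANCES = {"vehicle_17": "compact car", "vehicle_42": "van", "vehicle_58": "car"}


def superclasses(cls):
    """Yield cls and every class it is (transitively) a subclass of."""
    seen = set()
    while cls is not None and cls not in seen:
        seen.add(cls)
        yield cls
        cls = SUBCLASS_OF.get(cls)


def instances_of(cls):
    """Instances whose asserted class is cls or any subclass of cls."""
    return sorted(i for i, c in INSTANCES.items() if cls in superclasses(c))


cars = instances_of("car")
vehicles = instances_of("vehicle")
```

The query for "car" returns the "compact car" instance as well, although the query never mentions that subclass.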
1.3.6.4 Video Indexing
The video indexing and summarisation system developed for the MOSAIC system provides the viewer with an overview of the content of a video. For that purpose, it first finds the relevant information contained in the video to be summarised (video indexing) and, based on that, provides indexes and summaries (consisting of adaptively accelerated versions of the summarised video) in order to allow the users to rapidly grasp the extracted information and to navigate through it.

Potential Impact:
By providing solutions for decision-supported targeted surveillance, the MOSAIC project has a major impact on bandwidth economies:
1) NETWORK TRAFFIC
Due to the ability to pre-process events on the camera itself, thus allowing for the pre-filtering of unimportant events, the network traffic can be substantially reduced. This is enhanced by the fact that the MOSAIC decision support sub-system will support a more focused and targeted approach to surveillance, i.e. informing on the required deployment of cameras as well as informing already deployed cameras to shift attention or to go to sleep mode, thus further enhancing the reduction of network traffic.
2) HUMAN COGNITION
The MOSAIC decision support system can inform law enforcement agencies of potential needs for increased or decreased surveillance. Consequently, this will decrease the need for human intervention and help to shift human resources to where they are needed.
3) PROCESS
Due to the decreased amount of data being captured, fewer business processes are involved. Such processes normally include data encryption, anonymisation, storage, retrieval, etc.
These processes draw additional resources, which are saved by the MOSAIC platform.
4) STORAGE
As with any monitoring system, common (video-) surveillance systems generate huge amounts of data that need to be stored and retrieved when necessary. The novel MOSAIC solution supports the reduction of generated data, thus reducing the amount of data to be stored.

In addition, the results of the MOSAIC project have a major impact on the advancement of research in areas of analytics, which include:
• Data Mining/Speech Analytics
• Social Network Analytics
• Footfall Analytics

These can be applied to various domains such as shopping behaviour analysis, anti-terrorism applications, etc.
From a business model perspective MOSAIC has the following major impacts:
a) Adaptability
The flexibility of the decision engine allows for the adaptation of MOSAIC to various application domains and requirements.
b) Cost
Through the aspects mentioned above (bandwidth economies), considerable savings can be made in wide area surveillance whilst maintaining, if not increasing, the accuracy and level of threat detection. The reduction of captured video footage reduces the time required to scan and archive the video footage, thus reducing cost considerably.
c) Ease of installation & Maintenance
Through the advancement in deployment techniques and strategies the installation of surveillance systems will become increasingly easy. This will make intelligent surveillance more affordable and will have a direct impact on the security of citizens.
A final impact, that is not to be underestimated, is the improvement of privacy of citizens in Europe whilst maintaining a level of security. By using the MOSAIC platform to target only specific foci of interest in the wider area, less infringement of privacy will occur as less video footage is being taken.

List of Websites:
Project Website: http://www.mosaic-fp7.eu
Contact Details: Prof. Atta Badii
Intelligent Systems Research Laboratory, School of Systems Engineering, University of Reading, RG6 6AH, United Kingdom
email: atta.badii@reading.ac.uk
tel: +44 118 378 7842
fax: +44 118 975 1994
