Strategic Crime and Immigration Information Management System
BAE SYSTEMS INTEGRATED SYSTEM TECHNOLOGIES LTD
Warwick House, PO Box 87, Farnborough Aerospace Centre
GU14 6YU Farnborough
Private for-profit entities (excluding Higher or Secondary Education Establishments)
€ 665 427
Claire Dance (Ms.)
INDRA SISTEMAS SA
€ 359 405
COLUMBA GLOBAL SYSTEMS LIMITED
DENODO TECHNOLOGIES SL
€ 338 869,25
ELSAG DATAMAT S.P.A.
MAGYAR TUDOMANYOS AKADEMIA SZAMITASTECHNIKAI ES AUTOMATIZALASI KUTATOINTEZET
€ 252 726,25
UNIVERSIDADE DA CORUNA
€ 164 943,20
SELEX SISTEMI INTEGRATI SPA
€ 247 707,50
GREEN FUSION LIMITED
€ 289 918,25
Grant agreement ID: 218223
1 November 2009
31 October 2012
€ 3 595 562,80
€ 2 318 996,45
BAE SYSTEMS INTEGRATED SYSTEM TECHNOLOGIES LTD
Tackling trans-European crime
Final Report Summary - SCIIMS (Strategic Crime and Immigration Information Management System)
The 'Strategic Crime and Immigration Information Management System' (SCIIMS) project has developed new and innovative capabilities and technologies in the field of Information Management and Information Exploitation (IM/IX) for combating ‘People Trafficking’ and ‘People Smuggling’. These improve the ability to search, mine and fuse information from massive datasets, thus increasing situational awareness and improving decision-making.
Input from user groups active in working on immigration, intelligence and organised crime as well as subject matter experts was used to guide the research activities.
The operational problem space was modelled and used to engage the users, giving a better understanding of capability gaps and IM/IX problems. Security technologies and policies were assessed and analysed in detail. The information obtained was used to produce a set of user requirements and a set of collaborative requirements to inform the design of systems and innovative capabilities.
The main research areas were unsupervised web crawling and navigation, data mining, entity resolution and visualisation, with research papers published in each of these areas. A study was also carried out into the problems of representing an investigation within a team while avoiding premature conclusions, and into the need to justify any actionable findings.
An integrated demonstration system was produced implementing the developed tools and algorithms in a Service Orientated Architecture (SOA). This system supports the Pirolli-Card (P-C) cognitive model and allows an investigator to both forage for information and to make sense of it by logically constructing the investigation argument and evidence supporting it. These are often not clear cut; this system allows hypotheses and probabilistic information to be used when required. Central to this is the use of an Ontology that allows a much closer representation of the real world domain than a conventional relational database and allows much more detailed and varied information to be found.
An Adversaries Model describing the roles of the people trafficking criminals and investigator provided a scenario for the demonstration system which was used to conduct demonstrations and experiments to gauge the effectiveness of the developed capabilities. An analysis of the generally positive results was carried out.
The potential exploitation of the developed capabilities and technologies is described in an exploitation plan and route map. Possibilities include maturing SCIIMS and developing it further as a criminal investigation system, and making use of the developed IM/IX technologies and knowledge gained which are applicable to many other areas.
The SCIIMS project has made numerous innovations and developed beyond state of the art capabilities (including Data Mining, Web Crawling, RDF data virtualisation, investigation representation, Ontology and IM/IX tools, and the Adversaries Model, amongst others). The advances made by the project contribute to a much-needed technology base for IM/IX and greater knowledge for enhanced security capabilities, and help combat the ill effects of organised crime.
Project Context and Objectives:
People Trafficking and People Smuggling have long been a problem for European Governments, adversely affecting the security of their citizens. In many cases women and children are forced into the sex trade and subjected to labour exploitation. In formulating the SCIIMS project the consortium focused on an overarching research question to guide the developed capabilities, demonstrations and experiments:
In the European Union context how can new capabilities improve the ability to search, mine and fuse information from national, trans-national, private and other sources, to discover trends and patterns for increasing situational awareness and improving decision making, within a secure infrastructure to facilitate the combating of organised crime and in particular people trafficking/smuggling to enhance the security of citizens?
The SCIIMS Consortium has utilised ‘State of the Art’ products which form the base capability on which to develop new innovative capabilities and technologies. This approach was designed to provide an early exploitation opportunity for the consortium and the user groups involved.
The programme addressed the following research areas:
a) Development and application of Information Management (IM) and Information Exploitation (IX) techniques enabling information to be fused and shared nationally and trans-nationally within a secure information infrastructure in accordance with European crime and immigration agencies information needs;
b) Development and application of tools to assist decision making in order to predict and analyse likely People Trafficking and People Smuggling sources, events and links to organised crime;
c) Utilisation and enhancement of existing ‘State of the Art’ products to develop and incorporate new capabilities, ‘Beyond State of the Art’ into product baseline in order to speed introduction of new innovative techniques, technologies and systems.
To achieve these objectives the following activities were undertaken:
a) Capability and Technology Research including involvement of appropriate User Groups to identify specific capability gaps. This includes an analysis of legislation e.g. data protection, privacy, as well as security infrastructures;
b) Production of a collaborative requirements set including User and System Requirements in order to inform design of systems and innovative capability;
c) Design and development of applications and algorithms to enable trend analysis of information, information fusion, data mining/integration and decision tools;
d) Define and research technology route maps and exploitation plans to enable the consortium to exploit the developed capabilities and technologies as well as disseminate findings and recommendations to the user community and European Commission (EC);
e) Conduct of selective system testing including demonstrations to selected users in order to seek input and recommendations for other capabilities;
f) Plan and conduct experimentation in order to verify and measure the improvements and advantages of the developed capabilities and technologies over and above an agreed baseline. Analysis of the completed experimentation was carried out, in order to inform further iterations of development within the programme.
The SCIIMS project has developed new and innovative capabilities and technologies in the field of Information Management and Information Exploitation (IM/IX) using an exemplar of ‘People Trafficking’ and ‘People Smuggling’.
The Project focused on the following:
a) Innovative research into improving the ability to search, mine and fuse information and to discover trends and patterns for increased shared situational awareness.
b) Developing applications and algorithms based on the research, and integrating them into a Prototype/Demonstration system along with existing technologies. This demonstrates how the technologies and beyond state of the art capabilities can be used together.
c) Experiments to measure the effectiveness of the developed capabilities.
d) An Ethics Advisory Board (EAB) provided guidance and recommendations to ensure that SCIIMS meets the EU ethical and legal requirements, including data security.
User Group feedback (augmented by subject matter experts) was used to help define the requirement and research directions. Their feedback from demonstrations and experiments was used to gauge the effectiveness of the developed capabilities and the system as a whole.
The SCIIMS consortium is made up from a mix of industry, small medium enterprises (SME), and academia as follows:
a) BAE Systems (UK) - Programme Coordinator
b) Defence Fusion International (Ireland) – SME partner
c) Denodo Technologies (Spain) – SME partner
d) Indra Sistemas (Spain) – Technical Lead
e) Selex SI (Italy) – Industrial partner
f) Sztaki (Hungary) – Academic partner
g) University of A’Coruña (Spain) – Academic partner
Research focus and user group engagement
User Group Engagement
The project planned to use feedback from User Groups to ensure that the research was in the correct direction, to help define a set of user requirements for a Prototype/Demonstration System and also for demonstrations and experiments.
However, a general lack of User Group engagement was a problem throughout the project. Attempts to engage the following were unsuccessful: Italian Carabinieri, Italian Police (Planning and Immigration departments), Italian GDF (Guardia di Finanza), DAC Agency (Direzione Centro Operativo), EUROPOL, and the UK Border Agency.
The project did have four participating user groups: the Hungarian National Bureau of Investigation (NBI), Spanish Intelligence Centre for Organised Crime (CICO), the Irish Naturalisation and Immigration Service (INIS), and the Serious Organised Crime Agency (SOCA).
Progress with these User Groups was very slow, and it proved very difficult to arrange meetings and receive feedback from them. The reasons varied: staff had little motivation to talk to the project because they would have to do so in their own time; elections (for instance in Spain) affected the organisation of the user groups; and some User Groups were being restructured or potentially replaced (for instance SOCA by the National Crime Agency (NCA)).
In order to overcome the impact of this the project used the alternative strategy of consulting a number of Subject Matter Experts (SME) to provide the project with necessary guidance.
However the project did receive some feedback mainly on the type of systems used and about data sharing between organisations. Latterly SCIIMS was demonstrated to the NBI (two IT experts from the Hungarian Constitution Protection Office were also present) and they provided some useful generally positive feedback.
Ethical Advisory Board (EAB)
An assessment of societal and ethical issues raised by the programme has been conducted by the Ethics Advisory Board (EAB). An initial Societal Legal and Ethical Issues report was produced soon after the project started, specifying framework guidance against which the project would operate. The guidance, summarised here, included the following. EU countries have data protection laws in accordance with directive 95/46/EC:
a) Personal Data. All personal data, including public data, must be obtained in a fair and lawful form. Data use and management must comply with a legitimate, explicit and determined purpose.
b) Individual rights. An individual’s right to know about, or object to the use of data and challenge the interpretation of that data must be guaranteed. Avoid attributing features, profiles, identities or certain aspects of personality.
c) Data Segregation. The organisation and storage of data must avoid prejudice and bias. The EAB recommends the establishment of different databases for different subject categories.
d) Data Security. Data security must be achieved by access control, audit mechanisms and appropriate encryption of data at rest and data in motion.
Three yearly reports followed, each based on an explanation of the work being carried out and an assessment of the deliverables produced. None of these reports identified any significant ethical issues with the work carried out.
To ensure there were no data protection issues with the data being used by the project, a SCIIMS Dataset document (written by BAE Systems with input from the consortium members) was produced that described all the datasets used and how they are generated. This has been reviewed by the EAB with no issues being identified; this is covered in the report for the second year of the project.
The EAB also assessed whether there would be any issues for a real-life deployment of a SCIIMS system. Their conclusion was:
“This analysis shows that legal and ethical issues covering personal Data, Individual rights, Data Segregation and Data Security do not preclude the use of a deployed SCIIMS system, given that all the aforementioned regulations are respected.”
A Dual Use Analysis document was produced by BAE Systems and the EAB did not identify any significant issues.
Both the issues for a real-life deployment of a SCIIMS system and dual use are included in the final EAB report.
User Domain and Research Report
The initial focus was concentrated upon the understanding of the problem space and the modelling of an appropriate business process that would reflect the actions of a crime intelligence analyst. This was carried out by BAE Systems utilising IDEF and SysML modelling techniques in order to produce appropriate operational views of the likely business process. In general terms the following modelling views were produced:
a) High level operational concept;
b) A capability vision;
c) A capability taxonomy;
d) An activity model.
The activity model was of use when engaging with users in order to validate possible business processes and thereby understand what capabilities, and later on services, would be needed to conduct specified tasks. It was found, however, that each Member State has different ways of approaching the problem, and therefore the activity model provided a generic view of the business processes involved. The modelling was instrumental in deriving the User Requirements.
The vision and research focus, along with use cases, have been captured in the User and Domain Research Report (delivery No 2 Annex I) written by Selex.
Security Requirements and Analysis, Infrastructure, Data Protection Policy and Controls
The security technologies and policies were assessed and the findings captured in the Security Requirements Analysis Infrastructure and Controls Report (delivery No 4 Annex I) by Selex. Significant findings included the data protection requirements that influence the design and implementation of the capabilities being developed. These requirements also take into account the EAB Framework and include:
a) Access Control: information must be accessible only by users that have rights and have been authorised to access them.
b) Compartmentalisation of data: users may only access those parts of the information for which they are responsible
c) Data Integrity: managed information must be protected from data tampering, damage, or deletion caused by unauthorised users or by accidental events;
d) Data sharing between organisations
e) Data Anonymity: data before analysis must be disguised to ensure that people identities are protected, and only revealed at appropriate levels;
f) Events Traceability of access or attempts to export data without authorisation;
g) Secure data transfer
h) Data Purging: sensitive data no longer required is removed from the system
i) Categories Splitting up: separation of different categories of data to avoid prejudice and possible unjustified discrimination;
j) Aggregation control: aggregation of information belonging to different categories is only possible for users who have the explicit right to do so;
k) Trusted computing base: sensitive data must be maintained in a trusted environment, and handled by trusted system administrators;
l) Information Flow: no information belonging to different categories may be copied from one to another unless this is allowed by the access control policy.
Encryption techniques and other security technologies were investigated to inform the research how the documented security requirements can be met.
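Three of the requirements above (access control, compartmentalisation of data and aggregation control) can be illustrated with a minimal sketch. The `User` class, the category names and the `may_aggregate` flag are hypothetical and are not taken from the SCIIMS design; the sketch only shows how such checks might be expressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    categories: frozenset        # compartments the user is responsible for (assumed model)
    may_aggregate: bool = False  # explicit right to combine categories (assumed flag)


def can_read(user, record_category):
    """Access control and compartmentalisation: read only within own compartments."""
    return record_category in user.categories


def can_aggregate(user, categories):
    """Aggregation control: combining several categories needs an explicit right."""
    cats = set(categories)
    if not cats <= user.categories:
        return False
    return len(cats) <= 1 or user.may_aggregate


analyst = User("analyst", frozenset({"immigration", "crime"}))
supervisor = User("supervisor", frozenset({"immigration", "crime"}), may_aggregate=True)
```

In this sketch the analyst can read each category separately but cannot combine them, while the supervisor can do both.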
Web Crawling and Data Mining research was carried out by the University of A’Coruña and Sztaki respectively.
There are two approaches to creating programs able to automatically browse and extract structured data from websites:
a) Semi-automatic or supervised, which requires creating specialised “wrappers” for each target website. The wrapper needs to be updated every time the website format changes. Wrappers are created and maintained by human administrators.
b) Automatic or unsupervised which:
i) requires an initial description of the intended data extraction task created by a human expert;
ii) uses automatic navigation techniques to explore the target websites searching for relevant data (according to the aforementioned initial description);
iii) recognises and learns to interpret and use relevant query forms;
iv) automatically extracts the relevant data contained in the explored page as structured data.
The unsupervised approach tends to produce less accurate results (around 90% accuracy) but, in turn, requires very little human intervention and can therefore scale to hundreds of websites.
Research has been conducted on improving automatic web crawling, focusing on improving the extraction of data in the unsupervised approach. Research was carried out into the kinds of websites that will be relevant to SCIIMS. These included online classified advertisements, social networks and job websites. An analysis of the state of the art techniques showed a number of data extraction problems with these types of websites. As a result of this an improved algorithm has been produced which has the following beyond state of the art features:
a) it requires only one page containing a list of data records as input;
b) it handles nested data structures (sub lists);
c) it enables the linkage between master-details pages;
d) it has a series of heuristics that help to remove unwanted and unimportant data (banners, navigation links, menus, etc.).
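As a rough illustration of feature d) only, the sketch below drops content that sits inside typical page-furniture tags and keeps the remaining list items as candidate data records. It is a deliberately crude stand-in written with Python's standard `html.parser`; the tag list and the sample page are assumptions, and the project's actual heuristics are not described in this summary.

```python
from html.parser import HTMLParser

# Assumed set of tags that usually hold banners, menus and navigation links.
NOISE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class RecordExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noise_depth = 0   # > 0 while inside a noise container
        self.in_item = False   # True while inside a kept <li>
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag == "li" and self.noise_depth == 0:
            self.in_item = True
            self.records.append("")

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item and self.noise_depth == 0:
            self.records[-1] += data.strip()


# Hypothetical classified-advertisements page.
PAGE = """
<nav><ul><li>Home</li><li>Login</li></ul></nav>
<ul><li>Waitress wanted, Madrid</li><li>Au pair, Dublin</li></ul>
<footer><ul><li>Contact</li></ul></footer>
"""

extractor = RecordExtractor()
extractor.feed(PAGE)
records = extractor.records
```

Here `records` holds the two advertisement entries, while the navigation and footer items are discarded.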
Experiments have been carried out to test the algorithm's effectiveness. These experiments used 170 well-known websites in five category domains, including those most relevant for SCIIMS, and achieved the following results:
a) totally correct records: above 86% recall and precision;
b) partially correct records (most of the data fields are correctly identified, but some attributes are missing): 92.5% precision and 93% recall.
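For readers unfamiliar with the two measures quoted above, the following sketch shows how precision and recall are conventionally computed for an extraction task. The record identifiers are invented for illustration and are not project data.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) of extracted records against a labelled reference set."""
    extracted, expected = set(extracted), set(expected)
    correct = extracted & expected
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical page with four true records; the extractor returns four items,
# three of which are real records and one of which is a banner.
expected = {"ad-1", "ad-2", "ad-3", "ad-4"}
extracted = {"ad-1", "ad-2", "ad-3", "banner-9"}
precision, recall = precision_recall(extracted, expected)
```

Precision penalises unwanted items that were extracted (the banner), while recall penalises true records that were missed (the fourth advertisement).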
Research has also been conducted to improve the automatic navigation element of Web Crawling, and research papers have been produced covering both this and the automatic extraction of data from the web.
Automated web navigation with conventional browsers is computationally expensive, especially with the new breed of websites (Ajax, etc.). In complex pages, CPU usage is high due to script execution, and memory usage is also significant because the entire Document Object Model (DOM) tree of the page is loaded into memory. This is a problem for automated navigation where 10s or 100s of browsers may be executing different sequences at the same time.
A web automation browser knows the navigation sequence in supervised mode because the navigation sequences are configured in advance. In unsupervised mode the navigation sequences are known after the first execution of the crawler, when the routes to the relevant information are found. Therefore, in theory, the browser could load only those elements and execute the scripts which are needed for the target sequence.
It is difficult to know the elements required by a certain navigation sequence because of the complex dependencies between elements and scripts. The approach that has been taken is to carry out a test execution and collect information on the dependencies by monitoring events of the embedded script execution engine. Only the required elements are loaded or executed for subsequent executions.
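The idea of a recording run followed by optimised runs can be sketched as follows. This is a simplified model under stated assumptions: the page is represented as a hypothetical dictionary of elements and their declared dependencies, whereas the approach described above collects this information by monitoring events of the script execution engine.

```python
# Hypothetical page model: element or script name -> things it depends on.
PAGE = {
    "search_form": ["validate.js"],
    "validate.js": ["jquery.js"],
    "jquery.js": [],
    "ads_banner": ["tracker.js"],
    "tracker.js": [],
    "carousel": ["jquery.js", "carousel.js"],
    "carousel.js": [],
}


def record_dependencies(page, sequence):
    """Test execution: collect everything the navigation sequence touches."""
    required = set()
    stack = list(sequence)
    while stack:
        name = stack.pop()
        if name in required:
            continue
        required.add(name)
        stack.extend(page[name])
    return required


def load(page, only=None):
    """Return the names loaded: everything on the first run, a subset afterwards."""
    return set(page) if only is None else set(page) & only


sequence = ["search_form"]                   # the navigation sequence uses only the form
required = record_dependencies(PAGE, sequence)
first_run = load(PAGE)                       # a conventional browser loads all elements
later_runs = load(PAGE, only=required)       # subsequent runs skip the rest
```

In this toy model the first run loads seven items and later runs load three, since the banner, tracker and carousel are never needed by the sequence.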
Experiments were carried out to test the effectiveness of this. Twenty seven well-known websites were selected from the top fifty Alexa Websites and example sequences which automate the main function of the websites were carried out. This showed that:
b) On average HTMLUnit (open source custom browser for web automation testing) is 3.15 times slower and Microsoft Internet Explorer is 5.03 times slower.
The following research papers have been written:
a) Efficient Execution of Web Navigation Sequences
i) Work on automating navigation sequences in complex websites published in the journal Data and Knowledge Engineering from Elsevier. Journal indexed in the Computer Science - Information Systems area of the Journal Citation Reports, in the position 41 out of 135. Impact Factor: 1.422.
ii) Presented at the 13th International Conference on Web Information Systems Engineering (WISE). Conference indexed Category A (highest category) in the core.edu.au conference index. 18% acceptance rate.
iii) Extended version of the optimisation work submitted to the journal IEEE Transactions on Knowledge and Data Engineering. Journal indexed in the Computer Science - Information Systems area of the Journal Citation Reports, in the position 29 out of 135. Impact Factor: 1.657.
iv) A work-in-progress paper was presented at the ZOCO 2012 Workshop on Agents and Multi-agent Systems for Enterprise Integration. This workshop was part of the 10th International Conference on Practical Applications of Agents and Multi-Agent Systems (PAAMS) 2012.
b) Automatically Extracting Complex Data Structures From the Web
i) Presented at the 4th International Conference on Knowledge Discovery and Information Retrieval 2012. 28% acceptance rate.
ii) Extended version has been submitted to the journal Data and Knowledge Engineering from Elsevier. Journal indexed in the Computer Science - Information Systems area of the Journal Citation Reports, in the position 41 out of 135. Impact Factor: 1.422.
Research has been carried out into entity resolution, which covers the problem of identifying distinct representations of real-world entities in heterogeneous databases. Two interrelated problems are to be distinguished, and to be tackled with the following approaches:
a) Attribute-based methods which consider the input data as a set of records made up of attributes and use a resolution process based on record similarities.
b) Link based methods which receive two graphs (networks in which entities are the nodes and relationships between entities are the links) as input. The goal of the resolution process is to produce a resolved entity graph, where nodes are entity instances that hold together the records belonging to the same entity.
The aspect of Entity Resolution with the largest influence on quality is finding an accurate rule for deciding when data records refer to the same entity. Quality can be measured by counting the number of correctly identified records and the number of incorrect identifications. The other central concern is efficiency. A brute force approach compares all pairs of records, giving an algorithm with running time proportional to n^2, which is infeasible for large amounts of data.
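The cost of the brute force approach can be seen in a small sketch. It contrasts all-pairs comparison with blocking, a common way to reduce the number of candidate pairs by only comparing records that share a key. Blocking is used here purely as an illustration and is not necessarily the method of the papers cited in this section; the records and the choice of phone number as the blocking key are invented.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; two people each appear twice with name variations.
records = [
    {"id": 1, "name": "Anna Kovacs", "phone": "555-0101"},
    {"id": 2, "name": "Anna Kovács", "phone": "555-0101"},
    {"id": 3, "name": "Bela Nagy", "phone": "555-0199"},
    {"id": 4, "name": "B. Nagy", "phone": "555-0199"},
    {"id": 5, "name": "Carla Ruiz", "phone": "555-0142"},
]


def brute_force_pairs(recs):
    """Every pair is a candidate: n * (n - 1) / 2 comparisons."""
    return list(combinations(recs, 2))


def blocked_pairs(recs, key):
    """Only records sharing the same blocking key are compared."""
    blocks = defaultdict(list)
    for rec in recs:
        blocks[key(rec)].append(rec)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs


all_pairs = brute_force_pairs(records)
candidate_pairs = blocked_pairs(records, key=lambda rec: rec["phone"])
candidate_ids = sorted((a["id"], b["id"]) for a, b in candidate_pairs)
```

With five records the brute force approach yields ten pairs, while blocking on the phone number leaves two. The trade-off is that a poorly chosen key can separate records that do belong together.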
The trade-off between the storage needs and the efficiency of Entity Resolution in relational database management systems (RDBMS) of real data has been examined (see paper Entity Resolution with Heavy Indexing below).
In the case of large amounts of data, standard solutions in relational databases are inefficient. As a possible solution for scaling the algorithms, Sztaki studied several distributed paradigms to find reliable and scalable solutions (see paper Infrastructures and Bounds for Distributed Entity Resolution below).
In the accepted paper (Flexible and Efficient Distributed Resolution of Large Entities) Entity Resolution problems with complex notions of an entity are considered. These algorithms are beyond state of the art, and can efficiently incorporate probabilistic and similarity-based record match, enabling flexible match function definition.
Developing link based Entity Resolution (LER) algorithms is an on-going research area. A couple of algorithms have been implemented, evaluated and compared to a baseline method run on an anonymised real-world dataset. Sztaki intends to extend the set of available algorithms and perform more measurements to tweak parameters of the algorithms. They are testing their algorithms using a collection of customer promotion game activity logs conducted by an insurance company to identify potential advertisement target groups. The key concept of these online games was users increasing their winning chance by inviting others.
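One simple signal a link based method can use is the overlap between the neighbourhoods of two nodes. The sketch below merges nodes whose neighbour sets are sufficiently similar (Jaccard similarity) and groups them with a union-find structure. It is a generic illustration, not one of the Sztaki algorithms; the node names, the invitation graph and the 0.7 threshold are all assumptions.

```python
def jaccard(a, b):
    """Similarity of two neighbour sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical invitation graph: account -> accounts it is linked to.
neighbours = {
    "user_17": {"u1", "u2", "u3"},
    "user_71": {"u1", "u2", "u3", "u4"},
    "user_40": {"u8", "u9"},
}


def resolve(neighbours, threshold=0.7):
    """Group nodes whose neighbourhoods overlap strongly into resolved entities."""
    parent = {node: node for node in neighbours}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    nodes = sorted(neighbours)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if jaccard(neighbours[a], neighbours[b]) >= threshold:
                parent[find(b)] = find(a)

    clusters = {}
    for node in nodes:
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())


entities = resolve(neighbours)
```

In this example `user_17` and `user_71` share three of four neighbours and are resolved to one entity, while `user_40` stays separate.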
Research has been conducted into techniques for the visualisation of heterogeneous data sets as a network of entity nodes with arbitrary connections. Special rules (according to defined business logic) map specific information content into links to form the network.
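A minimal sketch of this idea is given below: small rule functions turn heterogeneous records into links, and the resulting network can then be searched, for example for the shortest path between two entities. The record fields, the two rules and the identifiers are hypothetical and only illustrate the principle.

```python
from collections import defaultdict, deque

# Hypothetical heterogeneous records (people and web advertisements).
records = [
    {"type": "person", "id": "P1", "phone": "555-0101"},
    {"type": "advert", "id": "A1", "contact_phone": "555-0101", "site": "jobs.example"},
    {"type": "advert", "id": "A2", "contact_phone": "555-0177", "site": "jobs.example"},
    {"type": "person", "id": "P2", "phone": "555-0177"},
]


def phone_rule(recs):
    """Link a person to every advert that lists their phone number."""
    people = {r["phone"]: r["id"] for r in recs if r["type"] == "person"}
    for r in recs:
        if r["type"] == "advert" and r["contact_phone"] in people:
            yield people[r["contact_phone"]], r["id"]


def same_site_rule(recs):
    """Link adverts that were published on the same website."""
    by_site = defaultdict(list)
    for r in recs:
        if r["type"] == "advert":
            by_site[r["site"]].append(r["id"])
    for ids in by_site.values():
        yield from zip(ids, ids[1:])


def build_network(recs, rules):
    graph = defaultdict(set)
    for rule in rules:
        for src, dst in rule(recs):
            graph[src].add(dst)
            graph[dst].add(src)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of node ids or None."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph[path[-1]]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


network = build_network(records, [phone_rule, same_site_rule])
path = shortest_path(network, "P1", "P2")
```

The two people have no direct link, but the rules connect them through the two advertisements posted on the same site.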
As a result improved and new tools for visual analytics have been produced. The collected data is represented as a network of heterogeneous entities. This representation is a new way of using collected law-enforcement related data. The implemented network model is provided with user friendly search, browsing and flexible visualisation tools. Additional tools provide the possibility to find similar entities and also complex patterns in the network.
The framework can be used to visualize large data sets. Experiments conducted with the documents of Wikipedia showed that the tool can handle millions of documents (nodes) and hundreds of millions of interconnecting edges. To explore such a complex data set, users are supported by advanced search technologies. New methods were developed to select relevant subnetworks for textual queries. These new algorithms are extensions of standard information retrieval techniques. Sztaki are currently conducting experiments to incorporate visualization of dynamically changing networks into the framework.
The following research papers related to SCIIMS have been produced:
a) Csaba I. Sidló: Entity Resolution with Heavy Indexing, ADBIS 2011 (Advances in Databases and Information Systems 2011 conference)
b) Csaba Istvan Sidló, András Garzó, András Molnár, András A. Benczúr: Infrastructures and Bounds for Distributed Entity Resolution, QDB 2011 (9th International Workshop on Quality in Databases, in conjunction with VLDB 2011, August 29th, 2011, Seattle, USA)
c) András J. Molnár, András A. Benczúr, and Csaba István Sidló: Flexible and Efficient Distributed Resolution of Large Entities, FOIKS 2012 (Foundations of Information and Knowledge Systems)
Sztaki participated in the IEEE VAST challenges and were awarded as follows:
a) N. Bánfi, L. Dudás, Zs. Fekete, J. Göbölös-Szabó, A. Lukács, A. Nagy, A. Szabó, Z. Szabó: City Sentinel – VAST 2011 Mini Challenge 1 Award: “Outstanding Integration of Computational and Visual Methods”
b) L. Dudas, Zs. Fekete, J. Gobolos-Szabo, A. Radnai, A. Salanki, A. Szabo, G. Szucs: OLAP approach in anomaly detection - Award: Good Support for the Data Preparation, Analysis, and Presentation Process. IEEE VAST Challenge 2012
The following documents have been produced covering the research:
a) Mathematical Models and Proof of Concept (delivery No 5 Annex I).
b) Mathematical Models and Proof of Concept Volume 2, Supplement to the Research
c) A Web Crawling and Data Mining summary document
A set of user requirements has been captured by Indra and DFI. The user requirement set was extensively reviewed by the consortium using SMART methodology as not all user requirements captured could be implemented within time and budget. The user requirements are captured within a database structure which formed part of the User Domain and Research Report (delivery No 2 Annex I) and User Requirements (delivery No 6 Annex I).
A set of system requirements has been written by the consortium members and analysed. These are captured in the same database as the user requirements with tracing between them.
Both the user requirements and system requirements have been base-lined.
The full requirements set, if implemented, would have exceeded the time and budget constraints. To ensure this did not happen, the requirements were tagged to identify whether they are essential or desirable to the Programme.
An Adversaries Model has been developed by BAE Systems describing the side of both the criminals and of the official organisations, explicitly stating all the roles involved in both. The crime and investigation are depicted as UML-style state diagrams within which the actions of each individual are shown in rows with arrows indicating the interactions between the individuals. The Adversaries Model also details how the essential tasks and services, such as the Web Crawler, are used when executing SCIIMS. The model covers one typical scenario.
The Adversaries Model has the following innovations:
a) It specifies in detail the test data required and how it should be configured for the crime scenario. This data can be used for demonstrations and experiments, for instance it can be hidden in large volumes of information fed into SCIIMS.
b) It can be used to derive the specification of the operational part of the Ontology.
This model has been used for Data Mining demonstrations and as the basis for demonstrations and experiments.
An overall System Design Specification (delivery No 8 Annex I) which describes the hardware and software architectures and the modules has been written by Indra with contributions from the other consortium members. Relational Databases are used for storing foraged data. A fundamental component to the System Design is the use of an ontological and semantics approach to describing the underlying information and the management of such information.
System design and interface specifications have been written for each module which consist of:
a) Data Mining (Sztaki) – Module 4 has three sub modules: Entity Resolution, Classification and Network. Entity resolution uses various techniques to identify the underlying entities from complex heterogeneous data sets. Classification trains the algorithms to classify data by using annotated example data, for instance web advertisements annotated to indicate whether they are recruiting for the purpose of people trafficking. The Network sub module constructs the information for use by the Graphical Viewer module. The ability to carry out data mining is a key aspect of hypotheses analysis.
b) Data Services Layer (Denodo) – The purpose of this module is to combine, integrate and transform information obtained from the different data sources. The information can then be used by the other SCIIMS modules. This module supports different types of data and protocols, e.g. a web service with SOAP protocol. Data virtualization technology for RDF / OWL has been developed which is beyond state of the art.
c) Web Crawler (University of A’Coruña) – This module enables the SCIIMS system to collate data from Open Source Intelligence (OSINT) sources, in particular from Web sources. It also has the capability of crawling the Deep Web and provides different ways of nominating target websites. It provides mechanisms for interrogating websites without leaving any footprints. Web crawling can be either semi-automatic (supervised) or automatic (unsupervised). The latter method allows the crawler to adapt to changes in the target websites.
d) Visualisation & Information Access (DFI) – This module is the primary graphical interface by which users will interact with the SCIIMS system using constructs such as User Management, Investigation Management (including moving data between artefacts (e.g. shoebox -> evidence file)), Alert Management, Searching (external sources and SCIIMS repository), Ontology Querying and Viewing, Network Functions Graphical Viewer (shortest path, entity resolution) and the Schema Viewer for the overall conduct of the investigation.
e) Ontology Model (Selex with input from BAE Systems) – Module 12 forms the ontology model. It is a formal, explicit description of the concepts in a specific domain (classes), the properties of each concept that describe its attributes, and the relationships that link concepts.
f) Graphical Query Interface (Selex) – The graphical query component can be considered an important step of the data analysis since it is in charge of extracting information from the knowledge base and making it available for the other modules that need such information extracted from a Semantic Repository. The standard way to query a semantic repository such as the ontology is by using SPARQL statements. This module provides an interface that permits the user to easily build the query without needing to know the query language. The query can be graphically built by selecting entities, attributes and relations from the underlying data model. The user can also save and load graphical queries for further use.
g) Graphical Viewer (Sztaki) – The purpose of this module is to provide a visualisation of the network of entities and to provide an easy-to-use user interface for searching and browsing the database of entities. Search facilities include structured search, full text search and similarity search for text based attributes and searching for shortest paths between two entities. Sub-graph/pattern searches can be carried out which only display sub-graphs that meet specified criteria. Different layouts can be selected by a user to help analyse the data.
h) Analysis of Competing Hypotheses (ACH) (BAE Systems) – An analyst may need to carry out a probabilistic analysis of the alternative hypotheses. This module allows this to be done using Bayesian Belief Network (BBN) technology. The module has a specialist HCI focused on completing, validating and executing a BBN. Once this has been completed, the network can be exported to the Ontology. Similarly, data from the Ontology can be imported.
i) Integration (Indra) – This module deals with the low level software architecture of SCIIMS. SCIIMS has a Service Orientated Architecture (SOA). To support this architecture an Enterprise Service Bus (ESB) is used, which enables existing modules and applications to be implemented as services. Several processes, such as the web crawler and data mining, can have long execution times; SCIIMS therefore has an asynchronous architecture.
j) Shared Information (Indra) – This module allows intelligence reports to be created and distributed via email.
k) Alerts (BAE Systems) – As an analyst works on an investigation there may be changes to or additional information upon which the investigation is based. This module allows an analyst to specify the alert criteria for changes to the data in Ontology and will generate the alerts accordingly using an email system.
l) Entity Editor (BAE Systems) – This allows a user to navigate the Ontology, which is displayed as a graphical structure, and provides direct read and write access to its contents. It gives an investigator another means of accessing the information in the Ontology.
m) Synchroniser (BAE Systems) – This updates the Ontology with data from the Relational Databases.
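The Graphical Query Interface in item f) turns a user's selection of entities, attributes and relations into a SPARQL statement. A minimal Python sketch of that idea follows. The `ex:` prefix, the class and property names, and the `GraphicalQuery` and `build_select_query` names are all hypothetical illustrations; they are not SCIIMS ontology terms and this is not the project's code.

```python
from dataclasses import dataclass, field


@dataclass
class GraphicalQuery:
    """A query assembled from user selections (hypothetical structure)."""
    entity_class: str                                # e.g. "ex:Person"
    attributes: list = field(default_factory=list)   # e.g. ["ex:name"]
    relations: list = field(default_factory=list)    # (property, target class) pairs


def build_select_query(q: GraphicalQuery, prefix_iri: str = "http://example.org/") -> str:
    """Translate the selections into a SPARQL SELECT statement."""
    variables = ["?entity"]
    patterns = [f"?entity a {q.entity_class} ."]
    # One triple pattern and one result variable per selected attribute.
    for i, attr in enumerate(q.attributes):
        var = f"?attr{i}"
        variables.append(var)
        patterns.append(f"?entity {attr} {var} .")
    # Each selected relation links the entity to another typed entity.
    for i, (prop, target_class) in enumerate(q.relations):
        var = f"?related{i}"
        variables.append(var)
        patterns.append(f"?entity {prop} {var} .")
        patterns.append(f"{var} a {target_class} .")
    body = "\n  ".join(patterns)
    return (
        f"PREFIX ex: <{prefix_iri}>\n"
        f"SELECT {' '.join(variables)}\n"
        f"WHERE {{\n  {body}\n}}"
    )


# Hypothetical selection: persons, their names, and persons they travelled with.
query = GraphicalQuery(
    entity_class="ex:Person",
    attributes=["ex:name"],
    relations=[("ex:travelledWith", "ex:Person")],
)
sparql_text = build_select_query(query)
print(sparql_text)
```

Because the selection is held as plain data, it could be saved and reloaded independently of the generated SPARQL text, which is one way the save and load feature described in item f) might be approached.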
Organising a Criminal Investigation
Consideration has been given by BAE Systems to the problems of providing computer-assistance to an Investigator, perhaps operating in a team and needing to justify any actionable findings. A number of techniques have been identified and combined into a conceptual system of thinking. This influenced the System Design and several modules, especially ontology, HCI and ACH/BBN.
The originating techniques are:
a) Pirolli-Card (P-C) – a cognitive model that shows how an investigator typically conducts an Investigation
b) Goal Structured Notation (GSN) – a general information-representation technique that can be used in SCIIMS to show a hierarchy of findings, linking evidence to actionable conclusions
c) Analysis of Competing Hypotheses (ACH) – an investigative technique that helps to avoid making premature conclusions. The Investigator is supported in maintaining several hypotheses to explain some finding, only excluding those explanations when evidence contradicts them.
d) Bayesian Belief Networks (BBN) – a probabilistic technique used to reason with a system of related hypotheses. In SCIIMS it is used to process the hypotheses of an ACH when the ACH would otherwise be too hard to consider mentally. A problem is to minimise the commonly-perceived overhead of using a BBN.
It is important to avoid overkill. Operational feedback is that most investigations are rather simple, with an obvious link between the evidence and the conclusions. In SCIIMS, it is possible to conduct a simple investigation, and then grow it to use GSN, ACH and BBN techniques as the need arises. The Netica tool is used to construct Bayesian Belief Networks graphically as necessary to support an investigation.
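The step from an ACH to a probabilistic treatment can be illustrated with a much simpler calculation than a full BBN: a Bayesian update over a set of mutually exclusive hypotheses. The Python sketch below is an illustration only. It does not use Netica, and the hypothesis names, priors and likelihoods are invented for the example.

```python
def update_beliefs(priors, likelihoods):
    """Return posterior P(H | E) for mutually exclusive, exhaustive hypotheses.

    priors:      {hypothesis: P(H)}
    likelihoods: {hypothesis: P(E | H)} for one item of evidence E
    """
    unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("evidence is impossible under every hypothesis")
    return {h: value / total for h, value in unnormalised.items()}


# Hypothetical competing explanations for a suspicious advertisement.
beliefs = {"trafficking": 0.2, "legitimate_agency": 0.7, "fraud": 0.1}

# Hypothetical evidence items, each giving P(evidence | hypothesis).
evidence_items = [
    # Contact number reused across several adverts.
    {"trafficking": 0.8, "legitimate_agency": 0.1, "fraud": 0.4},
    # No registered company found for the advertiser.
    {"trafficking": 0.6, "legitimate_agency": 0.3, "fraud": 0.5},
]

# Apply the evidence one item at a time; no hypothesis is discarded early.
for likelihoods in evidence_items:
    beliefs = update_beliefs(beliefs, likelihoods)

for hypothesis, probability in sorted(beliefs.items(), key=lambda kv: -kv[1]):
    print(f"{hypothesis}: {probability:.3f}")
```

Applying evidence item by item in this way assumes the items are independent given each hypothesis. A BBN removes that restriction by modelling the dependencies explicitly, which is where a dedicated tool becomes worthwhile.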
The Ontology (produced by Selex with input from BAE Systems) has these major sections:
a) Operational - aligned with the Adversaries Model and containing persons, places, relationships, events etc.
b) Process – for the Pirolli and Card business process, user identification etc.
c) Investigation – for shoebox data, searches, alerts etc.
A SCIIMS Ontology Specification document has been produced by BAE Systems. Innovative and beyond state of the art are:
a) Having such a direct link from the system understanding of SCIIMS into the formal definition of the ontology.
b) Extending the scope of the ontology beyond just the operational domain.
The programme has used OWLIM-Lite for performance and cost reasons and also to allow a database reasoner to be used. The reasoner allowed the significant simplification of some of the module code.
A structured investigation can be created using entities and the associations within the ontology for hypotheses, evidence file items, shoebox items etc.
The HCI Design Document written by DFI describes the HCI design concepts, the implementation of an analyst’s business processes and also includes screen examples.
There are some beyond the state of the art features of the HCI including:
a) Explicit use of an analyst’s business process (for SCIIMS this is Pirolli and Card).
b) The use of the Ontology to define the HCI for the Schema Viewer (the viewer is used to graphically show the hypotheses and evidence for an investigation). One main advantage of this is that this HCI is more generic. The HCI can be modified or configured by updating the Ontology without the need to change the HCI code (in fact this proved beneficial when enhancing the Prototype/Demonstration System). The ontology also provides common semantics across the system and therefore for this part of the HCI.
All of the modules described previously have been developed by the consortium members and integrated by Indra (in accordance with an Integration plan and Configuration Management plan both written by Indra) into a working Prototype/Demonstration System.
Installation instructions and basic tests to prove that each module works have been written by the consortium members. An overall set of installation instructions has been written for the Prototype/Demonstration system by Indra.
Indra maintained and published an Integration Test Document which identifies the tests carried out and any issues identified.
A video showing the features of the working, integrated Prototype/Demonstration system from an integration point of view has been produced by Indra (Annex I delivery 09).
To aid demonstrations and experiments it is possible to run the Prototype/Demonstration System in a virtual environment on a laptop.
Demonstrations and Experiments
Overall Approach to System Test, Experimentation and Demonstrations
A System Test, Experimentation and Demonstration (STED) Plan was written by BAE Systems to detail the approach and the work needed to prepare, conduct, analyse and report on system tests, demonstrations and experiments. It specifies at a high level the test data required. It also includes guidance for possible experimentation tests and experimental considerations of Measures of Merit (MoM) and Measures of Performance (MoP). The overall concept was to produce a test script based on the Adversaries Model Scenario and use it as a basis for the System Tests, Demonstrations (which use a subset of the tests), and system level experimentation (where “human in the loop” experiments are carried out). This combined approach is more efficient than having separate sets of tests and associated test data, while remaining effective for proving and assessing the capabilities of SCIIMS. The Adversaries Model also defines in detail the test data for the Prototype/Demonstration System relational databases, ontology and web pages.
The Prototype/Demonstration system relational databases and Ontology were set up by Indra and BAE Systems respectively with the criminal traces specified by the Adversaries Model, as well as noise data and innocent occurrences.
Dummy social (Facebook) and advertisement websites were created by the University of A’Coruna which were modelled upon the Adversaries Model criminal traces. These allow web crawling to be demonstrated without the need to be connected to the internet.
Adversaries Model Scenario Tests
There are three stages to the Adversaries Model investigation:
a) 1a - Is people trafficking a problem in SCIIMS Town, London?
b) 1b - What are the characteristics of the sex trafficking problem?
c) 1c - Who are the culprits?
The Pirolli and Card model identifies Foraging (for gathering information) and Sense Making stages. Separate test scripts were written by Indra and BAE Systems to cover stages 1a, 1b and 1c, with different scripts covering the foraging parts (e.g. web crawling and data mining) and the sense making. Additional files were produced as necessary (e.g. N3 files for Ontology updates, DNE files for Bayesian Belief Networks (BBN) and text files for SPARQL).
These scripts were run through, showing that the Prototype/Demonstration system could be used to carry out an investigation into people trafficking. A video (more than 3 hours long) has been produced of this. A forty-five-minute edited version is available showing the main capabilities. Both videos were produced by BAE Systems.
System tests were carried out to confirm that the system requirements have been met.
The main tests used for this are the Adversaries Model scenario tests. However, these were not sufficient for some of the requirements. Therefore existing module (“shoebox”) tests, installation tests or integration tests were used to prove these requirements.
Some of the System Requirements are design requirements. These were checked by confirming that the System Design met these requirements.
A Verification Cross Reference Index (VCRI) initially produced by Indra maps the system requirements to the tests, test results, and configuration as well as providing some other information about the tests.
The SCIIMS Prototype/demonstration system has undergone a number of enhancements to improve its capability to be used to carry out an investigation. Therefore the tests shown in the VCRI have been run at different times and using different versions of the system. Nevertheless the tests run and the results give a very good indication of which requirements have been successfully implemented.
A number of system requirements have not been implemented because they are not beyond state of the art or a research aim (the main ones are Named Entity Resolution (NER) and security).
The VCRI and an analysis of the results of the system test are documented in the SCIIMS Test Report (delivery No 10 Annex I) written by BAE Systems.
The SCIIMS Prototype/Demonstration system was demonstrated using a mixture of videos and by engaging the relevant experts.
A questionnaire was used to record the feedback, which was subsequently analysed. Questions were asked about each capability and about SCIIMS as a system. Generally there has been a lack of user group feedback throughout the project duration. To help overcome this issue, the questionnaire included for each capability the question “How does this compare to your existing system?”
Feedback has been received from the NBI and three subject matter experts. The participation in the demonstrations was not as wide ranging as hoped for but despite this useful feedback was received.
The results showed that overall all of the capabilities were useful, but that the system suffered from a lack of maturity, especially regarding the HCI. This was to be expected from a demonstration system.
The most useful capabilities were the Alerts, Graphical viewer, ACH statistical, Data integration and the Web crawler.
The responses to the comparison to the existing system question showed that some capabilities were not in existing systems; this was particularly true of the schema viewer and ACH statistical features. However if a comparable capability was available in an existing system the results were mixed.
The demonstration results and an analysis of them are documented in the Demonstration Report (delivery No 12 Annex I) written by BAE Systems.
There are two types of experiments:
a) “Shoe box” experiments (i.e. module level experiments) such as those already being carried out for the Web Crawling and Data Mining Research
b) System level experiments, which use all, or most, of the SCIIMS System e.g. “human in the loop” experiments
The “Shoe box” experiments have been carried out for the Web Crawling and Data Mining Research respectively. These are documented in the Mathematical Models and Proof of Concept volumes.
Due to the general lack of feedback from the user groups and time constraints it was not possible to establish existing baseline capabilities for the system level experiments. Therefore it was decided that the experimentation would be “human in the loop” experiments using user groups, subject matter experts or non-SCIIMS engineers, where the participants would run through a subset of the Adversaries Model script and answer questions in a detailed questionnaire.
The Questionnaire written by Denodo asked seven standard questions (each with a numerical score) about each capability and about SCIIMS as a system. The questions covered usefulness, ease of use, effectiveness, implementation and how the capability compares to existing systems. The last question was asked to overcome the lack of baseline information from user groups. The questionnaire also included comment boxes for additional information from the participants.
An Experimentation Script was written by Denodo consisting of a set of instructions for the participants to perform the experiments. These informed the participant which steps of the Adversaries Model script to run and when to answer particular questions.
A set of training slides was produced by Denodo explaining how to use the Prototype/Demonstration system so participants could use it unaided to carry out the experiments.
System level experiments were conducted by BAE Systems, Denodo and the University of A’Coruna using a subject matter expert from Detica and three non-SCIIMS engineers. The participation in the experiments was by no means as wide ranging as hoped for, but despite this some useful feedback was received.
The conclusions reached were broadly that SCIIMS could be a very useful system, but that its implementation needs improvement. This is to be expected for a prototype/demonstration system.
The feedback from the participants is summarised as follows:
a) Web Crawler - This feature scored very well in all categories. The results were considered relevant, but executing the crawler was a bit repetitive. It was suggested that the categories for each web page should be preloaded and that, during the crawling, not all URLs need to be seen.
b) Graphical SPARQL Query Tool - The information provided by this tool is considered useful. The interface and terms of querying are considered a bit complex and the integration with the rest of the system should be improved.
c) Data Integration - This feature scored well in almost all the categories, but not so well in “implementation”.
d) Classification of Advertisements - This feature is considered useful. The participants would like to see the process presented more graphically, and the interface could be improved. They found the system for evaluating advertisements confusing and suggested that a scale of “low”, “medium” and “high” might be a better way to indicate how suspicious an advertisement is.
e) Graphical Viewer - This feature scored very well in all the categories. It is considered very intuitive and easy to learn.
f) Schema Viewer - This feature scored well in almost all the categories, but it is not considered very easy to use or well implemented. It is difficult to perform some tasks and there could be problems when there is a lot of data to display. The information is easy to change and the visualisation is fine. It was suggested it could be useful to see different investigations at the same time.
g) ACH Statistical (BBN) - This feature scored well in almost all the categories, but not in effectiveness and implementation. It would need separate training to use, and it is very standalone and needs to be more integrated. The SME reported that Netica does not appear to be more powerful than Excel files, at least in these test cases.
h) Alerts - This feature has been considered useful, effective, easy to use, the information provided useful and well implemented.
i) Reports - This feature has been considered useful and the information provided also useful.
j) SCIIMS as a system - SCIIMS as a system scored well in all categories apart from “easy to use” and “implementation”. The participants reported that in its current state the different tools are not one hundred per cent integrated and that some User Interface options are missing. However the architecture is sound and the algorithms are powerful.
k) The Classification of Advertisements, Schema Viewer and ACH Statistical capabilities do not exist in the participants’ current systems.
The experimentation results and a fuller analysis of them are documented in the Experimentation Analysis Report (delivery No 13 Annex I) written by Denodo.
Innovations/Beyond State of the Art
The SCIIMS Project has resulted in a number of innovations and beyond state of the art developments. These are summarised as follows:
a) Web Crawling
i) Automatic (unsupervised) Web Crawling data extraction. An improved algorithm has been produced. With respect to previous systems in the literature, this new algorithm is able to deal with pages containing more complex data structures such as nested sub-lists, and it includes more advanced techniques for detecting and removing irrelevant data. These enhancements were needed to deal with the types of websites needed by the SCIIMS project, such as social networks websites, online classified advertisements, and job websites.
ii) Automatic Web Navigation Performance. Improved Algorithms for more efficient execution of automated web navigation sequences.
b) Data Mining.
i) Entity Resolution Algorithms which use new methods of extracting useful information from large datasets
ii) Improved and new tools for visual analytics. The collected data is represented as a network of heterogeneous entities. This representation is a new way of using collected law-enforcement related data.
c) Data virtualization technology for RDF / OWL
d) Ontology. Innovative and beyond state of the art are:
i) Extending the scope of the ontology beyond just the operational domain (to include the business process and HCI).
ii) The inclusion of a means of describing information that has uncertainty. It is necessary to include statements about the Operational World that are uncertain, typically through lack of concrete evidence.
e) The use of Goal Structured Notation (GSN) and probabilistic information to represent an investigation.
i) GSN is used to represent an investigation to show a hierarchy of findings, linking evidence to actionable conclusions
ii) In conventional investigations, an item of evidence is assessed by rating the credibility of the source and the credibility of the claim made by the source. This is retained. An innovation is to introduce another rating scheme for the hypotheses that the Analyst defines as part of the Investigation. The ratings are for the probability that the hypothesis is true, and the quality of evidence that supports the stated probability.
f) Information Management Tools. Tools, such as the ACH Module and Alerts, have been produced that exploit the reasoner and search features of the ontology. They are therefore simpler and more generic.
g) Adversaries Model. System Architecting has been used to capture our understanding of the people trafficking trade, with an example of a crime and how SCIIMS would be used to investigate it. The output from this System Architecting is the Adversaries Model. Innovations are that this model:
i) Specifies in detail the test data for demonstrations and experiments
ii) Provides an input for the specification of the Ontology.
h) HCI. There are some beyond the state of the art features of the HCI including:
i) Explicit use of an analyst’s business process (for SCIIMS this is Pirolli and Card).
ii) The use of the Ontology to define the HCI for the Schema Viewer (the viewer is used to graphically show the hypotheses and evidence for an investigation). The HCI can be modified or configured by updating the Ontology without the need to change the HCI code.
The majority of these innovations and beyond state of the art developments are described in the SCIIMS Innovation Report written by BAE Systems.
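The investigation structure described in item e) above, a hierarchy of findings that links evidence to conclusions, with ratings on both evidence and hypotheses, can be pictured as a small tree. The following Python sketch is a hypothetical data model. The class and field names, the rating scales and the sample content are assumptions made for illustration and are not taken from the SCIIMS ontology.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    description: str
    source_credibility: str   # conventional rating of the source (scale assumed)
    claim_credibility: str    # conventional rating of the claim (scale assumed)


@dataclass
class Hypothesis:
    statement: str
    probability: float        # analyst's probability that the hypothesis is true
    evidence_quality: str     # quality of the evidence behind that probability (scale assumed)
    evidence: List[Evidence] = field(default_factory=list)
    sub_hypotheses: List["Hypothesis"] = field(default_factory=list)


def outline(node: Hypothesis, depth: int = 0) -> List[str]:
    """Flatten the hierarchy into indented lines: each finding, then its evidence."""
    pad = "  " * depth
    lines = [
        f"{pad}{node.statement} "
        f"(p={node.probability:.2f}, evidence quality: {node.evidence_quality})"
    ]
    for item in node.evidence:
        lines.append(
            f"{pad}  - {item.description} "
            f"[source {item.source_credibility}, claim {item.claim_credibility}]"
        )
    for child in node.sub_hypotheses:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical investigation fragment.
root = Hypothesis(
    statement="Adverts A and B are placed by the same group",
    probability=0.7,
    evidence_quality="medium",
    sub_hypotheses=[
        Hypothesis(
            statement="Both adverts share a contact number",
            probability=0.9,
            evidence_quality="high",
            evidence=[Evidence("Crawled advert pages", "B", "2")],
        )
    ],
)
investigation_outline = outline(root)
print("\n".join(investigation_outline))
```

Keeping the two rating schemes on separate types mirrors the distinction drawn above: source and claim credibility belong to an item of evidence, while probability and evidence quality belong to a hypothesis.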
Potential Impact and societal implications
The expected impact of the FP7 Security Work Programme from the original call was:
Actions in this area will provide the adapted technology basis and relevant knowledge for security capabilities needed in this and also other mission(s), as required by integrating industry and/or (private and/or public) end users, while achieving a significant improvement with respect to performance, reliability, speed and cost. At the same time, actions will reflect the mutual dependency of technology, organisational dynamics, human factors, societal issues as well as related legal aspects. This will reinforce European industry’s potential to create important market opportunities and establish leadership, and it will ensure sufficient awareness and understanding of all relevant issues for the take up of their outcome (e.g. regarding harmonisation and standardisation, potential classification requirements, international cooperation needs, communication strategies etc.) as well as for further research needs with a view to future security work programmes.
Initial research conducted under the SCIIMS programme has confirmed that people trafficking and people smuggling is a very real and pressing problem both globally and within Europe. The aim of SCIIMS was to identify and develop capabilities that will assist European agencies to target the problem effectively.
SCIIMS has achieved this through a research, capability development, and experimentation programme, investigating both existing technologies and those being researched and developed. Much of the required existing technologies were provided by consortium members in the areas of Information Management/Information Exploitation (IM/IX) including intelligence analysis, passenger tracking, Human Computer Interfaces (HCI), data fusion, data integration, data mining, decision tools and web extraction.
SCIIMS allowed a team of approximately twenty people to carry out research, investigate and develop new technologies, or apply technologies in a novel way over a period of three years. The future economic benefits are hard to assess, but the market analysis in the Exploitation Plan and Route Map (delivery No 7 Annex I) showed there is a potential IM/IX market in the billions of Euros that is accessible to the SCIIMS related technologies. If SCIIMS were further developed it might provide employment in the order of several tens of people over a period of three to five years.
The Ethics Advisory Board (EAB) has advised the project to work in accordance with the basic legal norm of Directive 95/46/EC of the European Parliament and of the Council dated 24 October 1995 for data protection. This is considered to be the most demanding of legal obligations among the different national legislations of the countries represented within the consortium. The final report from the EAB stated that the SCIIMS project had respected and satisfied the ethical principles for the use of data. They also indicated that the legal and ethical issues for data do not preclude the use of a future deployed SCIIMS system.
The FP 7 funded SCIIMS Project has provided a valuable “free space” for carrying out research and investigating different and original technologies. Examples of this include the use of the Ontology within SCIIMS and web crawling research. By having a mixture of academia and small and medium enterprises from across Europe, the project has increased awareness of the IM/IX capabilities of other countries and their technologies. It has demonstrated how research work carried out by academia can be directly used. It has also shown the value of a pan-European team cooperating to achieve a common goal.
The SCIIMS Prototype/Demonstration system has integrated capabilities both for foraging for data and for Sense Making (which allows an investigator to select data and structure it in a logical way to support an investigation). Most existing systems have concentrated on foraging. If SCIIMS were taken forward into a product, together with some enhancements, particularly for information sharing and security (these are discussed in detail in the Exploitation Plan and Route Map (delivery No 7 Annex I)), the impact would be an improvement in investigating people trafficking and people smuggling and, ultimately, a reduction in these crimes. SCIIMS could be easily modified to cover other crime areas. Feedback from user groups has indicated that information is not shared or that there is a reluctance to share data. A set of deployed SCIIMS systems with the technology to enable controlled, secure information sharing could lead to a technology-led change in the culture of information sharing, leading to increased cooperation between crime agencies.
BAE Systems has designed and created a SCIIMS Website at www.sciims.eu. This website has been maintained. At the end of the project a short SCIIMS promotional video was produced by BAE Systems explaining the capabilities and work carried out, as well as ideas for future research. This can be accessed or downloaded from the website.
The website prompted communication from Idox, who were commissioned by Enterprise Ireland to put together a series of case studies on the most interesting projects to have been supported by FP7 in Ireland. Both BAE Systems and DFI provided information to Idox, and SCIIMS was included in the security of citizens catalogue on the FP7 Ireland website (www.fp7ireland.com).
Newsletter and Poster
A SCIIMS newsletter was produced by BAE Systems with contributions from all consortium members. It highlights the latest project news, research and achievements, as well as providing general information about the SCIIMS Programme. This has been distributed to the consortium members for onward transmission to user groups and other interested parties.
BAE Systems designed and produced an A0 size SCIIMS Poster. The poster shows the SCIIMS design considerations, highlights beyond state of the art areas and includes the development process and possible future exploitation.
Sztaki represented SCIIMS at the SRC 2011 conference in Warsaw. This included a poster session where the aforementioned SCIIMS poster was displayed.
BAE Systems gave a short presentation regarding our experiences with ethics during the SCIIMS Programme at the Ethical Issues in Security Research workshop held in Brussels.
Videos of the Prototype/Demonstration system have also been produced. These videos are:
a) A video produced by Indra showing how the various SCIIMS components work.
b) A three-hour video and a forty-five-minute summarised version showing the SCIIMS Prototype/Demonstration System being used to investigate the Adversaries Model people trafficking scenario.
Collaboration with other FP 7 Projects
At the Mid Term Review, REA recommended that the SCIIMS project contact the VIRTUOSO and CAPER projects with a view to sharing information. Contact was made with the coordinators of both projects asking if they were interested, and the CAPER project expressed an interest in future discussions. However, IPR issues were a barrier.
As recommended at the Final Review, the project has informed the FP 7 projects TELEIOS (which is producing a virtual observatory infrastructure for managing large amounts of satellite observation data) and RECOBIA (researching the reduction of cognitive biases in intelligence analysis) about the work that has been carried out on SCIIMS.
Exploitation Plan and Route Map
An Exploitation Plan and Route Map (delivery No 7 Annex I) has been produced. In an IM/IX context, it covers market analysis and trends, user group priorities, future technologies and products, and consortium member exploitation plans. It provides a road map for research, capabilities and technologies up to 2016, and a Quality Function Deployment (QFD) analysis showing the relative contribution of SCIIMS technologies to the user capabilities. It also covers market opportunities for business intelligence and trans-national information exchange. A number of potential areas for further development post SCIIMS have been identified during the research, design and development activities of SCIIMS.
The ethical and legal issues of the deployed system (also included in the exploitation plan) were assessed by the EAB and their conclusion was as follows:
“This analysis shows that the legal and ethical issues covering Personal Data, Individual Rights, Data Segregation and Data Security do not preclude the use of a deployed SCIIMS system, given that all the aforementioned regulations are respected.”
There are a number of possibilities for exploiting SCIIMS including:
a) Use of the IM/IX technology (data mining, data fusion and the assessment of competing hypotheses etc.).
b) Use of the criminal-domain ontology as a basis for a family of IM/IX products addressing the criminal domain.
c) Applying Ontologies and associated tools to other areas
d) Developing SCIIMS into a product
e) Dissemination of the Data Mining and Web Crawling research to the research community through journal publications and conferences.
f) A follow on FP7 SCIIMS Integration Project
g) Participation in other related FP7 projects
Consortium Member Exploitation
The consortium members are planning to exploit SCIIMS in a number of ways. The following section describes a number of these.
Denodo have incorporated the SCIIMS capabilities for handling RDF/Ontology contents into the Denodo Data Virtualization Platform. They plan to exploit this by:
a) Providing data services layers for related projects through Denodo's network of partners in the Intelligence and Home Security space.
b) Exploiting the results of the project in other solution areas the company is working on which will benefit from the new capabilities obtained as a result of the project:
i) Single View of Entities (Customer, Patient, Product, …) solutions
ii) Technology Watch and Business Intelligence in commercial and industrial sectors
iii) Real-time Balanced Scorecards for medium/large corporations.
Sztaki and the University of A’Coruna have disseminated, and are continuing to disseminate, Data Mining and Web Crawling research respectively through the publication of papers and at conferences.
BAE Systems has plans to include SCIIMS in an exhibition both to senior management within the company and to potential customers.
Indra is leading a submission for “SCIIMS 2” which is an FP7 integration project for “Big Data”. The majority of the existing SCIIMS consortium members are participating in this.