
A Scalable and Elastic Platform for Near-Realtime Analytics for The Graph of Everything

Periodic Reporting for period 1 - SMARTER (A Scalable and Elastic Platform for Near-Realtime Analytics for The Graph of Everything)

Reporting period: 2016-04-20 to 2018-04-19

The SMARTER project researched how to derive actionable information from the enormous amount of data generated by the Internet of Everything, so that data-driven strategies can be used to innovate, compete, and capture value from deep web and real-time information. The project targeted innovative research outcomes by addressing Big Dynamic Data Analytics requirements from three relevant aspects: variety, velocity, and volume. The project introduced the concept of the “Graph of Everything” (GoT) to deal with the issue of data variety in data analytics for Internet of Things (IoT) data. The Graph of Everything extends the RDF data model, which has been widely used for representing deep web data, to connect dynamic data from data streams generated by the IoT, e.g. sensor readings, with any knowledge base. The result is a single graph that acts as an integrated database serving analytical queries on a set of nodes/edges of the graph, the so-called analytical lens of everything. The dynamic data, represented as RDF data streams (or Semantic Streams in general), may contain valuable but perishable insights, which are only useful if they can be detected and acted on at the right time. Moreover, to derive such insights, the dynamic data needs to be correlated with various large datasets. Therefore, SMARTER has to deal with both the velocity and the volume requirements of analysing the GoT.
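The idea of connecting stream data with a knowledge base in one graph can be illustrated with a minimal Python sketch. It is not the SMARTER implementation: the `Triple` type, the `match` and `temperatures_in` helpers, and all `ex:` identifiers are hypothetical names chosen for illustration, and a plain in-memory set stands in for a real RDF store.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def match(graph: Iterable[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples matching a pattern; None acts as a wildcard."""
    for t in graph:
        if ((subject is None or t.subject == subject)
                and (predicate is None or t.predicate == predicate)
                and (obj is None or t.obj == obj)):
            yield t


# Static knowledge base (hypothetical identifiers).
knowledge_base = {
    Triple("ex:sensor1", "ex:locatedIn", "ex:Galway"),
    Triple("ex:sensor2", "ex:locatedIn", "ex:Berlin"),
}

# Dynamic data arriving from a stream (hypothetical readings).
stream_batch = [
    Triple("ex:sensor1", "ex:temperature", "21.5"),
    Triple("ex:sensor2", "ex:temperature", "18.0"),
]

# One graph holding both static and dynamic triples.
graph = set(knowledge_base) | set(stream_batch)


def temperatures_in(graph: Iterable[Triple], city: str):
    """Return sorted (sensor, reading) pairs for sensors located in `city`."""
    graph = list(graph)  # allow two passes over any iterable
    sensors = {t.subject
               for t in match(graph, predicate="ex:locatedIn", obj=city)}
    return sorted(
        (t.subject, float(t.obj))
        for t in match(graph, predicate="ex:temperature")
        if t.subject in sensors
    )
```

Calling `temperatures_in(graph, "ex:Galway")` correlates the stream readings with the static location facts through the shared sensor node, which is the kind of query a single integrated graph makes straightforward.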
To deal with fast update streams in conjunction with massive volumes of data, the project investigated how to create a native and adaptive solution for storing and processing the data of the Graph of Everything on cloud infrastructure. Whereas conventional relational database infrastructures are crumbling under the volume of this data and the tight constraints on schemata, SMARTER relied on a novel graph-based computing paradigm on a massively parallel processing architecture powered by elastic computing platforms, e.g. Apache Flink. Furthermore, because the RDF-based model places no conditions on the structure of the data that can be processed, the project's solution supports pre-computation over slowly updated data for continuous queries over highly dynamic stream data, which yielded a large performance gain in extensive experiments in a standalone environment. Velocity reflects the need for “just-in-time” processing of dynamic stream data, i.e. incoming stream data flows into processing pipelines where it must be matched against patterns in the static data.
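The benefit of pre-computing over slowly updated data can be sketched as follows. This is only an illustration of the general technique (building a hash index once and probing it per stream item), not the project's Flink-based engine; `build_static_index` and `continuous_join` are hypothetical names, and triples are modelled as plain tuples.

```python
from collections import defaultdict


def build_static_index(static_triples):
    """Pre-compute a subject -> [(predicate, object)] index once, up front."""
    index = defaultdict(list)
    for s, p, o in static_triples:
        index[s].append((p, o))
    return index


def continuous_join(stream, index):
    """For each incoming (subject, predicate, value) triple, probe the
    pre-computed index instead of rescanning the static data."""
    for s, p, v in stream:
        # .get avoids creating empty entries for unknown subjects.
        for static_p, static_o in index.get(s, ()):
            yield (s, p, v, static_p, static_o)
```

Because `continuous_join` is a generator, it can consume an unbounded iterator of stream triples; each arrival costs one dictionary lookup rather than a scan of the static dataset.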
Firstly, the project collected requirements to build the abstract data model and data processing operations, and to generalise stream-based analytical processing pipelines for different domains. A theoretical and experimental analysis of existing approaches and techniques was then conducted to identify the technical shortcomings and research problems in handling the abstracted and generalised data queries. The project used the RDF model as the basic abstraction to model all kinds of data collected from heterogeneous data sources. Each data item is represented as graph nodes and graph edges that form the Graph of Everything (GoT). The GoT also includes descriptions of the software processes and the learning and prediction models that process sensor readings or observations/events generated by interconnected things. On top of that, the GoT links the reasoning rules that represent semantic relationships among things and facts about them. To process RDF stream data in this form, the project proposed an execution framework that is abstract and flexible enough to plug in a new component or implement a new algorithm. In the backend, to make SMARTER elastically scalable, a native execution framework was built by extending my previously built CQELS engine with Apache Flink. The project also investigated how to integrate the machine learning operations provided by built-in components of Apache Flink and DeepLearning4J. In addition, the project studied how to improve scalability for edge devices. For the front end of SMARTER, some visual analytics tools were prototyped to enable diverse analytical tasks. The outcomes of these tasks are reflected in 17 publications within 2 years.
As on the Web, accessing and integrating information from large numbers of heterogeneous streaming sources under diverse ownership and control is a resource-intensive and cumbersome task without proper support. Such streaming data sources, generated by the data acquisition infrastructures of smart cities, social network applications, medical sensors, etc., are still dominated by static data silos, large warehousing systems that are often unmanageable, and rudimentary user interaction interfaces which trade intuitiveness for completeness. The effective exploitation of continuous data streams from multiple sources requires an infrastructure that supports the intense effort of enriching, linking, and correlating data streams with very large static data collections, while at the same time combining the results into increasingly complex data objects representative of realistic models of the world. When these data objects are represented as graphs or networks (i.e. sets of triples), complex processing pipelines can be described and implemented by combining the expressive power of declarative queries with high-level data-parallel programming paradigms, which in turn rely on elastic execution on a cloud infrastructure. Linked Stream Data employs the Linked Data model to provide graphs as the basic representation for stream data.
To answer this need, the project carried out studies to design algorithms and provide a solution based on the RDF data model that uses graphs as the basic representation for stream data while meeting the scalability and low-latency processing requirements of high-volume, highly dynamic data. The Graph of Everything is the graph representation of data generated by the Internet of Everything, which is regarded as an extension of the Internet of Things (IoT) in which smart devices, personal clouds, wearable technology, big data, and networking are all interconnected via the internet. The solution provides a declarative mechanism to filter, aggregate, enrich, and analyse a high throughput of data from multiple disparate live data sources, in any data format, in order to identify simple and complex patterns, visualise business in real time, detect urgent situations, and automate immediate actions. In SMARTER, the unified view of all the heterogeneous data sources provided by the RDF data model facilitates the transparent and cost-effective integration of analytical computing components based on a graph-based data model. To deal with the heterogeneity of dynamic data sources, the project proposed an approach for transforming data from a variety of formats and sources so that it is ready for further processing and high-level analytical operations. For scaling, the project investigated a workflow-based solution coupled with a declarative continuous query language over Linked Stream Data, so that a data analytics pipeline can be created interactively and visually. Along with the contribution of elastically scaling continuous analytical processing on the cloud infrastructure, the project also aims to make new contributions on the adaptive optimisation of multiple continuous queries deployed in distributed stream processing instances.
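The filter-and-aggregate step of such a pipeline can be sketched in a few lines of Python. This is an illustrative assumption rather than the project's continuous query language: the function name `tumbling_window_average`, the event layout `(timestamp_seconds, key, value)`, and the threshold filter are all hypothetical, and the sketch processes a finite batch of events instead of an unbounded stream.

```python
from collections import defaultdict


def tumbling_window_average(events, window_seconds, threshold):
    """Group events into fixed, non-overlapping time windows per key and
    return {(window_start, key): average} for averages above `threshold`.

    `events` is an iterable of (timestamp_seconds, key, value) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (window_start, key) -> [total, count]
    for ts, key, value in events:
        window_start = (ts // window_seconds) * window_seconds
        acc = sums[(window_start, key)]
        acc[0] += value
        acc[1] += 1
    return {
        k: total / count
        for k, (total, count) in sums.items()
        if total / count > threshold
    }
```

A declarative continuous query would express the same window, grouping, and filter as a query clause and leave the evaluation strategy to the engine; the sketch only makes the underlying computation explicit.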
SMARTER architecture for the Graph of Things