
Sustainable Data Lakes for Extreme-Scale Analytics

Periodic Reporting for period 1 - SmartDataLake (Sustainable Data Lakes for Extreme-Scale Analytics)

Reporting period: 2019-01-01 to 2020-06-30

"Modern enterprises are becoming increasingly data-driven and data-intensive, relying on data and analytics throughout the whole fabric of their business (strategic planning, sales, marketing, finance, operations) to make fact-based business decisions and to better analyse and understand business conditions. Faced with a multitude of data sources, the traditional approach is to squeeze the data into a data warehouse. This requires extensive Extract-Transform-Load (ETL) processes to filter, aggregate and transform data from the original sources to a target data store. It imposes a predefined format, schema and storage for the target data and, accordingly, a predefined set of rules for data ingestion. This is not flexible for investigating new sources and accommodating changes in existing ones. Moreover, ETL processes often take several hours to complete, introducing long waiting times and overhead between the points when new data becomes available and when data scientists can query and analyse it.

Responding to these needs of an open and dynamic world, data lakes have emerged as an alternative approach. A data lake is a raw data ecosystem in which large amounts of diverse structured, semi-structured and unstructured data can coexist in their natural formats and in various models. A data lake retains all data, including data kept only because it might become useful at some point in the future, rather than predefined parts of the data at predefined levels of granularity, known in advance to serve specific purposes. Data is retained in its natural, raw form, following a "schema on read" rather than a "schema on write" approach, and it is transformed only when a use for it arises. Whereas data warehouses can efficiently serve well-planned and anticipated business needs and operations, data lakes are the go-to place for so-called self-service analytics. Data scientists can directly tap into the data lake to analyse data from new sources, combine data of different types, come up with new business questions, test hypotheses and derive new insights and knowledge, enabling flexible, fast and ad hoc decision making.
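As a concrete illustration of the schema-on-read idea (a minimal, hypothetical Python sketch, not SmartDataLake code), raw records can be kept exactly as they arrive, and a schema is imposed only at the moment a particular analysis reads them:

```python
# Illustrative only: "schema on read" applies structure at query time,
# whereas "schema on write" would enforce it during ingestion (ETL).
import json
from io import StringIO

# Raw, heterogeneous records as they might land in a data lake (no fixed schema).
raw = StringIO(
    '{"id": 1, "amount": "19.90", "country": "DE"}\n'
    '{"id": 2, "amount": 5, "region": "EMEA"}\n'
)

def read_sales(lines):
    """Apply a schema only now, for this particular analysis."""
    for line in lines:
        rec = json.loads(line)
        yield {
            "id": int(rec["id"]),
            "amount": float(rec.get("amount", 0)),          # coerce mixed types on read
            "market": rec.get("country") or rec.get("region", "unknown"),
        }

print(sum(r["amount"] for r in read_sales(raw)))  # 24.9
```

The same raw records can later be read under a different schema for a different question, without re-running any ingestion pipeline.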

To this end, the overall goal of the SmartDataLake project is to design, develop and evaluate a novel framework for supporting extreme-scale analytics over Big Data Lakes, thus facilitating the journey from raw data to actionable insights. In particular, it offers a suite of tools for virtualized and adaptive data access; automated and adaptive data storage tiering; smart data discovery, exploration and mining; monitoring and assessing the impact of changes; and empowering the data scientist in the loop through scalable and interactive data visualizations.
SmartDataLake has successfully completed its first period, making progress towards all of the project's objectives. The main results achieved so far can be summarized as follows:
- We support data virtualization by integrating the Proteus query engine and the commercial system RAW. We have extended these systems to enable direct access to heterogeneous data, covering different data formats and models, and to allow queries to run directly on native data.
- We have developed an adaptive, self-optimizing Query Approximation Layer (QAL), built on top of the data virtualization layer, which speeds up big data analytics by providing fast, approximate answers. The current version of QAL supports an extended SQL syntax geared towards data exploration and visualization of the highly dynamic, heterogeneous data sets that typically reside in data lakes (a minimal sketch of the general sampling-based approximation idea is given after this list).
- We support accessing data both from local storage and from remote locations such as cloud storage. Our automated and adaptive data storage tiering stores local data efficiently and appropriately caches data accessed from cloud storage. We have developed a data tiering architecture that takes into account the frequency of data access and the properties of the available storage tiers when automatically allocating data to them.
- We have implemented functionalities for exploring diverse entities in heterogeneous information networks (HINs). We support various similarity join operators, as well as aggregate top-k similarity queries over textual, numerical and geospatial attributes (a simplified illustration of such a query is sketched after this list). In addition, we enable discovering pairs of similar subsequences in time series data. Moreover, we have implemented an entity resolution component that can identify and link nodes in a HIN that represent the same real-world entity. We also support ranking of entities represented as nodes in a HIN, as well as detection of top-k geospatial regions according to user-defined scoring functions.
- We have defined a Visual Analytics model, which specifies the interfaces and the interplay between the automated components of the SmartDataLake analysis pipeline and human sense-making. This includes a Visual Analytics engine, interfacing directly with the lower-level components of SmartDataLake, and the Visual Explorer, a front-end application that supports the human analyst by providing visualizations and interactions in a scalable and comprehensive user interface.
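To illustrate the general principle behind such approximate answers (a simplified, hypothetical sketch; it does not reflect the actual QAL implementation), an aggregate can be estimated from a uniform random sample together with an error bound derived from the central limit theorem:

```python
# Illustrative sketch of sampling-based approximate aggregation (not the QAL code):
# estimate SUM(value) from a uniform sample and attach a ~95% confidence interval.
import random
import statistics

random.seed(42)
population = [random.expovariate(1 / 100) for _ in range(1_000_000)]  # stand-in for a large column

def approximate_sum(values, sample_size=10_000, z=1.96):
    sample = random.sample(values, sample_size)
    n = len(values)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / sample_size ** 0.5
    return n * mean, z * n * stderr          # estimate and CLT-based error margin

est, margin = approximate_sum(population)
print(f"approx sum = {est:,.0f} ± {margin:,.0f}")
print(f"exact  sum = {sum(population):,.0f}")
```

Scanning only the sample instead of the full column is what makes the answer fast; the error margin shrinks as the sample grows.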
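The aggregate top-k similarity queries mentioned above can likewise be illustrated with a simplified sketch (hypothetical entities, weights and similarity functions; not the project's actual operators):

```python
# Illustrative sketch of an aggregate top-k similarity query over mixed attributes.
import heapq
import math

entities = [
    {"id": "a", "tags": {"bank", "finance"},   "revenue": 120.0, "loc": (47.37, 8.54)},
    {"id": "b", "tags": {"bank", "insurance"}, "revenue": 95.0,  "loc": (47.38, 8.53)},
    {"id": "c", "tags": {"retail"},            "revenue": 40.0,  "loc": (48.14, 11.58)},
]

def text_sim(a, b):                    # Jaccard similarity on token sets
    return len(a & b) / len(a | b) if a | b else 1.0

def num_sim(a, b, scale=100.0):        # decays with absolute difference
    return 1.0 / (1.0 + abs(a - b) / scale)

def geo_sim(a, b, scale=1.0):          # decays with Euclidean distance (in degrees)
    return 1.0 / (1.0 + math.dist(a, b) / scale)

def top_k_similar(query, k=2, weights=(0.5, 0.2, 0.3)):
    wt, wn, wg = weights
    def score(e):
        return (wt * text_sim(query["tags"], e["tags"])
                + wn * num_sim(query["revenue"], e["revenue"])
                + wg * geo_sim(query["loc"], e["loc"]))
    return heapq.nlargest(k, entities, key=score)

query = {"tags": {"bank"}, "revenue": 100.0, "loc": (47.37, 8.55)}
print([e["id"] for e in top_k_similar(query)])   # ids of the two most similar entities
```

Each attribute type gets its own similarity function, and a weighted sum combines them into a single score used for the top-k ranking.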
SmartDataLake empowers data scientists to perform extreme-scale analytics over sustainable data lakes. The project delivers novel results that go beyond the state of the art in several directions, including:
- Advanced, parallel and distributed query planning and optimization over virtualized data, enabling holistic optimizations across different data models and formats, powered by different types of adaptive access paths, specialized indexes and an expressive query language.
- Parallel and distributed approximate query answering techniques supported by different types of dynamically and adaptively constructed and composed data summaries (samples, histograms, sketches), accompanied by theoretically proven probabilistic guarantees on the approximation errors.
- Cost model for automated and adaptive data placement across different storage tiers, enabling different pricing/performance trade-offs, co-designed with novel techniques for efficient data analytics over cold storage (a simplified placement sketch follows this list).
- Scalable algorithms for multi-criteria attribute-based and link-based similarity search and exploration for multi-faceted entities in heterogeneous information networks.
- Scalable techniques for entity resolution in heterogeneous information networks, enhanced with multi-criteria attribute-based and link-based entity ranking.
- Link prediction and community detection algorithms for heterogeneous information networks exploiting attribute-based and path-based features.
- Algorithms for detecting and incrementally adapting to changes in newly collected data, efficiently updating entity attributes and relations in the information network, as well as any affected analysis results.
- Model for interactive and multi-faceted visual analytics focusing on guiding the analyst's attention to the most interesting and relevant findings, thus keeping her engaged in the task and increasing her productivity.
- Scalable and interactive visual analytics techniques tailored to specific types of data, including map-based visualizations for spatial data, time series visualizations for temporal data, and graph visualizations for network data, enhanced with visual interfaces for model and parameter selection and tuning.
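As a simplified illustration of cost-based placement across storage tiers (hypothetical tiers, prices and scoring; not the project's cost model), each dataset can be assigned to the tier that minimizes its estimated monthly cost given its size and access frequency:

```python
# Illustrative sketch of cost-based data placement across storage tiers.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    storage_cost_per_gb: float    # hypothetical monthly storage price
    access_cost_per_read: float   # hypothetical proxy for retrieval latency / egress cost

TIERS = [
    Tier("hot-ssd",    storage_cost_per_gb=0.20, access_cost_per_read=0.0001),
    Tier("warm-hdd",   storage_cost_per_gb=0.05, access_cost_per_read=0.001),
    Tier("cold-cloud", storage_cost_per_gb=0.01, access_cost_per_read=0.01),
]

def place(dataset_gb: float, reads_per_month: float) -> Tier:
    """Pick the tier with the lowest estimated monthly cost for this access pattern."""
    def monthly_cost(tier: Tier) -> float:
        return dataset_gb * tier.storage_cost_per_gb + reads_per_month * tier.access_cost_per_read
    return min(TIERS, key=monthly_cost)

print(place(500, reads_per_month=500_000).name)  # frequently read -> hot-ssd
print(place(500, reads_per_month=10).name)       # rarely read -> cold-cloud
```

In practice such a policy would be re-evaluated as access frequencies change, which is what makes the placement adaptive rather than static.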