Sustainable Data Lakes for Extreme-Scale Analytics

Periodic Reporting for period 2 - SmartDataLake (Sustainable Data Lakes for Extreme-Scale Analytics)

Reporting period: 2020-07-01 to 2021-12-31

Nowadays, both large enterprises and SMEs are becoming increasingly data-driven and data-intensive, relying on data and analytics throughout the whole fabric of their business (strategic planning, sales, marketing, finance, operations) to make fact-based decisions and to better analyse and understand business conditions. Faced with a multitude of data sources, the traditional approach is to squeeze the data into a data warehouse. This requires extensive Extract-Transform-Load (ETL) processes to filter, aggregate and transform data from the original sources into a target data store. It imposes a predefined format, schema and storage layout on the target data and, accordingly, a predefined set of rules for data ingestion. This offers little flexibility for investigating new sources or accommodating changes in existing ones. Moreover, ETL processes often take several hours to complete, introducing long waiting times between the point when new data becomes available and the point when data scientists can query and analyse it.

Responding to the needs of this open and dynamic world, data lakes have emerged as an alternative approach. A data lake is a raw data ecosystem in which large amounts of diverse structured, semi-structured and unstructured data can coexist in their natural formats and in various models. A data lake retains all data, including data that is kept only because it might be of use at some point in the future, as opposed to predefined parts of the data at predefined levels of granularity that are known in advance to serve specific purposes. Data is retained in its natural, raw form, following a "schema on read" rather than a "schema on write" approach, and it is transformed only when a use for it arises. Whereas data warehouses can efficiently serve well-planned and anticipated business needs and operations, data lakes are the go-to place for so-called self-service analytics. Data scientists can directly tap into the data lake to analyse data from new sources, combine data of different types, come up with new business questions, test hypotheses and derive new insights and knowledge, supporting flexible, fast and ad hoc decision making.
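To make the "schema on read" idea concrete, here is a minimal sketch (not project code), assuming a hypothetical lake directory of raw JSON and CSV files: the files are kept exactly as they arrive, and a caller-supplied schema is applied only at read time, so different analyses can read the same files with different schemas.

```python
import csv
import json
from pathlib import Path

# Hypothetical directory holding raw files exactly as they were ingested.
LAKE_DIR = Path("data_lake")

def read_with_schema(path, schema):
    """Apply a caller-supplied schema (field name -> type) only when reading."""
    if path.suffix == ".json":
        with path.open() as f:
            records = json.load(f)             # raw JSON, stored as-is
    elif path.suffix == ".csv":
        with path.open() as f:
            records = list(csv.DictReader(f))  # raw CSV, stored as-is
    else:
        return                                 # unknown format: skipped, not rejected
    for rec in records:
        # Project and convert only the fields this particular analysis needs.
        yield {field: cast(rec[field]) for field, cast in schema.items() if field in rec}

# Two analyses can read the same raw files with different (hypothetical) schemas.
sales_schema = {"order_id": str, "amount": float}
geo_schema = {"order_id": str, "lat": float, "lon": float}

for path in LAKE_DIR.glob("*.*"):
    for row in read_with_schema(path, sales_schema):
        print(row)
```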

To this end, the overall goal of the SmartDataLake project is to design, develop and evaluate a novel framework for supporting extreme-scale analytics over Big Data Lakes, thus facilitating the journey from raw data to actionable insights. In particular, it offers a suite of tools for virtualized and adaptive data access; automated and adaptive data storage tiering; smart data discovery, exploration and mining; monitoring and assessing the impact of changes; and empowering the data scientist in the loop through scalable and interactive data visualizations.
SmartDataLake has been successfully completed, fully achieving all its objectives. The main results of the project can be summarized as follows:

- We have delivered a data virtualization layer by integrating the Proteus query engine and the commercial system RAW. We have added features to these systems to enable direct access to heterogeneous data, covering different data formats and models, and allowing queries to run directly on native data. We have also developed an adaptive, self-optimizing Query Approximation Layer (QAL), built on top of the data virtualization layer, which helps speed up big data analytics by providing fast, approximate answers (a simplified sketch of approximate aggregation follows this list).

- We support accessing data from local storage or from different storage locations such as cloud storage. Our automated and adaptive data storage tiering allows local data to be stored efficiently and data accessed from cloud storage to be cached appropriately. We have developed a data tiering architecture that takes into account the frequency of data access and the properties of the available storage when automatically allocating data to different storage tiers (a minimal tiering sketch follows this list).

- We have implemented functionalities for linking, exploring and analysing different types of entities in heterogeneous information networks (HINs). We support various similarity join operators, as well as aggregate top-k similarity queries over textual, numerical and geospatial attributes. Moreover, we have implemented an entity resolution component that can identify and link nodes in a HIN that represent the same real-world entity. We also support ranking of entities represented as nodes in a HIN, as well as suggesting new links between entities and detecting overlapping or hierarchical communities of entities (a simplified top-k similarity sketch follows this list).

- We have implemented services for monitoring changes in the underlying data. These include functionalities for seasonality and change point detection in time series, as well as the modelling and tracking of evolution in groups of entities that change over time. In addition, we have designed algorithms for the dynamic maintenance of graph pattern queries, offering quick insights into what has changed in terms of the graph pattern whenever the data is updated (a minimal change point detection sketch follows this list).

- We have defined a Visual Analytics model, which specifies the interfaces and the interplay between the automated components of the SmartDataLake analysis pipeline and human sense-making. This includes a Visual Analytics engine, interfacing directly with the lower-level components of SmartDataLake, and the Visual Explorer, a front-end application that supports the human analyst by providing visualizations and interactions in a scalable and comprehensive user interface.
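Purely to illustrate the idea behind the Query Approximation Layer mentioned above (this is not the project's implementation), a uniform random sample can answer an aggregate query quickly while also reporting a probabilistic error bound:

```python
import math
import random

def approximate_sum(values, sample_fraction=0.01, z=1.96):
    """Estimate sum(values) from a uniform random sample.

    Returns (estimate, half_width): under a normal approximation, the exact
    sum lies within estimate +/- half_width with roughly 95% probability.
    """
    n = len(values)
    k = max(2, int(n * sample_fraction))
    sample = random.sample(values, k)
    sample_mean = sum(sample) / k
    sample_var = sum((v - sample_mean) ** 2 for v in sample) / (k - 1)
    estimate = n * sample_mean
    half_width = z * n * math.sqrt(sample_var / k)  # finite-population correction omitted
    return estimate, half_width

# Scanning a small sample instead of the full data trades exactness for speed.
data = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]
est, err = approximate_sum(data, sample_fraction=0.001)
print(f"approximate sum: {est:,.0f} +/- {err:,.0f} (exact: {sum(data):,.0f})")
```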
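A minimal sketch of the access-frequency side of the storage tiering described above; the tier names, capacities and workload are assumptions, and the project's component additionally considers the properties of the available storage:

```python
from collections import Counter

# Hypothetical tiers, ordered fastest/most expensive first, with an assumed
# capacity (number of datasets) per tier; None means unbounded.
TIERS = [("ssd", 1), ("hdd", 2), ("cloud", None)]

class TieringPolicy:
    def __init__(self):
        self.access_counts = Counter()

    def record_access(self, dataset_id):
        self.access_counts[dataset_id] += 1

    def placement(self):
        """Assign datasets to tiers: the most frequently accessed go to the fastest tier."""
        ranked = [d for d, _ in self.access_counts.most_common()]
        assignment, start = {}, 0
        for tier, capacity in TIERS:
            end = len(ranked) if capacity is None else min(len(ranked), start + capacity)
            for dataset_id in ranked[start:end]:
                assignment[dataset_id] = tier
            start = end
        return assignment

policy = TieringPolicy()
for dataset in ["orders", "orders", "orders", "logs", "logs", "archive"]:
    policy.record_access(dataset)
print(policy.placement())  # {'orders': 'ssd', 'logs': 'hdd', 'archive': 'hdd'}
```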
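The similarity operators above are described in the project's deliverables and publications; as a simplified illustration only (entity attributes, weights and similarity measures are assumptions), an aggregate top-k similarity query combining a textual and a numerical attribute might look like this:

```python
import heapq

def text_similarity(a, b):
    """Jaccard similarity over word sets, a simple stand-in for a textual measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0

def numeric_similarity(a, b, scale):
    """Similarity decaying linearly with the absolute difference, clipped at 0."""
    return max(0.0, 1.0 - abs(a - b) / scale)

def top_k_similar(query, entities, k, weights=(0.6, 0.4)):
    """Aggregate top-k query: weighted combination of per-attribute similarities."""
    scored = []
    for entity in entities:
        score = (weights[0] * text_similarity(query["name"], entity["name"])
                 + weights[1] * numeric_similarity(query["revenue"], entity["revenue"], scale=1e6))
        scored.append((score, entity["name"]))
    return heapq.nlargest(k, scored)

# Hypothetical company entities and query.
companies = [
    {"name": "Acme Data Analytics", "revenue": 900_000},
    {"name": "Acme Analytics GmbH", "revenue": 1_200_000},
    {"name": "Globex Logistics", "revenue": 950_000},
]
query = {"name": "Acme Analytics", "revenue": 1_000_000}
print(top_k_similar(query, companies, k=2))
```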
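As a minimal illustration of the change point detection mentioned above (not the project's method), a level shift in a time series can be flagged when the mean of the upcoming window deviates from the preceding window by more than a threshold:

```python
from statistics import mean

def detect_change_points(series, window=5, threshold=5.0):
    """Flag indices where the mean of the next `window` values differs from the
    mean of the previous `window` values by more than `threshold`."""
    change_points = []
    for i in range(window, len(series) - window + 1):
        before = mean(series[i - window:i])
        after = mean(series[i:i + window])
        if abs(after - before) > threshold:
            change_points.append(i)
    return change_points

# A level shift around index 10 is reported at the positions where the
# sliding windows straddle the shift.
series = [10, 11, 9, 10, 10, 11, 10, 9, 10, 11,
          20, 21, 19, 20, 20, 21, 20, 19, 20, 21]
print(detect_change_points(series))  # [8, 9, 10, 11, 12]
```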

Overall, our work in the project has resulted in more than 35 scientific publications and 11 open source software components.
SmartDataLake empowers data scientists to perform extreme-scale analytics over data lakes, reducing the time and effort required for extracting insights from raw data. The project has delivered novel results that go beyond the state of the art, including:
- Query planning and optimization over virtualized data, enabling holistic optimizations across different data models and formats, powered by different types of adaptive access paths.
- Approximate query processing techniques supported by different types of dynamically and adaptively constructed data summaries, accompanied by theoretically proven probabilistic guarantees on the approximation error.
- Automated and adaptive data placement across different storage tiers, enabling different pricing/performance trade-offs.
- Scalable algorithms for multi-criteria attribute-based and link-based similarity search and exploration for multi-faceted entities in heterogeneous information networks.
- Scalable techniques for entity resolution in heterogeneous information networks, enhanced with multi-criteria attribute-based and link-based entity ranking.
- Link prediction and community detection algorithms for heterogeneous information networks exploiting attribute-based and path-based features (a simplified illustration of the path-based idea follows this list).
- Algorithms for detecting and incrementally adapting to changes in newly collected data.
- Model for interactive and multi-faceted visual analytics focusing on guiding the analyst's attention to the most interesting and relevant findings.
- Scalable and interactive visual analytics techniques tailored to different types of data, including geospatial, time series, and graph data.
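As a simplified illustration of the path-based idea referenced in the list above (not the project's algorithm), candidate links between authors in a small, hypothetical heterogeneous network can be scored by counting author-paper-author meta-path instances, i.e. shared papers:

```python
from itertools import combinations

# Hypothetical heterogeneous network: author nodes linked to paper nodes.
author_papers = {
    "alice": {"p1", "p2", "p3"},
    "bob":   {"p2", "p3"},
    "carol": {"p3", "p4"},
    "dave":  {"p5"},
}

def score_candidate_links(author_papers):
    """Score each author pair by the number of author-paper-author meta-path
    instances connecting them (here: the number of shared papers)."""
    scores = {}
    for a, b in combinations(sorted(author_papers), 2):
        shared = author_papers[a] & author_papers[b]
        if shared:
            scores[(a, b)] = len(shared)
    # Higher scores suggest more likely (or stronger) author-author links.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

for pair, score in score_candidate_links(author_papers):
    print(pair, score)  # ('alice', 'bob') 2, ('alice', 'carol') 1, ('bob', 'carol') 1
```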