CORDIS - EU research results

Enabling Data Enrichment Pipelines for AI-driven Business Products and Services

Periodic Reporting for period 1 - enRichMyData (Enabling Data Enrichment Pipelines for AI-driven Business Products and Services)

Reporting period: 2022-10-01 to 2023-09-30

High-quality, rich and meaningful data are crucial to the successful implementation of Artificial Intelligence (AI) and Big Data Analytics (BDA) solutions. The process of delivering the required data to feed into AI and BDA models is costly, difficult, and often limited by the availability of data and skills. It is well known that up to 80% of the effort spent in AI and BDA projects is dedicated to ensuring data is fit for purpose. Activities are required to discover, understand, select, clean, transform, and integrate data from a variety of sources in such a way that the data can be fed into the modeling phase. Such activities result in enriched data that eventually improves the quality of downstream BDA and AI applications. The data enrichment process is implemented by specifying, deploying, and executing data enrichment pipelines over data that can be structured, semi-structured or unstructured, in large amounts, and from static or streaming sources. While techniques exist to cover different enrichment operations such as data cleaning, linking, feature extraction, classification and semantic annotation, the lack of comprehensive approaches and established tools dedicated to data enrichment makes the definition, implementation, and operation of enrichment pipelines difficult for many organizations seeking to improve their BDA and AI applications.
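
As a purely illustrative sketch of what "specifying and executing an enrichment pipeline" can mean in practice, the Python fragment below chains a cleaning step and a linking step over a sequence of records. It is not taken from the enRichMyData toolbox: the function names (`clean`, `link`, `run_pipeline`), the record layout, and the `REFERENCE` lookup table are all assumptions made for this example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]
Step = Callable[[Record], Record]

# Hypothetical reference data; a real pipeline would query an external source.
REFERENCE: Dict[str, Dict[str, str]] = {
    "oslo": {"country": "Norway"},
    "milan": {"country": "Italy"},
}


def clean(record: Record) -> Record:
    """Cleaning step: trim surrounding whitespace from string values."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def link(record: Record) -> Record:
    """Linking step: add reference attributes when the city is known."""
    match = REFERENCE.get(str(record.get("city", "")).lower())
    return {**record, **match} if match else record


def run_pipeline(records: Iterable[Record], steps: List[Step]) -> Iterator[Record]:
    """Apply each step in order to each record, yielding results lazily.

    Because records are processed one at a time, the same function works
    for a static list and for a streaming source (any iterable).
    """
    for record in records:
        for step in steps:
            record = step(record)
        yield record


raw = [{"city": " Oslo "}, {"city": "Paris"}]
enriched = list(run_pipeline(raw, [clean, link]))
```

In this sketch the first record gains a `country` attribute from the reference data, while the second passes through cleaned but unlinked because no reference entry exists for it.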

The objective of enRichMyData is to develop an open software toolbox – the enRichMyData toolbox – comprising practical, robust and scalable components to support organizations in enriching their data with reference data they may have limited knowledge of, as well as supporting data providers in making their data reusable and available in data enrichment processes. The aim of the toolbox is to lower the technological entry barriers by providing support for the definition of highly scalable and replicable data enrichment pipelines through a set of tools and infrastructure services related to capabilities needed during the lifecycle of enrichment pipelines. The toolbox will make the data enrichment process accessible to a wider set of stakeholders by reducing the level of expertise required and enhancing the level of tool support.
For Period 1, the actions towards meeting this objective have been mostly centred around the implementation of the enRichMyData toolbox, focusing on an extensive, structured, and diligent process of gathering requirements.
The process consisted of an interview-based analysis of the Business Cases, paying close attention to current data management processes as a baseline for the improvements and innovations that adopting the enRichMyData toolbox will enable. The business case requirements have been documented in Deliverable D4.1.
In addition, an extensive analysis of the state of the art in research on data enrichment was carried out, together with an overview of the state of practice regarding tools for data enrichment both within and outside the enRichMyData consortium. The results of these analyses are documented in Deliverable D4.1.
Based on this work, initial versions of the enRichMyData tools and the toolbox have been designed and implemented, resulting in Deliverables D2.1 and D3.1.
The enRichMyData toolbox is intended to be a flexible and easy-to-use framework comprising practical, robust and scalable components to support organizations in enriching their data with reference data they may have limited knowledge of, as well as supporting data providers in making their data reusable and available in data enrichment processes. The tool categories in the toolbox are DiscoverR, ResourcR, WrappR, CleanR, LinkR, StructR, ClassifiR, StreamR, GreenR, ScalR and ReusR.