High-quality, rich, and meaningful data are crucial to the successful implementation of Artificial Intelligence (AI) and Big Data Analytics (BDA) solutions. Delivering the data required to feed AI and BDA models is costly and difficult, and is often constrained by the limited availability of both data and skills. It is widely reported that up to 80% of the effort spent in AI and BDA projects is dedicated to ensuring that data are fit for purpose. This effort covers the activities needed to discover, understand, select, clean, transform, and integrate data from a variety of sources so that the data can be fed into the modeling phase. Such activities result in enriched data that eventually improve the quality of downstream BDA and AI applications. The data enrichment process is implemented by specifying, deploying, and executing data enrichment pipelines over data that can be structured, semi-structured, or unstructured, that come in large amounts, and that originate from static or streaming sources. While techniques exist for individual enrichment operations such as data cleaning, linking, feature extraction, classification, and semantic annotation, the lack of comprehensive approaches and established tools dedicated to data enrichment makes the definition, implementation, and operation of enrichment pipelines difficult for many organizations seeking to improve their BDA and AI applications.
The objective of enRichMyData has been to develop an open software toolbox – the enRichMyData toolbox – comprising practical, robust, and scalable components that support organizations in enriching their data with reference data they may have limited knowledge of, and that support data providers in making their data reusable and available for data enrichment processes. The aim of the toolbox was to lower technological entry barriers by supporting the definition of highly scalable and replicable data enrichment pipelines through a set of tools and infrastructure services that cover the capabilities needed throughout the lifecycle of enrichment pipelines. The toolbox has made the data enrichment process accessible to a wider set of stakeholders by reducing the level of expertise required and increasing the level of tool support.