Data lakes and data spaces comprise large amounts of data of various types, schemas, models, languages, and formats, drawn from diverse sources and of varying quality. Finding the right data is therefore challenging. Data discovery typically relies on metadata, which is often missing or incomplete. High-quality metadata requires significant manual curation, which is costly and cannot scale to constantly increasing data volumes; it thus becomes a bottleneck for data sharing and reuse, and a barrier to adopting data-driven AI technologies. Heterogeneity is both structural (different schemas and formats) and semantic (different vocabularies), so mechanisms for automatically mapping and translating between different models and representations are required. Information about the same entity is often fragmented across multiple, diverse sources and needs to be linked and aligned. For instance, entities extracted from unstructured sources appear with different name variations (e.g., official vs. commercial names of companies or products), with spelling errors, or in different languages (e.g., food incident reports from different countries), while geospatial and time series data coming from different sources have different resolutions (e.g., satellite data vs. data from field sensors). Finally, deep learning requires large amounts of labeled instances to train models robustly (e.g., to automatically detect and classify crops in satellite images), and AI systems require semantically annotated data from their environment to make reliable decisions and take appropriate actions. Data thus needs to be AI-ready. However, data annotation and labeling require domain knowledge and expertise, implying a high cost in time and effort for domain experts. The resulting lack of training datasets prevents many applications from benefiting from AI.
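To make the entity linking challenge concrete, the following is a minimal sketch of grouping name variations and misspellings of the same company. It is illustrative only and not a STELAR component: the company names, the list of legal suffixes, the 0.85 threshold, and the helper names `normalize`, `similarity`, and `link_mentions` are all assumptions, and Python's standard `difflib` stands in for whatever matching technique a real system would use.

```python
from difflib import SequenceMatcher

# Hypothetical legal suffixes to ignore when comparing company names.
LEGAL_SUFFIXES = {"inc", "ltd", "gmbh", "sa", "co", "corp"}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop legal suffixes."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower()
    )
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_mentions(mentions, threshold=0.85):
    """Greedy clustering: add each mention to the first cluster whose
    first member is similar enough, otherwise start a new cluster."""
    clusters = []
    for mention in mentions:
        for cluster in clusters:
            if similarity(mention, cluster[0]) >= threshold:
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters


if __name__ == "__main__":
    # Made-up mentions: a suffix variant, a case variant, and a misspelling.
    mentions = ["Acme Foods Ltd.", "ACME Foods", "Acme Fods Ltd", "Borealis Dairy GmbH"]
    for cluster in link_mentions(mentions):
        print(cluster)
```

A string-similarity baseline of this kind handles spelling variants but not official vs. commercial names or cross-language mentions, which typically require background knowledge or learned representations.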
The STELAR project designs, develops, evaluates, and showcases an innovative Knowledge Lake Management System (KLMS) that supports a holistic approach to FAIR (Findable, Accessible, Interoperable, Reusable) and AI-ready (high-quality, reliably labeled) data. The STELAR KLMS makes it possible to (semi-)automatically turn a raw data lake into a knowledge lake by: (a) enhancing the data lake with a knowledge layer; and (b) developing and integrating a set of data management tools and workflows. The KLMS combines human-in-the-loop and automatic approaches, leveraging the background knowledge of domain experts while minimizing their involvement. An organization, such as a data-intensive SME or the operator of a data marketplace, can use the STELAR KLMS to increase the readiness of its data assets for use in AI applications and for sharing and exchange within a common data space.
The STELAR KLMS is pilot-tested in diverse, real-world use cases in the agrifood domain, one of the domains of strategic societal and economic importance identified in the European Strategy for Data. The food supply chain covers all stages from production to transport, distribution, marketing, and consumption. The agrifood domain therefore involves various stakeholders, including producers, advisors, machinery manufacturers, processing actors, inspectors, certification authorities, insurance companies, and governmental agencies, all of whom have an interest in, or even a legal obligation to, exchange and share data. The project conducts three pilots, covering different stages of the food chain, involving and combining different types of data, and addressing different stakeholders and user needs: (1) risk prevention in food supply lines, integrating worldwide food-safety-related data sources; (2) early crop growth predictions, integrating current and historical satellite, hyperspectral, meteorological, and synthetic data; (3) timely precision farming interventions, integrating different types of sensor data from the field.