
Spatio-TEmporal Linked data tools for the AgRi-food data space

Periodic Reporting for period 1 - STELAR (Spatio-TEmporal Linked data tools for the AgRi-food data space)

Reporting period: 2022-09-01 to 2024-02-29

Data lakes and data spaces comprise large amounts of data of various types, schemas, models, languages, and formats, drawn from diverse sources and of varying quality. Finding the right data is therefore challenging. Data discovery typically relies on metadata, which is often missing or incomplete. High-quality metadata requires a significant amount of manual curation, which is costly and cannot scale to constantly increasing data volumes; it thus becomes a bottleneck for data sharing and reuse, and a barrier to adopting data-driven AI technologies.

Heterogeneity is both structural (different schemas and formats) and semantic (different vocabularies), so mechanisms for automatically mapping and translating between different models and representations are required. Moreover, information about the same entity is often fragmented across multiple, diverse sources and needs to be linked and aligned. For instance, entities extracted from unstructured sources may appear under different name variations (e.g. official vs. commercial names of companies or products), with spelling errors, or in different languages (e.g. food incident reports from different countries); geospatial and time series data coming from different sources have different resolutions (e.g. satellite data vs. data from field sensors).

Finally, deep learning requires large numbers of labeled instances to train robust models (e.g. to automatically detect and classify crops in satellite images), and AI systems require semantically annotated data from their environment to make reliable decisions and take appropriate actions. Data therefore needs to be AI-ready. However, data annotation and labeling require domain knowledge and expertise, implying a high cost in time and effort for professional domain experts. The resulting lack of training datasets prevents many applications from benefiting from AI.

The STELAR project will design, develop, evaluate, and showcase an innovative Knowledge Lake Management System (KLMS) to support and facilitate a holistic approach for FAIR (Findable, Accessible, Interoperable, Reusable) and AI-ready (high-quality, reliably labelled) data. The STELAR KLMS will make it possible to (semi-)automatically turn a raw data lake into a knowledge lake by: (a) enhancing the data lake with a knowledge layer; and (b) developing and integrating a set of data management tools and workflows. The KLMS will combine both human-in-the-loop and automatic approaches, leveraging the background knowledge of domain experts while minimizing their involvement. An organization, such as a data-intensive SME or the operator of a data marketplace, will be able to use the STELAR KLMS to increase the readiness of its data assets for use in AI applications and for being shared and exchanged within a common data space.
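As a minimal illustration of what a knowledge layer over a raw data lake can look like, the sketch below attaches semantic annotations to a dataset and one of its columns as RDF triples using the rdflib Python library. The namespaces, classes, and properties (e.g. EX.Dataset, AGRO.SoilTemperature) are hypothetical placeholders, not the project's actual vocabulary.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS, DCTERMS

# Hypothetical namespaces; the actual STELAR vocabulary is not reproduced here.
EX = Namespace("http://example.org/knowledge-lake/")
AGRO = Namespace("http://example.org/agrifood-ontology/")

g = Graph()
g.bind("ex", EX)
g.bind("agro", AGRO)

# A raw file in the data lake, described as a dataset entity...
dataset = EX["dataset/field-sensors-2023"]
g.add((dataset, RDF.type, EX.Dataset))
g.add((dataset, DCTERMS.title, Literal("Field sensor readings 2023")))
g.add((dataset, DCTERMS.format, Literal("text/csv")))

# ...and one of its columns linked to a domain concept, which is what makes
# the data discoverable by meaning rather than by column name alone.
column = EX["dataset/field-sensors-2023/column/soil_temp"]
g.add((column, RDF.type, EX.Column))
g.add((dataset, EX.hasColumn, column))
g.add((column, RDFS.label, Literal("soil_temp")))
g.add((column, EX.refersTo, AGRO.SoilTemperature))

print(g.serialize(format="turtle"))
```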

The STELAR KLMS will be pilot-tested in diverse, real-world use cases in the agrifood data space, one of the data spaces of strategic societal and economic importance identified in the European Strategy for Data. The food supply chain covers all stages from production to transport, distribution, marketing, and consumption. The agrifood data space therefore involves various stakeholders, including producers, advisors, machinery manufacturers, processing actors, inspectors, certification authorities, insurance companies, and governmental agencies, all of which have an interest in sharing and exchanging data, or even a legal obligation to do so. We will conduct three pilots, covering different stages of the food chain, involving and combining different types of data, and addressing different stakeholders and user needs: (1) Risk prevention in food supply lines, integrating worldwide food safety related data sources; (2) Early crop growth predictions, integrating current and historical satellite, hyperspectral, meteorological and synthetic data; (3) Timely precision farming interventions, integrating different types of sensor data from the field.
To improve data discovery, we have developed a data profiler, designed and implemented new types of data synopses, set up a data catalog, designed a metadata schema for datasets and workflows, and designed mappings that allow us to transform our metadata repository into a virtual knowledge graph.
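As a rough illustration of the kind of per-column metadata such a profiler can collect, the sketch below computes basic statistics with pandas. The function name, the chosen statistics, and the example data are ours for illustration only; the actual STELAR profiler and metadata schema are richer and are not reproduced here.

```python
import pandas as pd

def profile_dataset(df: pd.DataFrame) -> dict:
    """Collect lightweight per-column metadata for catalog registration."""
    profile = {"num_rows": len(df), "columns": {}}
    for name, col in df.items():
        entry = {
            "dtype": str(col.dtype),
            "missing_ratio": float(col.isna().mean()),
            "distinct_count": int(col.nunique(dropna=True)),
        }
        if pd.api.types.is_numeric_dtype(col):
            # Simple numeric summaries; real synopses go beyond min/max/mean.
            entry.update(min=float(col.min()), max=float(col.max()), mean=float(col.mean()))
        profile["columns"][name] = entry
    return profile

# Hypothetical example input.
df = pd.DataFrame({
    "field_id": ["A1", "A2", "A3"],
    "soil_temp": [12.4, 13.1, None],
})
print(profile_dataset(df))
```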

To improve data linking, we have developed a library that offers several algorithms for schema and entity matching. We have made several improvements in terms of efficiency and scalability, and we have conducted experimental analyses comparing several pre-trained language models on various benchmark datasets. We have worked on fusing data from multiple satellite sources with different characteristics. We have also developed an efficient and customizable library for discovering complex correlations.
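The sketch below illustrates the basic idea behind entity matching for name variations, using a crude normalisation step and string similarity from Python's standard library. It is a deliberately naive baseline under our own assumptions: the actual STELAR library relies on more advanced blocking, similarity measures, and pre-trained language models, and the company names shown are invented examples.

```python
from difflib import SequenceMatcher

def normalise(name: str) -> str:
    # Absorb case and punctuation differences before comparing names.
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace()).strip()

def match_entities(left, right, threshold=0.7):
    """Return candidate pairs whose normalised names are similar enough."""
    matches = []
    for a in left:
        for b in right:  # a real implementation would use blocking/indexing instead
            score = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
            if score >= threshold:
                matches.append((a, b, round(score, 3)))
    return matches

official = ["ACME Foods Ltd.", "Green Valley Dairy"]
commercial = ["Acme Foods", "GreenValley Dairy Co"]
print(match_entities(official, commercial))
```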

To increase the AI-readiness of data, we have made progress on several domain-specific tasks. We have addressed the problem of food entity extraction from unstructured data from several sources, improving the results through bias detection and data augmentation. With respect to satellite images, we have designed and examined methods for field segmentation and for crop classification.
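To illustrate one simple form of data augmentation for entity extraction, the sketch below replaces an annotated food mention with a synonym while keeping the BIO labels aligned with the replacement. The gazetteer, label scheme, and example sentence are hypothetical placeholders; the project's actual augmentation and bias detection methods are more sophisticated.

```python
import random

# Toy gazetteer of food-term variants (assumption); the project uses curated
# food vocabularies that are not reproduced here.
FOOD_SYNONYMS = {
    "milk": ["raw milk", "cow milk"],
    "wheat": ["durum wheat", "soft wheat"],
}

def augment_sentence(tokens, labels, seed=42):
    """Produce an extra training example by swapping a food mention for a
    synonym, keeping BIO labels aligned with the (possibly longer) span."""
    rng = random.Random(seed)
    new_tokens, new_labels = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "B-FOOD" and tok.lower() in FOOD_SYNONYMS:
            replacement = rng.choice(FOOD_SYNONYMS[tok.lower()]).split()
            new_tokens.extend(replacement)
            new_labels.extend(["B-FOOD"] + ["I-FOOD"] * (len(replacement) - 1))
        else:
            new_tokens.append(tok)
            new_labels.append(lab)
    return new_tokens, new_labels

tokens = ["Recall", "of", "contaminated", "milk", "in", "Austria"]
labels = ["O", "O", "O", "B-FOOD", "O", "B-LOC"]
print(augment_sentence(tokens, labels))
```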

Towards pilot testing, we have specified several use cases, designed a first version of the architecture for the KLMS Platform, and made progress on integrating and deploying the KLMS Platform and Tools.
Compared to data catalogs and ML tracking tools, the KLMS platform collects more fine-grained metadata, links metadata between datasets and workflows, and provides more advanced functionalities for search, comparison and ranking. Moreover, the KLMS tools offer several improvements compared to the state of the art, such as new types of synopses for data summarization, new techniques for entity linking, novel algorithms that enable multivariate correlation discovery, and more sophisticated methods for bias detection and data augmentation.
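As a point of reference for the correlation functionality, the sketch below performs a brute-force pairwise Pearson correlation search over a few synthetic series. This is only an illustrative baseline under our own assumptions; the STELAR library discovers multivariate correlations beyond pairs and uses more efficient algorithms than this exhaustive scan.

```python
import itertools
import numpy as np

def correlated_pairs(series: dict, threshold: float = 0.8):
    """Exhaustively score variable pairs by absolute Pearson correlation."""
    hits = []
    for a, b in itertools.combinations(sorted(series), 2):
        r = float(np.corrcoef(series[a], series[b])[0, 1])
        if abs(r) >= threshold:
            hits.append((a, b, round(r, 3)))
    return sorted(hits, key=lambda t: -abs(t[2]))

# Synthetic example data: soil temperature is constructed to track air temperature.
rng = np.random.default_rng(0)
air_temp = rng.normal(20, 3, 200)
series = {
    "air_temp": air_temp,
    "soil_temp": 0.8 * air_temp + rng.normal(0, 0.5, 200),
    "rainfall": rng.normal(5, 2, 200),
}
print(correlated_pairs(series))
```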