A novel synthetic data generation platform that produces private, secure and robust synthetic data for AI use cases

Project information

SydAi

Grant agreement ID: 101218531

DOI

10.3030/101218531

EC signature date: 20 June 2025

Start date: 1 April 2025

End date: 31 March 2027

Funded under

The European Innovation Council (EIC)

Total cost

No data

EU contribution

€ 2 080 085,71

Coordinated by

AINDO SPA
Italy

Periodic Reporting for period 1 - SydAi (A novel synthetic data generation platform that produces private, secure and robust synthetic data for AI use cases)

Reporting period: 2025-04-01 to 2026-03-31

SYDAI (Grant Agreement #101218531, Horizon Europe EIC Accelerator) develops a privacy-preserving synthetic data generation platform. Organisations handling sensitive personal data — in healthcare, finance, and AI development — face a fundamental tension between data utility and privacy: sharing or reusing real data risks privacy violations; withholding it blocks innovation.
SYDAI addresses this by generating high-fidelity synthetic datasets that statistically mirror real data without exposing individual records. The project pursues three objectives: (1) generation of complex multi-table relational data using state-of-the-art generative models; (2) rigorous, verifiable privacy guarantees backed by novel metrics; and (3) automated detection and masking of personally identifiable information.

The team began with a comprehensive review of existing approaches to synthetic data generation, identifying key limitations in handling relational data structures — datasets where information is distributed across multiple linked tables, as in hospital or financial records. This review informed the design of a novel graph-based method for representing and generating relational data, which was subsequently prototyped using a large language model and shown to match or outperform existing approaches on standard benchmarks.
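The report does not describe the graph-based method in detail. As a rough illustration of what representing linked tables as a graph can mean, the sketch below turns each row into a node and each foreign-key reference into an edge. The table names, the `id` primary-key column, and the `build_row_graph` helper are all hypothetical and are not taken from the SYDAI platform.

```python
from collections import defaultdict


def build_row_graph(tables, foreign_keys):
    """Represent a multi-table dataset as a graph of rows.

    tables: {table_name: [row_dict, ...]}, each row has an "id" key (assumed).
    foreign_keys: [(child_table, fk_column, parent_table), ...].
    Returns (nodes, edges): nodes maps (table, id) to the row,
    edges maps each node to the set of nodes it is linked to.
    """
    nodes = {}
    for name, rows in tables.items():
        for row in rows:
            nodes[(name, row["id"])] = row

    edges = defaultdict(set)
    for child, column, parent in foreign_keys:
        for row in tables[child]:
            src = (child, row["id"])
            dst = (parent, row[column])
            if dst not in nodes:
                raise KeyError(f"{src} references missing row {dst}")
            # Store the link in both directions so either side can be traversed.
            edges[src].add(dst)
            edges[dst].add(src)
    return nodes, dict(edges)


# Hypothetical hospital-style example: one patient table, one visit table.
tables = {
    "patients": [{"id": 1, "age": 54}, {"id": 2, "age": 31}],
    "visits": [
        {"id": 10, "patient_id": 1, "ward": "cardiology"},
        {"id": 11, "patient_id": 1, "ward": "radiology"},
        {"id": 12, "patient_id": 2, "ward": "cardiology"},
    ],
}
foreign_keys = [("visits", "patient_id", "patients")]

nodes, edges = build_row_graph(tables, foreign_keys)
print(sorted(edges[("patients", 1)]))
```

A generative model for relational data would then need to produce both the node attributes and the link structure; how SYDAI does this is not specified in this summary.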
On the privacy side, a new evaluation metric was designed, implemented, and validated across five real-world datasets. This metric measures the risk that an attacker could determine whether a specific individual's data was used to train the generative model, providing a practical and interpretable privacy signal. A complementary set of utility metrics, which assess how faithfully synthetic data reproduces the statistical patterns of the original, was also developed and validated across various datasets.
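The project's metric itself is not specified in this summary. To make the underlying idea of membership inference risk concrete, the sketch below implements a generic distance-based attack on a single numeric table: a record is guessed to be a training member when it lies close to some synthetic record, and the attack is scored by its AUC (0.5 means the attacker does no better than chance, 1.0 means perfect separation). The function names and the toy data are assumptions for illustration only, and the multi-table case addressed by the project is not covered.

```python
import math


def nearest_distance(record, synthetic):
    """Euclidean distance from a record to its closest synthetic record."""
    return min(math.dist(record, s) for s in synthetic)


def membership_auc(train, holdout, synthetic):
    """AUC of an attacker who scores records by closeness to synthetic data.

    train: records used to fit the generator (members).
    holdout: records from the same population that were not used (non-members).
    """
    train_scores = [-nearest_distance(r, synthetic) for r in train]
    holdout_scores = [-nearest_distance(r, synthetic) for r in holdout]
    wins = 0.0
    for t in train_scores:
        for h in holdout_scores:
            if t > h:
                wins += 1.0   # member ranked above non-member
            elif t == h:
                wins += 0.5   # ties count as half
    return wins / (len(train_scores) * len(holdout_scores))


# Toy data: a generator that copies its training records is maximally leaky.
train = [(0.0, 0.0), (1.0, 1.0)]
holdout = [(5.0, 5.0), (6.0, 6.0)]
leaky_synthetic = [(0.0, 0.0), (1.0, 1.0)]
print(membership_auc(train, holdout, leaky_synthetic))
```

In practice such a score would be estimated on many records and compared against a baseline, since a value near 0.5 on a small sample is weak evidence of privacy.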
To help organizations handle sensitive data safely from the outset, a prototype for automated detection of personally identifiable information was developed using a state-of-the-art named entity recognition model. The system identifies sensitive fields such as names, addresses, and identification numbers in various languages and was tested against a benchmark dataset. Finally, the underlying data generation platform was refactored to support more modular and flexible synthesis workflows, and new features were introduced to enable teams within an organization to share resources and configurations, reducing duplication of effort and lowering the cost of deployment.
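The prototype described above relies on a named entity recognition model, which is not reproduced here. As a much simpler stand-in that shows the detect-then-mask step, the sketch below swaps in regular expressions for two easy categories (email addresses and phone numbers). The patterns, labels, and the `mask_pii` helper are hypothetical; they would miss names and addresses, which is where an NER model is needed.

```python
import re

# Deliberately simple, illustrative patterns; real identifiers vary by country.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def mask_pii(text):
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(mask_pii("Contact anna@example.org or +39 040 123 4567."))
```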

The privacy metric developed is the first computationally practical membership inference-inspired metric applicable to multi-table relational data, filling a recognized gap in the literature. Two publications were accepted at AAAI (2025 and 2026), covering relational data generation and constrained synthesis. Two arXiv preprints, one on probabilistic circuit-based generation and one on privacy metrics for structured synthetic data, are also associated with the project. Two PhD theses directly aligned with SYDAI objectives were produced at the University of Trieste. The expected impact includes reduced barriers to privacy-compliant data sharing in regulated sectors, enabling accelerated AI development on sensitive datasets while supporting GDPR compliance.

[Figure: Example datasets 1]

[Figure: Example datasets 2]
