The team began with a comprehensive review of existing approaches to synthetic data generation, identifying key limitations in handling relational data structures — datasets where information is distributed across multiple linked tables, as in hospital or financial records. This review informed the design of a novel graph-based method for representing and generating relational data, which was subsequently prototyped using a large language model and shown to match or outperform existing approaches on standard benchmarks.
On the privacy side, a new evaluation metric was designed, implemented, and validated across five real-world datasets. This metric measures the risk that an attacker could determine whether a specific individual's data was used to train the generative model, providing a practical and interpretable privacy signal. A complementary set of utility metrics, which assess how faithfully synthetic data reproduces the statistical patterns of the original, was also developed and validated across various datasets.
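To make the privacy question concrete, the sketch below shows one common way to estimate this kind of risk. It is not the project's metric, whose definition is not given here. It assumes a simple distance-based attack: the attacker guesses that a record was in the training data if it lies unusually close to some synthetic record. The function names (`nearest_distance`, `membership_auc`) and the toy records are hypothetical, and records are assumed to be numeric tuples of equal length.

```python
import math
from typing import Sequence

Record = Sequence[float]


def nearest_distance(record: Record, synthetic: Sequence[Record]) -> float:
    """Euclidean distance from one real record to its closest synthetic record."""
    return min(math.dist(record, s) for s in synthetic)


def membership_auc(
    members: Sequence[Record],
    non_members: Sequence[Record],
    synthetic: Sequence[Record],
) -> float:
    """Illustrative attack score (assumption, not the report's metric).

    Each record is scored by the negated distance to its nearest synthetic
    record, so closer records get higher scores. The result is the AUC of
    that score: the fraction of (member, non-member) pairs in which the
    member scores higher, with ties counted as half.
    """
    member_scores = [-nearest_distance(r, synthetic) for r in members]
    holdout_scores = [-nearest_distance(r, synthetic) for r in non_members]

    wins = 0.0
    for m in member_scores:
        for h in holdout_scores:
            if m > h:
                wins += 1.0
            elif m == h:
                wins += 0.5
    return wins / (len(member_scores) * len(holdout_scores))


if __name__ == "__main__":
    # Toy data: the synthetic set copies the training records exactly,
    # so this attack separates members from non-members perfectly.
    train = [(0.0, 0.0), (1.0, 1.0)]
    holdout = [(5.0, 5.0), (9.0, 9.0)]
    synthetic = [(0.0, 0.0), (1.0, 1.0)]
    print(membership_auc(train, holdout, synthetic))
```

Under these assumptions, a value near 0.5 means the attack does no better than guessing, while a value near 1.0 means training records can be reliably singled out. A held-out set drawn from the same population as the training data is needed for the comparison to be meaningful.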
To help organizations handle sensitive data safely from the outset, a prototype for automated detection of personally identifiable information was developed using a state-of-the-art named entity recognition model. The system identifies sensitive fields such as names, addresses, and identification numbers in various languages and was tested against a benchmark dataset. Finally, the underlying data generation platform was refactored to support more modular and flexible synthesis workflows, and new features were introduced to enable teams within an organization to share resources and configurations, reducing duplication of effort and lowering the cost of deployment.