Synthesising Inductive Data Models

Periodic Reporting for period 4 - Synth (Synthesising Inductive Data Models)

Reporting period: 2021-03-01 to 2022-02-28

The availability of data and the possibilities for analysis are revolutionising our society and our businesses. But the data science process is painful and requires highly skilled experts. Inspired by recent successes in AI at automating highly complex tasks, the goal of Synth is to automate the task of the data scientist. This should democratise data science.

Synth aims to automate data science by developing the foundations of a theory and methodology for automatically synthesising inductive data models. An inductive data model (IDM) consists of 1) a data model (DM) that specifies an adequate data structure for the dataset (just like a database or a spreadsheet), and 2) a set of inductive models (IMs), that is, a set of patterns and models that have been discovered in the data. While the DM can be used to retrieve information about the dataset and to answer questions about specific data points, the IMs can be used to make predictions, propose values for missing data, identify outliers, and find inconsistencies or violations of constraints. The goal of Synth is to automatically synthesise such inductive data models from past data with only minimal supervision by a data scientist, that is, to democratise the task of the data scientist. It is assumed that the dataset consists of a set of tables in a spreadsheet or relational database, that the end-user interacts with the IDM via a visual interface, and that the data scientist has access to a unifying IDM language offering a number of core IMs and learning algorithms.
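To make the notion concrete, the following is a minimal illustrative sketch in Python of an IDM as a data model plus a set of inductive models. This is not Synth's implementation; the class and method names are hypothetical and chosen only to mirror the definition above.

    # Illustrative sketch only (hypothetical names, not the Synth system).
    from dataclasses import dataclass, field
    from typing import Callable

    @dataclass
    class InductiveDataModel:
        tables: dict[str, list[dict]]                  # DM: named tables of rows
        inductive_models: list[Callable] = field(default_factory=list)  # IMs

        def query(self, table: str, **conditions) -> list[dict]:
            """Use the DM to answer questions about specific data points."""
            return [row for row in self.tables[table]
                    if all(row.get(k) == v for k, v in conditions.items())]

        def complete(self, row: dict) -> dict:
            """Use the IMs to propose values for missing data in a row."""
            for im in self.inductive_models:
                row = im(row)   # each IM may fill in or check fields
            return row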

The key challenges tackled in Synth are: 1) the system must "learn the learning task": it should identify the right learning tasks and learn appropriate IMs for each of these; 2) the system may need to restructure the dataset before IM synthesis can start: it should perform the data wrangling step automatically; and 3) a unifying IDM language for a set of core patterns and models is needed to support the data scientist.
We have made significant contributions to the overall vision of automating data science. More specifically:

1) We have established probabilistic logic programming, in particular the ProbLog language, as a framework for inductive modelling. ProbLog tightly integrates logic and probability, and now also neural networks. While it was already possible to learn the parameters and the structure (rules) of ProbLog programs, Synth contributed many novel reasoning and learning techniques, in particular for mixed discrete and continuous distributions, often based on weighted model integration. Most surprising, and unplanned, was that we were able to integrate neural networks with probabilistic logic programming using the concept of a neural predicate. The resulting DeepProbLog framework for neurosymbolic and probabilistic logic AI has inspired a new methodology, “From StarAI to NeSy AI”.
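To give a flavour of the kind of program ProbLog supports, here is a standard textbook-style example (a social-network smokers model, not Synth-specific code) run through the publicly available problog Python package:

    # Requires: pip install problog
    from problog.program import PrologString
    from problog import get_evaluatable

    model = PrologString("""
    0.3::stress(ann).
    0.2::influences(ann, bob).
    smokes(X) :- stress(X).
    smokes(X) :- influences(Y, X), smokes(Y).
    query(smokes(bob)).
    """)

    # Compile the probabilistic program and compute the probability
    # of each query atom by weighted model counting.
    result = get_evaluatable().create_from(model).evaluate()
    print(result)   # {smokes(bob): 0.06}, i.e. 0.3 * 0.2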

2) Data wrangling is an often-required preprocessing step in data science that restructures the data into a format that is amenable to machine learning. We have contributed several algorithms for automating data wrangling in spreadsheets: Synth-a-Sizer uses heuristic search to transform spreadsheets into attribute-value format, Muppets automatically infers semantic types from spreadsheets, Avatar automatically determines relevant features by data wrangling at the feature level, SplyCi fuses several spreadsheets, and Mistle restructures and compresses binary data. Finally, FlashGPT3 combines the GPT-3 language model with Excel's Flash Fill to synthesise semantic transformations from a few examples in spreadsheets.
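As an illustration of the wrangling task itself, the generic pandas sketch below unpivots a cross-tabulated spreadsheet into the attribute-value format most learners expect. This is not the Synth-a-Sizer algorithm, and the data is made up.

    import pandas as pd

    # A cross-tabulated spreadsheet: one column per month (hypothetical data).
    wide = pd.DataFrame({
        "product": ["apples", "pears"],
        "jan": [10, 7],
        "feb": [12, 9],
    })

    # Attribute-value format: one row per (product, month) observation.
    long = wide.melt(id_vars="product", var_name="month", value_name="sales")
    print(long)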

3) While constraints are commonly used in AI, only a few approaches learn constraints from examples. Synth has drawn much more attention to this important problem. We have contributed many techniques for learning constraints from examples, in different settings and for different representations. This includes learning Excel formulae in spreadsheets, learning SMT(LRA) formulae, mathematical programs, counting constraints, and WMI formulae. In combinatorial optimisation, one uses not only constraints but also optimisation functions. Existing approaches were able to learn either the constraints or the optimisation function, but not both at the same time. We contributed an appealing setting and approach for learning both components simultaneously from contextual examples.
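The following toy sketch conveys the flavour of learning constraints from examples, using a far simpler hypothesis language (per-variable interval bounds consistent with all positive examples) than the SMT(LRA), counting, and WMI representations handled in Synth. All names and data are hypothetical.

    # Toy sketch: learn the tightest interval constraint per variable
    # that is consistent with a set of positive examples.
    def learn_intervals(examples: list[dict]) -> dict:
        bounds = {}
        for ex in examples:
            for var, val in ex.items():
                lo, hi = bounds.get(var, (val, val))
                bounds[var] = (min(lo, val), max(hi, val))
        return bounds

    examples = [{"hours": 8, "breaks": 1}, {"hours": 10, "breaks": 2}]
    print(learn_intervals(examples))
    # {'hours': (8, 10), 'breaks': (1, 2)}
    # i.e. the learned constraints 8 <= hours <= 10 and 1 <= breaks <= 2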

4) A simple yet operational view on automated data science is that of autocompletion in a spreadsheet. Imagine that the end-user of spreadsheet software is filling out some entries, that there are regularities in the data, and that the data has been entered in a systematic manner. The autocompletion task is then to automatically predict the right values for the remaining cells, together with an estimate of the confidence of each prediction. Solving the autocompletion task is, in a nutshell, the overall task addressed by Synth, as it requires solutions to all of the addressed problems, from restructuring the data to learning constraints and performing inference with the resulting models and constraints. Several approaches to autocompletion have been contributed. MERCS and Psyche focus on autocompletion in a single table, with MERCS employing multi-directional ensembles of multi-target decision trees and Psyche using probabilistic inference to combine constraints with predictive models, while Dice-ML is an extended multi-relational autocompletion approach based on hybrid ProbLog programs.
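A drastically simplified, MERCS-inspired sketch of the autocompletion idea follows: train one predictor per column from the remaining columns, so that any missing cell can be filled in whichever attribute happens to be unknown. It uses single-target scikit-learn trees rather than MERCS' multi-directional ensembles of multi-target trees, and the data and names are made up.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # A small, fully observed table (hypothetical data).
    table = pd.DataFrame({
        "weekday": [0, 1, 2, 3, 4, 0, 1],
        "shift":   [1, 1, 0, 0, 1, 1, 1],
        "staffed": [1, 1, 0, 0, 1, 1, 1],
    })

    # One decision tree per column, predicting it from all other columns.
    models = {
        target: DecisionTreeClassifier().fit(table.drop(columns=target),
                                             table[target])
        for target in table.columns
    }

    # Autocomplete the 'staffed' cell of a new, partially filled row.
    new_row = pd.DataFrame({"weekday": [2], "shift": [0]})
    print(models["staffed"].predict(new_row))   # -> [0]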

We have also used sketches and colourings as a basis for guiding the automated data science process in a visual interface built on top of spreadsheet software. A prototype of the resulting VisualSynth framework encompasses data wrangling, autocompletion, constraint learning, and anomaly detection. Various demos have been contributed, including demos for learning formulae in Excel, for constraint learning, and for automated data science in general.

5) Synth's algorithms and approaches have been applied to learning constraints and models for combinatorial optimisation (such as nurse rostering) and for robotics, and have also provided inspiration for other applications involving data wrangling and inductive programming.

The results of Synth have been disseminated in numerous keynote and tutorial presentations at major conferences such as AAAI, IJCAI, CCAI, ECAI, and ESWC.
When comparing Synth to the state of the art, the following points are important.
First, while most approaches to automating data science and machine learning focus on the modelling step, Synth focuses on the overall data science process.
Second, Synth focuses on integrating learning and reasoning, with probabilistic, logical, and neural components in a unifying language called ProbLog. Synth also devoted a lot of attention to learning constraints.
Third, Synth's grand challenge is to democratise data science. Synth addresses this via its predictive autocompletion setting and its visual interface based on sketching.
Fourth, Synth aimed at identifying a small and principled set of necessary components for automated data science and developed a unifying language for this.
(Figure: the Synth vision)