Logic and Automata over Sequences with Data

Project Information

LASD

Grant agreement ID: 101089343

DOI

10.3030/101089343

EC signature date: 25 May 2023

Start date: 1 August 2023

End date: 31 July 2028

Funded by

European Research Council (ERC)

Total cost

€ 1 998 956,00

EU contribution

€ 1 998 956,00

Coordinated by

RHEINLAND-PFALZISCHE TECHNISCHE UNIVERSITAT
Germany

Periodic Reporting for period 1 - LASD (Logic and Automata over Sequences with Data)

Reporting period: 2023-08-01 to 2026-01-31

Modern digital technologies—from online platforms and mobile apps to financial trading systems and medical software—rely heavily on reasoning about sequences of data. These sequences may be strings of text, lists of numbers, program arrays, or time-series data. While current computer science methods handle sequences over small, fixed alphabets very well, real-world applications involve infinitely many possible data values, such as numbers, strings, or identifiers.

This mismatch creates major difficulties for the verification of software, the analysis of data-intensive applications, and the explainability of machine learning models. Many critical questions remain extremely hard to answer with today’s methods: for example, whether a list-manipulating program is error-free, how to answer a graph database query, or how to explain the behavior of a sequential neural model (e.g. a recurrent neural network or a transformer).

The ERC Consolidator Grant LASD (Logic and Automata over Sequences with Data) addresses these challenges by developing new mathematical models (inspired by logic and automata), algorithms, and software tools for reasoning about sequences with complex data. The project focuses on three interconnected scientific hurdles:
- Complex data reasoning: creating expressive yet decidable models that allow arithmetic, string, and other advanced operations.
- Relational reasoning: handling relations between multiple data sequences, e.g. as needed in program analysis.
- Aggregation: supporting operations such as sum, average, or median, which are essential for database applications, among others.

The expected impact is twofold: advancing fundamental computer science by overcoming long-standing barriers in automata and logic, and developing prototypes that showcase the potential for technology transfer in the domains of graph databases, program verification, and interpretable AI.

The project aims to make advances in the following areas:

WP1 (Data languages modulo theory): the goal is to develop expressive models for data languages modulo theory that are still amenable to algorithmic analysis. The main achievements so far include (i) Parikh's theorem for symbolic automata, (ii) an abstract interpretation method for array systems, and (iii) new models inspired by the theory of neural networks.

WP2 (Querying data graphs): we aim to conduct theoretical analyses of GQL, a new language standard for graph database queries, and to develop a new tool for query answering that involves complex data constraints (i.e. the first scalable tool for querying constraint databases).

WP3 (Relational reasoning over sequences): we are developing multiple techniques for analyzing complex constraints over sequences, and applying them to program analysis. In particular, we are studying SMT solving over sequences and lifting it to program analysis via the framework of Constrained Horn Clauses (CHC). We have implemented some of these techniques in the solvers HornStr and OSTRICH2, which can handle large (Unicode) alphabets via the framework of symbolic automata modulo theory.
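
To illustrate the idea behind symbolic automata, the following minimal sketch labels transitions with predicates over characters instead of individual letters, so a single transition can cover a whole Unicode character class. It is only an illustration of the general concept, not the implementation used in HornStr or OSTRICH2; the class `SymbolicAutomaton` and the example automaton `identifier` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

# A transition guard: a predicate over single characters (an assumption of
# this sketch; real tools use predicates from a decidable theory).
Predicate = Callable[[str], bool]


@dataclass
class SymbolicAutomaton:
    initial: int
    accepting: Set[int]
    # transitions[state] is a list of (predicate, target state) pairs
    transitions: Dict[int, List[Tuple[Predicate, int]]]

    def accepts(self, word: str) -> bool:
        """Simulate the (possibly nondeterministic) automaton on `word`."""
        current = {self.initial}
        for ch in word:
            current = {
                target
                for state in current
                for predicate, target in self.transitions.get(state, [])
                if predicate(ch)
            }
            if not current:
                return False
        return bool(current & self.accepting)


# Hypothetical example: one letter followed by any number of letters or
# digits, over the full Unicode alphabet, using only two transitions.
identifier = SymbolicAutomaton(
    initial=0,
    accepting={1},
    transitions={
        0: [(str.isalpha, 1)],
        1: [(str.isalnum, 1)],
    },
)
```

With explicit letters, the same language would need one transition per Unicode letter; with predicates, the size of the automaton is independent of the size of the alphabet.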

WP4 (Data aggregation): Initial progress has been made in identifying decidable models (called Register Automata with Accumulator) that could incorporate aggregation functions such as sum or median into logical frameworks. We have developed a first prototype of the algorithm for analyzing the model.
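
As a rough intuition for how a register and an accumulator can interact, the toy sketch below reads a sequence of integers, stores the first value in a register, sums the remaining values in an accumulator, and accepts if the two agree. This is an assumption-laden illustration only: it is not the formal definition of the project's model, and the function name `accepts_sum_of_rest` is hypothetical.

```python
from typing import Sequence


def accepts_sum_of_rest(values: Sequence[int]) -> bool:
    """Accept iff the first value equals the sum of all later values."""
    if not values:
        return False
    register = values[0]   # stores one data value for later comparison
    accumulator = 0        # aggregates (here: sums) the values read afterwards
    for value in values[1:]:
        accumulator += value
    return accumulator == register
```

For instance, the sequence 6, 1, 2, 3 is accepted, whereas 5, 1, 2 is rejected. A property of this kind cannot be checked by comparing stored values for equality alone, which is why aggregation requires a dedicated mechanism.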

WP5 (Learning specifications from data sequences): We have begun exploring suitable methods for extracting specifications from neural models. In particular, we now have several algorithms for extracting explainable models from transformers and Recurrent Neural Networks (in terms of formalisms that we developed in earlier WPs, e.g. Register Automata with Accumulator). We are also exploring new connections between formal language theory and machine learning; the goal is to develop models that inherit learnability from ML while permitting algorithmic analysis (via verification).

Overall, the project has produced a mix of theoretical contributions (published at top venues such as CAV, OOPSLA, POPL, ICLR, and NeurIPS) and tool prototypes (e.g. the OSTRICH2 solver and a CHC solver), while at the same time laying groundwork in areas where more results are expected in the second half of the project. Together, these outcomes demonstrate steady progress towards the project’s overarching aim of developing expressive models over sequences (inspired by logic and automata) and algorithmic methods for analyzing them.

- Breaking undecidability/intractability barriers: (i) Efficient Parikh's theorem for symbolic automata and (ii) abstract interpretation for array systems.
- SMT solver with complex operations (concatenation and transducers), and a CHC solver on top of it.
- First constraint database system.
- Cross-fertilization with machine learning: new explainable models inspired by formal languages and machine learning.
