Final Report Summary - SIMPL (Specification and Implementation of Pattern Languages)
Regular expressions [12, 10] are a formalism ideally suited to specification and implementation with formal methods. They are essential for text processing and form the basis of most markup schema languages. They are used in syntax highlighting, data validation, speech processing, optical character recognition, and many other settings where patterns must be recognised in data. Extended versions of regular expressions are used in search engines such as Google Code Search. The term “regular expression” means different things in programming and in theoretical computer science: practical implementations add features, such as backreferences, that go beyond the regular expressions of the theory. Moreover, each piece of software based on regular expressions has its own “RegEx flavour”: ECMAScript, Perl-style, GNU RegEx, Microsoft Word, POSIX Basic/Extended RegEx (with extensions), Vim, and many others. In this project, I worked with an algebraic definition of regular expression matching based on the concept of partial derivatives, and I extended this algebraic matching to account for backreferences. The project has yielded several peer-reviewed papers [7, 3, 6, 5, 9, 8].
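The gap between the two meanings of “regular expression” can be seen in a small example. The sketch below uses Python's standard `re` module (a Perl-style flavour, chosen here only for illustration and not the project's own tooling) with a backreference. The pattern and the helper name `matches` are assumptions made for this example.

```python
import re

# Pattern with a backreference: \1 must repeat exactly what group 1 captured.
# It describes the words a^n b a^n (n >= 0), which do not form a regular
# language, so no classical regular expression can express it.
pattern = re.compile(r"(a*)b\1")

def matches(word: str) -> bool:
    """Return True if the whole word matches the backreference pattern."""
    return pattern.fullmatch(word) is not None

for word in ["b", "aba", "aabaa", "aaba", "abaa"]:
    print(word, matches(word))
```

Only the words with equally many `a` characters on both sides of the `b` are accepted, which is exactly the kind of behaviour an extended matching theory has to account for.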
The following objectives have been achieved:
Objective 1. A theoretical representation of extended (or practical) regular expressions in constructive dependent type theory.
Objective 2. A Coq library for regular languages and automata that includes features not present in available related libraries, such as backreferences and partial derivatives of regular expressions.
Objective 3. A formally certified grep-like extended regular expression parser (in other words, a formally certified compiler of extended regular expressions into finite automata).
A much broader aim of SImPL was to help provide robust and transparent data infrastructure for the future Internet (part of the European Commission ICT Challenge 1: Pervasive and Trustworthy Network and Service Infrastructures). The primary object of research was data rather than computation, in the sense of the duality emphasised in the seminal paper [2] with respect to [11]. The intended application of the project's results is therefore formal data certification; proving computational correctness was not a specific goal. However, because regular expressions are foundational, for example in relation to concurrency, the results on decision methods for extended regular expressions can be employed in proving the correctness of data parallelism.