Algorithmic Verification of String Manipulating Programs

Periodic Reporting for period 4 - AV-SMP (Algorithmic Verification of String Manipulating Programs)

Reporting period: 2022-03-01 to 2022-10-31

Strings have always been an important data type in programming languages. With the rise in popularity of scripting languages like JavaScript, Python, and PHP, this statement is truer than ever. Despite this, string manipulation is error-prone, which can lead to serious security vulnerabilities, including cross-site scripting (XSS) and other types of code injection (e.g. SQL injection). Analyzing and verifying the correctness of string-manipulating programs requires string reasoning that is currently beyond the state of the art. The goals of the project are to develop novel algorithms for verifying string-manipulating programs (including important properties like safety and termination) and to turn them into robust verification tools. In particular, this involves designing "well-behaved" constraint languages over strings (e.g. permitting decidability with good complexity) and semi-algorithms for verifying string-manipulating programs with theoretical guarantees. This is an extremely challenging task, which might lead us to solving a long-standing open problem. We will also develop novel implementation techniques that can overcome the inherent worst-case computational complexity of string constraint solving. Finally, as a proof of concept, we will apply our technologies to two key application domains: (1) automatic detection of XSS vulnerabilities in web applications, and (2) automatic grading systems for a programming course.
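To make the claim about error-prone string manipulation concrete, here is a minimal sketch of a well-known sanitization pitfall. The function name `naive_sanitize` and the payload are hypothetical illustrations and are not taken from the project's benchmarks: a single replace-all pass that deletes an opening script tag can itself assemble a new one from the surrounding characters.

```python
def naive_sanitize(user_input: str) -> str:
    """Hypothetical sanitizer: delete every occurrence of "<script>" in one pass."""
    return user_input.replace("<script>", "")


# The tag is split around an inner copy of itself; deleting the inner copy
# glues the two outer halves back together into a fresh "<script>".
payload = "<scr<script>ipt>alert(1)</script>"
sanitized = naive_sanitize(payload)

print(sanitized)  # <script>alert(1)</script>
```

Checking that no input can produce a dangerous output after such a replace-all is the kind of question a string constraint solver is meant to answer automatically, rather than by testing individual payloads.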

Conclusion of the action: the area of algorithmic verification of string-manipulating programs has come a long way since the start of the ERC project AV-SMP. We now have a better understanding of which string logics are decidable, which kinds of string operations result in undecidability, and which restrictions recover decidability. This is to a large extent owing to the AV-SMP publications, which initiated the exploration of string logics with "complex operations" including replace-all, transductions, real-life regular expressions, etc. During the lifespan of AV-SMP, we have developed three string solvers (OSTRICH, SLOTH, and CertiStr). Team members of AV-SMP have been an integral part of the SMT-LIB discussion that resulted in a "standard string logic", with which the SMT Competition for string solvers started (in 2021). Our solver OSTRICH was listed #1 in the category "largest contribution for unsatisfiable instances" in SMT Competition 2022. Our team has also led the effort of integrating string solving and program analysis techniques; most notably, we were the first to incorporate string solvers into symbolic execution tools (e.g. Aratha and Expose), which require a precise modeling of real-life regular expressions (with their intricate features like backreferences, greediness, etc.) using finite-state transducers. Of course, real-life programs also make use of other data types (e.g. integers and arrays). For example, the length and str2num functions map strings to integers, while JavaScript functions like match and split output an array of strings. We were the first to integrate such functions into a string solver, although a generic solution for integrating arrays and strings is believed to be a highly challenging problem that requires dedicated future research effort. Finally, our solver has successfully solved benchmarks derived from the aforementioned application domains (XSS vulnerabilities and automatic grading systems).
Our implementations have always been intended as proofs of concept, although with further engineering effort it should be possible to incorporate our ideas and initial implementations into real-life systems (e.g. automatic grading systems for MOOCs). Towards a robust implementation, we have also provided the first verified string solver implementation (see our CPP'22 Distinguished Paper Award).
The project is divided into 5 work packages (WPs):
1. Decidability of Constraint Languages over Strings
2. Practical algorithms for constraint languages over strings
3. Semi-algorithms for verifying string programs
4. Extension to combination with other data types (e.g. integers and arrays)
5. Applications

WP 1, WP 2: Four resulting articles were published at POPL, one at CAV, and one at ICALP, three of the most prestigious venues for the topic. As for WP 1, we have successfully delineated the boundary of decidability for string constraints: we have come up with an expressive general framework for string constraint solving with complex constraints which, in particular, can capture transductions used for web templating. We have also made a small step towards the most difficult problem, the decidability of word equations with length constraints, by highlighting a new connection with the existential theory of Presburger arithmetic with divisibility. In addition, we have identified one of the most expressive decidable fragments of word equations with length constraints known to date. In general, adding length constraints is possible when we restrict to "straight-line" fragments while at the same time restricting the transducers. However, despite decidability, length constraints still pose major problems for string constraint solving. To this end, we have provided a generic method based on Parikh automata/cost-register automata, as well as "approximate" methods called monadic decomposition, which perform better in practice and aim to remove length constraints from the string constraints. We also showed how real-world regular expressions can be incorporated into a string solver with a precise transducer encoding. Striving to provide a robust implementation, we gave the first verified implementation of a string solver, which won the CPP'22 Distinguished Paper Award.
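For readers unfamiliar with word equations with length constraints, the following is a small illustrative sketch. It is a bounded brute-force search written for this summary, not the project's algorithm and not how OSTRICH works; the equation, the alphabet, and the name `solve_bounded` are assumptions chosen for illustration. It enumerates solutions of the word equation x·"ab" = "ba"·x over the alphabet {a, b} and then filters them by a length constraint.

```python
from itertools import product
from typing import Iterator


def solve_bounded(max_len: int, alphabet: str = "ab") -> Iterator[str]:
    """Yield every x with |x| <= max_len such that x + "ab" == "ba" + x."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            x = "".join(letters)
            if x + "ab" == "ba" + x:
                yield x


all_solutions = list(solve_bounded(5))
# Length constraint |x| = 3 keeps exactly one candidate.
length_three = [x for x in all_solutions if len(x) == 3]
# Length constraint "|x| is even" has no witness within the bound.
even_length = [x for x in all_solutions if len(x) % 2 == 0]

print(all_solutions)  # ['b', 'bab', 'babab']
print(length_three)   # ['bab']
print(even_length)    # []
```

Such an enumeration can only find solutions up to the chosen bound; it cannot prove that no longer solution exists. Establishing unsatisfiability for all lengths is where decidable fragments and dedicated decision procedures are needed.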

WP 3: Publication venues include LICS, POPL, CAV, and AAMAS, which are prestigious conferences for automated reasoning and verification. We developed techniques for verifying safety and liveness, as well as for program synthesis, in the context of restricted string-manipulating programs. We introduced the important notion of "directed Ramsey quantifiers". Finally, we showed how our verification framework can be extended to a rich class of properties by using existential second-order specifications.

WP 4: We provided methods for handling length functions and several functions (including match) that output arrays of strings. To provide more general solutions that handle strings and arrays, we considered the theory of sequences. We have made initial advances in this direction, resulting in publications at LICS and PODS, which are prestigious conferences for logic in computer science.

WP 5: Notable publication venues include CAV, POPL, TOPLAS, and PODS. We showed that string solving has numerous applications, including reasoning about permissions in concurrent programs, analysis of web applications, testing and bug detection, and database applications.
We summarize the achievements of the project so far as follows:
1. The development of a decidable string constraint language with complex string operations (including concatenation, real-world regular expressions, replace-all, length constraints, and transducers) and an efficient solver for it (OSTRICH). This is the first string solver that supports complex string operations (including replace-all, real-world regular expressions, and transducers) while remaining competitive with existing solvers on benchmarks without these complex operations.
2. The most expressive known fragment of word equations with length constraints. We have shown a strong connection between word equations with length constraints and existential Presburger arithmetic with divisibility. This is the first decidability result that exploits this connection.
3. Applications to verification of string-manipulating programs. In particular, the verification of complex properties (e.g. anonymity) of models of distributed protocols using a string constraint engine.
4. Certified String Solver (Distinguished Paper Award of CPP'22)
5. OSTRICH was listed #1 in SMT Competition 2022 in the category "largest contributions for unsatisfiable instances for string constraints".
6. Precise modelling of real-world regular expressions in an SMT-solver (using transducers) and integration of OSTRICH with a symbolic execution engine for JavaScript (called Aratha). [POPL'22]
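
The "intricate features" of real-world regular expressions mentioned in item 6 and in the conclusion (greediness, backreferences) can be observed directly in a mainstream regex engine. The sketch below uses Python's standard `re` module purely as an illustration; the project's work targets JavaScript regular expressions, and the patterns here are hypothetical examples rather than project benchmarks.

```python
import re

# Greediness: both patterns accept the same input, but the capture groups
# split it differently, so a solver must model *which* substring each
# group captures and not only whether the input matches.
greedy = re.match(r"(a+)(a*)", "aaa").groups()
lazy = re.match(r"(a+?)(a*)", "aaa").groups()

# Backreference: \1 must repeat exactly what group 1 captured.
repeated = re.fullmatch(r"(ab|ba)\1", "abab") is not None
mixed = re.fullmatch(r"(ab|ba)\1", "abba") is not None

print(greedy)    # ('aaa', '')
print(lazy)      # ('a', 'aa')
print(repeated)  # True
print(mixed)     # False
```

Because capture groups return particular substrings and backreferences compare them, a classical regular-language membership test does not describe this behaviour on its own, which is why a more precise encoding (transducers, in the project's approach) is needed.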