Algorithmic Verification of String Manipulating Programs

Periodic Reporting for period 3 - AV-SMP (Algorithmic Verification of String Manipulating Programs)

Reporting period: 2020-09-01 to 2022-02-28

Strings have always been an important data type in programming languages. With the rise in popularity of scripting languages like JavaScript, Python, and PHP, this statement is truer than ever. Despite this, string manipulation is error-prone, and mistakes can lead to serious security vulnerabilities, including cross-site scripting (XSS) and other types of code injection (e.g. SQL injection). Analyzing and verifying the correctness of string-manipulating programs requires string reasoning that is currently beyond the state of the art. The goals of the project are to develop novel algorithms for verifying string-manipulating programs (including important properties like safety and termination) and to turn them into robust verification tools. In particular, this involves designing "well-behaved" constraint languages over strings (e.g. permitting decidability with good complexity), and semi-algorithms for verifying string-manipulating programs with theoretical guarantees. This is an extremely challenging task, which might lead to the solution of a long-standing open problem. We will also develop novel implementation techniques that can overcome the inherent worst-case computational complexity of string constraint solving. Finally, as a proof of concept, we will apply our technologies to two key application domains: (1) automatic detection of XSS vulnerabilities in web applications, and (2) automatic grading systems for a programming course.
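To make the claim about error-prone string manipulation concrete, the following is a minimal sketch of a classic sanitization mistake. The functions `naive_sanitize` and `render_comment` and the payload are hypothetical illustrations, not code from the project.

```python
def naive_sanitize(user_input: str) -> str:
    """Hypothetical sanitizer: delete every literal "<script>" in a single pass."""
    return user_input.replace("<script>", "")


def render_comment(user_input: str) -> str:
    """Hypothetical template step: embed the 'sanitized' input in HTML."""
    return "<p>" + naive_sanitize(user_input) + "</p>"


# Deleting the inner "<script>" glues "<scr" and "ipt>" into a fresh tag.
payload = "<scr<script>ipt>alert(1)</script>"
page = render_comment(payload)
print(page)  # <p><script>alert(1)</script></p>
```

A single-pass replacement does not re-scan its own output, so the removed substring can reappear. Proving that no input can produce such an output is the kind of question a string constraint solver is meant to answer.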
The project is divided into 5 work packages (WPs):
1. Decidability of Constraint Languages over Strings
2. Practical algorithms for constraint languages over strings
3. Semi-algorithms for verifying string programs
4. Extension to combination with other data types (e.g. integers and arrays)
5. Applications

Most of the developments in the first half of the project concern WP 1 and WP 2, where we have met most of the proposed activities and goals. Representative publications are Articles 1, 2, 3, 5, and 6 (see Publications), three of which were published at POPL and one at ICALP, two of the most prestigious venues for the topic. As for WP 1, we have successfully delineated the boundary of decidability for string constraints by developing a general framework for string constraint solving with complex constraints (see Article 1). In particular, the framework allows very expressive transducers that can capture the transductions used in web templating. We have also made a small step towards the most difficult problem, the decidability of word equations with length constraints, in Article 5, where we highlight a new connection with the existential theory of Presburger Arithmetic with divisibility. On the one hand, this connection shows that earlier techniques for proving decidability are unlikely to work, since they mostly rely on Presburger Arithmetic alone, which is insufficient. On the other hand, we have obtained one of the most expressive decidable fragments of word equations with length constraints known to date. In general, adding length constraints is possible when we restrict to "straight-line" fragments while also restricting the transducers (Article 2). However, length constraints still pose major problems for the string constraint solving part. To this end, we have studied a technique called monadic decomposition, which aims to remove length constraints from the string constraints (Article 6, which is improved in two articles recently accepted at IJCAR'20 and ATVA'20).
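For readers unfamiliar with the terminology, a word equation with a length constraint can be illustrated by a small example: find a string x such that x·"ab" = "ab"·x and |x| = n. The sketch below solves this particular instance by exhaustive search over a fixed alphabet. It is only an illustration of what such a constraint asks; the function name `solve_bounded` is hypothetical, and brute force is not the decision procedure developed in the project.

```python
from itertools import product
from typing import List


def solve_bounded(alphabet: str, length: int) -> List[str]:
    """Return every x over `alphabet` with len(x) == length and x + "ab" == "ab" + x."""
    solutions = []
    for letters in product(alphabet, repeat=length):
        x = "".join(letters)
        # Word equation: both sides must spell exactly the same string.
        if x + "ab" == "ab" + x:
            solutions.append(x)
    return solutions


print(solve_bounded("ab", 4))  # ['abab']
print(solve_bounded("ab", 3))  # []
```

The search space grows exponentially with the length bound, and a bounded search can never establish that no solution exists at all, which is why decidability results for such fragments matter.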

Article 6 also makes a first step towards WP 4, for string constraints with integer data types. We have made substantial contributions towards this in the last 6-8 months, which resulted in two papers recently accepted at IJCAR'20 and ATVA'20.

As for WP 3 and WP 5, we have made a number of advances. In particular, Articles 4, 7, and 8 represent applications of the results of this project. Article 4 applies string constraints to reasoning about permissions in concurrent programs. Article 7 applies string constraints to the optimization of cascading style sheets (in particular, to analysing CSS selectors). Finally, Article 8 applies string constraints to analysing probabilistic bisimulation of programs, with applications to checking anonymity of communication protocols. Article 8 also partly contributes to WP 3, since it develops a semi-algorithm that uses string constraint solving for program verification. Two of these results were published at TOPLAS and CAV, two of the most prestigious venues on the topic.
We summarize the achievements of the project so far as follows:
1. The development of a decidable string constraint language with complex string operations (including concatenation, real-world regular expressions, replaceall, length constraints, and transducers) and an efficient solver for it (OSTRICH). This is the first string solver that supports complex string operations (including replaceall, real-world regular expressions, and transducers), while remaining competitive with existing solvers on benchmarks without these complex operations.
2. The most expressive known fragment of word equations with length constraints. We have shown a strong connection between word equations with length constraints and existential Presburger with divisibility. This is the first decidability result that exploits this connection.
3. Applications to the verification of string-manipulating programs. In particular, the verification of complex properties (e.g. anonymity) of models of distributed protocols using a string constraint engine.
4. A certified string solver (Distinguished Paper Award at CPP'22).

An important next stage of the project is to develop a symbolic execution (i.e. program analysis) engine using our string constraint solver. This is currently the most active part of the project and is crucial for its success. It is also an extremely challenging part, and we have made initial progress in a recent paper published at POPL'22. In summary, we have started integrating OSTRICH into the symbolic execution engine Aratha. This has allowed us to generate string constraints from a program under test. One of the main challenges addressed in this paper is the use of real-world regular expressions, which may use complex features like greediness, capture groups, references, etc. We are still working to improve the robustness and precision of our string constraint generation.
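The regular-expression features mentioned above can be seen in a few lines using Python's standard `re` module. This is a minimal sketch of the matching semantics only, assuming that "references" means backreferences to capture groups; the patterns are illustrative and are not taken from the paper or its benchmarks.

```python
import re

# Greedy vs. lazy quantifiers accept the same strings,
# but split the input differently between capture groups.
greedy = re.match(r"(a+)(a*)", "aaa")
lazy = re.match(r"(a+?)(a*)", "aaa")
print(greedy.groups())  # ('aaa', '')
print(lazy.groups())    # ('a', 'aa')

# A backreference (\1) must repeat exactly what group 1 captured.
doubled = re.compile(r"(ab|a)\1")
print(bool(doubled.fullmatch("abab")))  # True
print(bool(doubled.fullmatch("aba")))   # False
```

Because the contents of a capture group depend on greediness and matching priority, a solver that models only the language accepted by a regular expression cannot predict the values that a program later reads from those groups.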

We have also been working on the robustness of OSTRICH as a standalone solver. In particular, we aim to prepare a submission to the SMT Competition 2022.