Strings have always been an important data type in programming languages. With the rise in popularity of scripting languages like JavaScript, Python, and PHP, this statement is truer than ever. Despite this, string manipulation is error-prone, and errors can lead to serious security vulnerabilities, including cross-site scripting (XSS) and other types of code injection (e.g. SQL injection). Analyzing and verifying the correctness of string-manipulating programs requires string reasoning that is currently beyond the state of the art. The goals of the project are to develop novel algorithms for verifying string-manipulating programs (including important properties like safety and termination) and to turn these algorithms into robust verification tools. In particular, this involves designing "well-behaved" constraint languages over strings (e.g. permitting decidability with good complexity), and semi-algorithms with theoretical guarantees for verifying string-manipulating programs. This is an extremely challenging task, which may lead to the solution of a long-standing open problem. We will also develop novel implementation techniques that can overcome the inherent worst-case computational complexity of string constraint solving. Finally, as a proof of concept, we will apply our technologies to two key application domains: (1) automatic detection of XSS vulnerabilities in web applications, and (2) automatic grading systems for a programming course.
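To make the kind of error described above concrete, the following is a minimal Python sketch, not taken from the project. The sanitizer `sanitize` and the search routine `find_bypass` are hypothetical names introduced only for illustration. The sanitizer deletes every occurrence of a forbidden tag in a single replace-all pass, and the brute-force search plays the role that a string solver would play symbolically: it looks for an input whose sanitized form still contains the tag.

```python
from itertools import product

FORBIDDEN = "<script>"

def sanitize(s: str) -> str:
    # Hypothetical sanitizer: one replace-all pass that deletes the tag.
    return s.replace(FORBIDDEN, "")

def find_bypass(max_pieces: int = 3):
    # Brute-force stand-in for a string solver (an assumption of this sketch):
    # enumerate short concatenations of a few fragments and return the first
    # input whose sanitized form still contains the forbidden tag.
    pieces = ["<scr", "ipt>", FORBIDDEN]
    for n in range(1, max_pieces + 1):
        for combo in product(pieces, repeat=n):
            candidate = "".join(combo)
            if FORBIDDEN in sanitize(candidate):
                return candidate
    return None

if __name__ == "__main__":
    witness = find_bypass()
    print("bypass input:", witness)
    print("after sanitizing:", sanitize(witness) if witness else None)
```

The search succeeds because deleting the inner tag can splice the surrounding fragments into a new copy of it. A constraint language with replace-all lets a solver pose the same question over all strings rather than over a small enumerated set.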
Conclusion of the action: the area of algorithmic verification of string-manipulating programs has come a long way since the start of the ERC project AV-SMP. We now have a better understanding of which string logics are decidable, which kinds of string operations result in undecidability, and which restrictions recover decidability. This is to a large extent owing to the contributions of AV-SMP publications, which initiated the exploration of string logics with "complex operations" including replace-all, transductions, real-life regular expressions, etc. During the lifespan of AV-SMP, we have developed three string solvers (OSTRICH, SLOTH, and CertiStr). Team members of AV-SMP have been an integral part of the SMT-LIB discussion that resulted in a "standard string logic", with which the SMT Competition for string solvers started in 2021. Our solver OSTRICH was listed #1 in the category "largest contribution for unsatisfiable instances" in SMT Competition 2022. Our team has also led the effort of integrating string solving and program analysis techniques; most notably, we were the first to incorporate string solvers into symbolic execution tools (e.g. Aratha and ExpoSE), which requires a precise modeling of real-life regular expressions (with their intricate features like backreferences, greediness, etc.) using finite-state transducers. Of course, real-life programs also make use of other data types (e.g. integers and arrays). For example, the length function and the str2num function map strings to integers, while JavaScript functions like match and split output an array of strings. We were the first to integrate such functions into a string solver, although a generic solution for integrating arrays and strings is believed to be a highly challenging problem that requires dedicated future research effort. Finally, our solver has successfully solved benchmarks derived from the aforementioned application domains (XSS vulnerabilities and automatic grading systems).
Our implementations have always been intended as proofs of concept, although with further engineering effort it should be possible to incorporate our ideas and initial implementations into real-life systems (e.g. automatic grading systems for MOOCs). Towards a robust implementation, we have also provided the first verified string solver implementation (see our CPP'22 Distinguished Paper Award).