Periodic Reporting for period 4 - DEVINTA (An Artificial Assistant for Software Developers)
Reporting period: 2024-08-01 to 2025-07-31
The DEVINTA project aims to introduce models and techniques serving as the basis for the next generation of recommender systems supporting software developers. These recommenders are expected to help developers comprehend unfamiliar code and write high-quality code faster, thus reducing the considerable costs of developing and maintaining complex software. In particular, DEVINTA aims to support developers in different phases of the software lifecycle, tackling three main challenges:
1. Support developers in program comprehension activities by translating a given piece of code into natural language text.
2. Predict the feature a developer is working on while they implement software, and suggest how to automatically complete it.
3. Provide support for online code review, i.e., review in real time the code written by the developer, looking for possible quality issues.
We exploited deep learning (DL) models to automatically recommend to developers how to finalize an ongoing implementation task. DL models can be trained to "learn" how to deal with a specific task by looking at concrete examples (i.e., a training set). We provided the DL model with millions of examples of source code written by developers. We showed that DL models can correctly guess the next few code tokens the developer is likely to write in ~70% of cases. When the prediction task becomes more complex (i.e., predicting dozens of tokens), the performance drops to ~30%, with the DL model still being able to generate quite complex code snippets. This work has been presented at the MSR'21 conference and published in the TSE journal.
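To make the code-completion setting concrete, the sketch below shows how such a prediction could be queried from a generic pre-trained causal language model via the Hugging Face transformers API. This is only an illustration of the interaction, not our actual models or pipeline, and the checkpoint name is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "some-org/code-completion-model"  # hypothetical placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Partial implementation the developer is typing (Java, as in our studies).
prefix = "public int sum(int[] values) {\n    int total = 0;\n    for"

inputs = tokenizer(prefix, return_tensors="pt")
# Short completions (a few tokens) are the "easy" setting; asking for
# dozens of tokens is where accuracy drops from ~70% to ~30%.
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```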
Given the positive results we achieved, we investigated the extent to which DL models tend to copy code from their training set when recommending code. This research question is particularly important considering that most DL-based code recommenders have been trained on the source code of open source repositories, and it is unclear whether the code they generate should be considered new or derivative work, with possible implications for license infringements. We showed that ~0.1% to ~10% of the predictions generated by DL-based code recommenders are exact copies of instances in the training set (MSR'22 conference).
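Conceptually, checking whether a prediction is "copied" boils down to an exact-match lookup against the training set. The sketch below illustrates the idea with a simple whitespace-normalized index; it is a toy version of the check, not the exact methodology of the MSR'22 study.

```python
def normalize(code: str) -> str:
    """Collapse whitespace so layout differences do not hide copies."""
    return " ".join(code.split())

def build_index(training_snippets) -> set:
    """Hash every normalized training snippet for O(1) membership tests."""
    return {normalize(s) for s in training_snippets}

def is_exact_copy(prediction: str, index: set) -> bool:
    return normalize(prediction) in index

training = ["int sum = a + b;", "return x * x;"]
index = build_index(training)
print(is_exact_copy("int sum = a +   b;", index))  # True: same tokens, different layout
print(is_exact_copy("int sum = a - b;", index))    # False: not in the training set
```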
We also showed the importance of building a high-quality training set for DL models targeting code generation: feeding the model low-quality code, even in small percentages, results in a major increase in the low-quality code it produces (ICPC'25 conference).
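As an illustration of the kind of filtering such a finding motivates, the sketch below drops snippets from a (Python) training corpus when they fail simple quality checks. The checks shown here are toy heuristics of our own, not the criteria used in the ICPC'25 study.

```python
import ast

def looks_reasonable(snippet: str, max_line_len: int = 120) -> bool:
    """Reject snippets that do not parse or contain overly long lines."""
    try:
        ast.parse(snippet)  # must at least be syntactically valid Python
    except SyntaxError:
        return False
    return all(len(line) <= max_line_len for line in snippet.splitlines())

corpus = [
    "def ok():\n    return 1\n",
    "def broken(:\n    pass\n",  # syntactically invalid: gets filtered out
]
clean_corpus = [s for s in corpus if looks_reasonable(s)]
print(len(clean_corpus))  # 1
```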
*Automating code-related tasks*
There are several tasks revolving around software writing. In DEVINTA we targeted their (partial) automation with the goal of saving software developers time. We focused on the usage of large pre-trained DL models. To better understand the idea behind these models, let's assume we are interested in training a DL model able to translate from Italian to English. The training would usually require providing the model with several examples of Italian sentences translated to English. Creating such a training dataset requires manual effort; for this reason, these datasets are usually limited in size, with consequences on the model's performance. The idea behind pre-training is to first "teach" the model basic features of the languages of interest without the need for a manually built dataset. For example, the model is given sentences in the language of interest (e.g., Italian sentences) with specific words masked, and it is required to guess the masked words. Only after pre-training is the model "fine-tuned" to learn the specific task of interest (in our example, the translation task). We showed that pre-training substantially boosts performance in the automation of several code-related tasks (e.g., bug-fixing), with results published at the ICSE'21, ICSME'21, ICSE'22, ICSE'23, ICSE'24, and ICPC'24 conferences and in the TSE, JSS, and EMSE journals. We also documented 45 tasks that developers automate via DL, presenting our findings at MSR'24 and receiving a Distinguished Paper Award.
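The masking step described above can be illustrated in a few lines of code. The sketch below is a toy, word-level version (real pre-training operates on subword tokens at scale) that turns a raw sentence into a (masked input, target tokens) pair from which a model can learn.

```python
import random

MASK = "<MASK>"

def mask_tokens(sentence: str, mask_prob: float = 0.15, seed: int = 0):
    """Return a masked copy of the sentence plus the hidden target tokens."""
    rng = random.Random(seed)
    masked, targets = [], []
    for token in sentence.split():
        if rng.random() < mask_prob:
            masked.append(MASK)    # hide the token from the model...
            targets.append(token)  # ...and ask it to predict it back
        else:
            masked.append(token)
    return " ".join(masked), targets

inp, tgt = mask_tokens("il gatto dorme sul divano", mask_prob=0.3, seed=1)
print(inp)  # "<MASK> gatto dorme <MASK> divano" (with this seed)
print(tgt)  # ["il", "sul"]
```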
*Online code review*
Code review is the process of analyzing source code written by a teammate to judge whether it is of sufficient quality to be integrated into the software project. We presented the first approach in the literature that takes as input previously unseen code and recommends code changes as a reviewer would do. These findings are detailed in works presented at the ICSE'21 and ICSE'22 conferences, and have been quite impactful, resulting in many follow-up works on the same topic. Finally, we ran a controlled experiment with developers to assess the extent to which AI-based code review actually helps them find more quality issues. We found that, while the AI is able to find quality issues, it also impacts the developers' behavior: developers experience a tunnel-vision effect, focusing only on the parts of the code commented on by the AI and missing quality issues in other parts of the code (ICSE'25 conference).
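In terms of interface, such a recommender behaves like a translator from code to reviewer remarks. The sketch below shows how a generic sequence-to-sequence model could be queried for a review comment; the checkpoint name is a placeholder, not a model released with our papers.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "some-org/code-review-model"  # hypothetical placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Previously unseen code submitted for review.
code = 'public void save(String data) throws IOException { new FileWriter("out.txt").write(data); }'

inputs = tokenizer(code, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=64)
# A trained model would emit a reviewer-style remark, e.g. pointing out
# that the FileWriter is never closed.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```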
Concerning the automation of code-related tasks, we showed how large pre-trained models can substantially boost performance across a variety of tasks.
DEVINTA, with our ICSE'21 and ICSE'22 papers on code review, started a research thread on the automation of non-trivial code review activities, such as the automated reporting of issues in code components via natural language sentences, as human reviewers would do. Since then, several research groups have joined this thread and major steps in this direction have been made.
The final major contribution of DEVINTA has been one of the very first studies showing changes in developers' behavior resulting from the use of an AI assistant (see the ICSE'25 paper on the impact of AI-based code review).