Periodic Reporting for period 2 - DECODER (DEveloper COmpanion for Documented and annotatEd code Reference)
Reporting period: 2020-01-01 to 2021-12-31
The needs of all stakeholders across the whole software development life-cycle vary significantly from modelling to implementation to validation to maintenance. Moreover, information is typically scattered across many documents, many of them in natural language, making it difficult to process and to relate to the relevant parts of the code itself. This lack of traceability is a major impediment to software development, review and maintenance. For instance, it is sometimes difficult to assess whether a function fulfills its intended behavior as described in an ambiguous and/or incomplete English text. On the other hand, manually writing a fully formal specification is extremely tedious, especially for large software. Similarly, when facing maintenance tasks on an unfamiliar piece of software, it can be hard to tell whether a comment adequately reflects what the code actually does. Still at the maintenance level, from the mere description of a vulnerability discovered in a third-party dependency of the project at hand, deciding whether this vulnerability might have an impact on the product and, if so, how critical it is, is usually not an obvious task. In order to improve this situation, DECODER proposes to use Natural Language Processing (NLP) techniques for extracting information from informal project-related documents and finding correspondences with the code artifacts and/or the associated formal specifications.
In addition, a semi-formal description language (ASFM, standing for Abstract Semi-Formal Model) is designed, together with an abstract graphical tool for animating models. ASFM models are generated either from informal documents or from the code. The user can then interactively refine the models. Once satisfied with the refinements, they can derive a fully formal specification and use formal methods for verifying the software, or generate test scenarios that cover all possible behaviors exhibited by the model.
Finally, the PKM (Persistent Knowledge Monitor) and the associated tools developed within DECODER are assessed on several real-world use-cases in order to ensure that they are up to the task. Feedback from the use-case partners is taken into account throughout the project to improve the tools.
Based on the PKM meta-model, the PKM server and its API were designed and implemented on top of MongoDB, a document-oriented database. The preferred way of interacting with the database is through a RESTful server, which exposes a rich set of requests specified in the OpenAPI format.
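For illustration, the sketch below shows how a client could query such a REST server from Python; the endpoint path, query parameters and response fields are hypothetical placeholders, not the actual PKM OpenAPI specification.

```python
# Hypothetical sketch of querying a PKM-style REST server; the endpoint path,
# query parameters and response fields are illustrative assumptions, not the
# actual DECODER PKM API.
import requests

PKM_URL = "http://localhost:8080"   # assumed server address
PROJECT = "my-project"              # assumed project identifier

# Retrieve the annotations attached to a given source file (hypothetical route).
response = requests.get(
    f"{PKM_URL}/projects/{PROJECT}/annotations",
    params={"file": "src/main.c"},
    headers={"Accept": "application/json"},
)
response.raise_for_status()

for annotation in response.json():
    # Each document in the underlying MongoDB store is returned as JSON.
    print(annotation.get("kind"), "->", annotation.get("text"))
```

Since the requests are described in OpenAPI, client code of this kind can also be generated directly from the specification.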
A prototype of the graphical user interface of the PKM client was designed and implemented, acting as the main user entry point to the PKM. Moreover, an orchestrator was developed that takes care of making the relevant calls to the PKM server and to the individual tools, providing a smooth user experience for the most common usages of the platform. In order to facilitate its insertion into existing software development chains, the PKM server can be synchronized with a git server, ensuring that the code artifacts stored in the PKM are always up to date.
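The sketch below illustrates, under assumed conventions, what one step of such a synchronization could look like: the files touched by the latest commit are pushed to a hypothetical PKM upload route. Neither the route nor the payload reflects the actual orchestrator implementation.

```python
# Illustrative synchronization step: push the source files changed by the latest
# commit to a hypothetical PKM upload route. Route and payload are assumptions.
import os
import subprocess
import requests

PKM_URL = "http://localhost:8080"   # assumed server address
PROJECT = "my-project"              # assumed project identifier

# Ask git which files the latest commit touched.
changed = subprocess.run(
    ["git", "diff-tree", "--no-commit-id", "--name-only", "-r", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

for path in changed:
    if not os.path.isfile(path):
        continue  # skip files deleted by the commit
    with open(path, encoding="utf-8", errors="replace") as f:
        source = f.read()
    # Store (or update) the artifact so that tools always see the current code.
    requests.put(
        f"{PKM_URL}/projects/{PROJECT}/code/{path}",
        json={"path": path, "content": source},
    ).raise_for_status()
```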
Regarding another aspect of the project, a first version of the ASFM language was designed. ASFM takes the PKM meta-model as its reference and is defined as a JSON schema that closely follows the PKM API, so as to ease the generation of ASFM models from requests to the PKM server. ASFM models can be extracted from standard .doc or .pdf documents, and they are the basis of the FormalDebug tool, which graphically represents the data structures of the program and can be manipulated interactively to better understand the invariants in the code.
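Purely for illustration, the following snippet shows the general JSON-schema mechanism on a toy model; the schema and its field names are invented for the example and do not reproduce the real ASFM definition.

```python
# Toy example of the JSON-schema approach: validate a small model fragment
# against a schema. The schema and field names are invented for illustration
# and are not the actual ASFM definition.
from jsonschema import validate

toy_schema = {
    "type": "object",
    "required": ["name", "states", "transitions"],
    "properties": {
        "name": {"type": "string"},
        "states": {"type": "array", "items": {"type": "string"}},
        "transitions": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["from", "to", "event"],
                "properties": {
                    "from": {"type": "string"},
                    "to": {"type": "string"},
                    "event": {"type": "string"},
                },
            },
        },
    },
}

toy_model = {
    "name": "door",
    "states": ["closed", "open"],
    "transitions": [{"from": "closed", "to": "open", "event": "open_door"}],
}

validate(instance=toy_model, schema=toy_schema)  # raises ValidationError on mismatch
print("model conforms to the schema")
```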
Another work stream concerned NLP and the connections between code and informal documents. Code summarization tools were applied to track potentially misused identifiers and suggest fixes. The identification of correspondences between documentation and code using machine translation techniques also progressed well. All NLP tools were integrated within the PKM. It is now possible to visualize the traceability matrix between high-level requirements (in English) and the relevant code. Finally, some work was devoted to semantically analyzing CVE messages and highlighting their main characteristics, in order to help users assess whether their code is impacted by a given vulnerability.
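As a generic illustration of the idea (not the project's actual NLP pipeline), a simple lexical baseline for such a traceability matrix could be computed with TF-IDF cosine similarity:

```python
# Generic baseline for a requirement-to-code traceability matrix using TF-IDF
# cosine similarity; the DECODER tools rely on more advanced NLP, this is only
# a sketch with invented requirements and code units.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

requirements = [
    "The system shall reject passwords shorter than 8 characters.",
    "The system shall log every failed authentication attempt.",
]
code_units = {
    "check_password": "def check_password(pwd): return len(pwd) >= 8",
    "log_failure": "def log_failure(user): logger.warning('auth failed for %s', user)",
}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(requirements + list(code_units.values()))
# Rows: requirements; columns: code units. Higher score = stronger lexical link.
scores = cosine_similarity(matrix[: len(requirements)], matrix[len(requirements):])

for i, req in enumerate(requirements):
    for j, name in enumerate(code_units):
        print(f"req {i} <-> {name}: {scores[i, j]:.2f}")
```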
Finally, the selection of use-cases was refined, ensuring that they are representative of a large class of applications and that they can lead to a wide spectrum of activities involving the PKM and the other DECODER tools. All use-cases were analyzed with the tools developed within the project and the results are stored in the PKM server.
To conclude, DECODER’s results can be extended in various ways, which could be the subject of future collaborative projects. This includes a tighter integration with git servers to let individual tools exploit the correspondences between successive versions of the code base, enhancing the results of the NLP tools when it comes to generating formal specifications, and extending ASFM to facilitate formal reasoning with graphical representations of structures.