Reinforcement learning to improve proof-automation in theorem proving

Project information

DeepIsaHOL

Grant agreement ID: 101102608

DOI

10.3030/101102608

Project closed

EC signature date 9 March 2023

Start date 1 July 2023

End date 31 October 2025

Funded under

Marie Skłodowska-Curie Actions (MSCA)

Total cost

No data

EU contribution

€ 166 278,72

Coordinated by

CESKE VYSOKE UCENI TECHNICKE V PRAZE
Czechia

Periodic Reporting for period 1 - DeepIsaHOL (Reinforcement learning to improve proof-automation in theorem proving)

Reporting period: 2023-07-01 to 2025-10-31

Context: Interactive theorem provers (ITPs), or proof assistants, are tools that aid in the mechanical validation of mathematical proofs. Due to their high reliability, researchers and engineers use them to develop safe and secure software and hardware or to certify complex mathematical results. Companies like Amazon, Apple, and ARM use ITPs to verify safety-critical systems. The DeepIsaHOL project was set up with the long-term goal of improving the efficiency and accessibility of deductive verification with ITPs, focusing on the Isabelle proof assistant. The central challenge motivating the project was that, despite this reliability, deductive verification is currently slow and costly compared to less reliable quality assurance methods like testing, simulation, or model checking. The premise was that the lack of robust, generic proof-automation methods within ITPs acts as a barrier to wider adoption, which affects both commercial entities and the broader mathematical community. Developing generic automation would accelerate the verification process, making it more cost-effective and enabling greater industrial and academic adoption.

Overall objectives: The main objective of the project was to address the above lack of generic proof-automation methods within ITPs by training a machine learning algorithm to learn proof strategies embedded in a vast ITP library, thereby creating a generic method to automate the Isabelle proving process. The plan consisted of achieving three concrete research objectives:
1. Training a machine learning model that suggests proof methods given an ITP proof state. The training data consisted primarily of data extracted from Isabelle’s Archive of Formal Proofs (AFP), which contains over 298,700 theorems.
2. Creating a proof method in the ITP that seamlessly integrates the model's suggestions into Isabelle. This is the first proof method based solely on a machine-learning model integrated into Isabelle that completes mechanised proofs.
3. Measuring the model's performance on various benchmark problems and comparing it directly to established, powerful methods like Isabelle’s Sledgehammer tool.
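
The interplay between objectives 1 and 2 can be illustrated with a minimal sketch of model-guided proof search. It is not the project's implementation: the names `greedy_prove`, `toy_suggest`, `toy_apply`, and the rule table are hypothetical stand-ins, proof states are plain strings, and the "prover" is a lookup table rather than Isabelle. The sketch only shows the general loop in which a model proposes proof methods for the current proof state and the prover accepts or rejects them.

```python
from typing import Callable, List, Optional

State = str   # hypothetical: a proof state rendered as text
Method = str  # hypothetical: a proof method rendered as text


def greedy_prove(
    goal: State,
    suggest: Callable[[State], List[Method]],
    apply_method: Callable[[State, Method], Optional[State]],
    max_steps: int = 10,
) -> Optional[List[Method]]:
    """Return the methods that close the goal, or None if the search fails."""
    state, proof = goal, []
    for _ in range(max_steps):
        if state == "done":  # toy convention: "done" means no goals remain
            return proof
        for method in suggest(state):
            next_state = apply_method(state, method)
            if next_state is not None:  # first suggestion the prover accepts
                state = next_state
                proof.append(method)
                break
        else:
            return None  # every suggestion was rejected
    return proof if state == "done" else None


# Toy stand-ins for the model and the prover (assumptions, not Isabelle APIs).
TOY_RULES = {
    ("goal: A & B", "intro"): "goal: A",
    ("goal: A", "simp"): "done",
}


def toy_suggest(state: State) -> List[Method]:
    # A real model would rank methods per state; this one is constant.
    return ["auto", "intro", "simp"]


def toy_apply(state: State, method: Method) -> Optional[State]:
    return TOY_RULES.get((state, method))


print(greedy_prove("goal: A & B", toy_suggest, toy_apply))
```

A greedy loop is the simplest choice for illustration; it never backtracks, so a real integration would likely need a search strategy that can revisit earlier states.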

Pathway to impact: The project was highly interdisciplinary as it combined machine learning (ML) and formal methods, two relatively distant areas of computer science. It was innovative because it resulted in the first machine learning based proof method fully integrated into the Isabelle proof assistant. The project’s expected impacts are significant and multi-faceted, reaching scientific, societal, and economic domains:
- Scientific & AITP Community: The training algorithms for the project's models are expected to serve as an extensible basis for other researchers and represent a new data point showcasing the generality of machine learning for theorem proving while also exposing its limitations. The project's evaluation algorithms can also serve as a basis for testing the ITP's proof methods.
- Verification Community (Engineering/Industry): The project's proof methods are a basis for bringing more powerful machine learning methods to the verification community. The tools that eventually evolve from these kinds of proof methods are expected to be used frequently for creating safe and secure software or hardware. This is particularly relevant for the formal verification of safety-critical systems, such as cyber-physical systems. In time, tech companies that employ proof engineers are expected to see a corresponding increase in productivity.
- Mathematical Community: The project is a step towards accelerating the certification of new mathematical results. If future machine learning based proof methods can prove 30% of the project's benchmark, they are likely to be adopted by most users, including many future mathematicians.
- Broader impact: Overall, the project was a stepping stone towards the widespread adoption of formal proofs, which is a significant opportunity to democratise science. Formal proofs enable anyone, regardless of their gender, race, religion, or age, to participate equally and indisputably in scientific endeavours.

Activities performed:
1. Reproduction: The project trained small models on several simple machine learning tasks, reproducing well-known results in order to become acquainted with the most popular machine learning libraries.
2. Data extraction: The implementation of the Isabelle proof assistant was studied, and key parts of its source code were used to create an algorithm that mines relevant data for training machine learning models.
3. The project reused the Scala and Python programming languages' read-eval-print-loops (REPLs) to interface the Isabelle proof assistant with the most popular libraries for training machine learning models.
4. Training of large language models: The project trained various machine learning models (based on the transformer architecture) on the generated data. The models were trained to predict the next token in proofs modelled as sequences of user actions (strings).
5. Evaluation of the models: An evaluation algorithm that interfaces the models with the Isabelle ITP was implemented. This enabled the models to automatically prove theorems. The best model was able to prove 22% of a test set of approximately 38,500 lemmas.
6. Implementation of proof methods: The project produced various commands executable from the Isabelle ITP to query the trained models. Additionally, these commands can also connect with general-purpose models hosted on different servers.
7. The project's libraries were used to create tools that sketch the main structure of a proof and that fix common, trivial errors when iteratively developing formal verifications.
8. The machine learning based methods produced in the project were compared with traditional proof automation methods. The results show that the machine learning methods are still at a prototype stage compared with the well-established Isabelle tools, although many improvements to the technology remain possible.
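
Activities 4 and 5 can be made concrete with a small sketch. The report does not describe the project's data format, so the pairing below is an assumption: each proof is treated as a statement followed by a list of user actions (strings), and every prefix becomes a training context whose target is the next action. The helper names `next_action_examples` and `proved_count` are hypothetical. The second helper only spells out the arithmetic behind the reported success rate, taking the approximate test-set size at face value.

```python
from typing import List, Tuple


def next_action_examples(statement: str, actions: List[str]) -> List[Tuple[str, str]]:
    """Turn one proof into (context, next action) pairs.

    The context is the statement plus all actions taken so far, joined by
    spaces. This flat text encoding is an illustrative assumption.
    """
    examples = []
    for i, action in enumerate(actions):
        context = " ".join([statement] + actions[:i])
        examples.append((context, action))
    return examples


def proved_count(percent: int, total: int) -> int:
    """Number of lemmas proved for an integer success percentage."""
    return percent * total // 100


pairs = next_action_examples(
    'lemma "A & B ==> B & A"',
    ["apply (rule conjI)", "apply simp", "apply simp", "done"],
)
for context, target in pairs:
    print(repr(context), "->", repr(target))

# 22% of roughly 38,500 lemmas corresponds to about 8,470 proved lemmas.
print(proved_count(22, 38500))
```

A language model trained on next-token prediction sees the same information token by token; grouping by action just makes the supervision signal easier to inspect.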

Project achievements:
- Published articles
1. Chengsong Tan, Alastair F. Donaldson, Jonathan Julián Huerta y Munive, John Wickerson. The Burden of Proof: Automated Tooling for Rapid Iteration on Large Mechanised Proofs. FormaliSE 2025, Ottawa, Canada, 2025, IEEE/ACM, pp. 34-45. https://doi.org/10.1109/FormaliSE66629.2025.00010
2. Leonardo Lima, Jonathan Julián Huerta y Munive, Dmitriy Traytel. WhyMon: A Runtime Monitoring Tool with Explanations as Verdicts. ATVA 2024. LNCS, vol 15055. Springer, 2025. https://doi.org/10.1007/978-3-031-78750-8_4
3. Leonardo Lima, Jonathan Julián Huerta y Munive, Dmitriy Traytel. Explainable Online Monitoring of Metric First-Order Temporal Logic. TACAS 2024. LNCS, vol 14570. Springer. https://doi.org/10.1007/978-3-031-57246-3_16
4. Jonathan Julián Huerta y Munive, Simon Foster, Mario Gleirscher, Georg Struth, Christian Pardillo Laursen and Thomas Hickman. IsaVODEs: Interactive Verification of Cyber-Physical Systems at Scale. Journal of Automated Reasoning, 68(21), 2024. https://doi.org/10.1007/s10817-024-09709-2

- Tutorial presentations:
1. Verifying Cyber-Physical Systems with IsaVODEs. PLDI conference, Seoul, 2025.
2. Machine Learning for the Isabelle Proof Assistant. CICM conference, Brasilia, 2025.

Results:
- The project produced an algorithm that uses the Isabelle proof assistant to generate datasets from existing formalisations.
- It also trained several (large) language models (LLMs) to predict the next user action and evaluated their capability to prove mathematical statements.
- It created prototypical tools that enable using LLMs for proof automation inside the Isabelle proof assistant.
- It developed a proof-fixing tool that was successfully used in a large verification project. The tool automatically repairs Isabelle formalisations that were originally correct but break as their scope is iteratively extended. The verification might not have been possible (or completed as fast) without the project's tool.

Further research:
- The next steps involve extending the project's outputs into a full reinforcement learning framework that enables the open training of models via such an approach.
- It is also necessary to do a rigorous scientific evaluation and comparison of the different learning and proof automation methods.
- Moreover, further structured optimisations are necessary for several components of the project's algorithms.
