Periodic Reporting for period 4 - Skye (A programming language bridging theory and practice for scientific data curation)
Reporting period: 2021-03-01 to 2023-02-28
Currently there are no reusable methodologies or practical tools that support provenance for curated databases, forcing each project to start from scratch. Research on provenance for scientific databases is still at an early stage, and prototypes have so far proven challenging to deploy or evaluate in the field. Also, most techniques to date focus on provenance within a single database, but this is only part of the problem: real solutions will have to integrate database provenance with the multiple tiers of web applications, and no-one has begun to address this challenge.
The Skye project showed how to build support for curation into the programming language itself, building on recent research on the Links Web programming language, including advances in language-integrated query, and on provenance and data curation. Links is a strongly-typed language that provides state-of-the-art support for language-integrated query and Web programming. This project builds on Links and other recent language designs for heterogeneous meta-programming to develop a new language, called Skye, that can express modular, reusable curation and provenance techniques. To keep focus on the real needs of scientific databases, Skye will be evaluated in the context of GtoPdb and other scientific database projects. Bridging the gap between curation research and the practices of scientific database curators will catalyse a virtuous cycle that will increase the pace of breakthrough results from data-driven science.
Skye draws on the best ideas developed in cutting-edge research on language-integrated query, Web programming, and heterogeneous meta-programming. Skye will provide dialects, or first-class client language definitions, along with translations that map programs written in one dialect to another, or (as a special case) that perform source-to-source translation on a single dialect, for optimisation or to add functionality such as provenance-tracking. These translations will be available as libraries that can change the behaviour of already-written applications by rewriting code, so scientific database developers using Skye will be able to reuse these features instead of having to reimplement them from scratch or make wholesale changes to existing applications.
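The idea of a translation that adds provenance-tracking to an already-written query by rewriting it can be sketched as follows. This is a minimal illustration in Python, not Skye or Links code: the mini query language (`Table`, `Select`, `Project`), the `WithProvenance` node, the `add_provenance` and `evaluate` functions, the `_prov` column, and the sample `ligands` data are all hypothetical names and values invented for this example. It tracks only which source row each result row came from, which is a much simpler notion of provenance than the ones studied in the project.

```python
from dataclasses import dataclass


# Hypothetical mini query language (an assumption for illustration only).
@dataclass(frozen=True)
class Table:
    name: str


@dataclass(frozen=True)
class Select:  # keep rows where row[column] == value
    source: object
    column: str
    value: object


@dataclass(frozen=True)
class Project:  # keep only the listed columns
    source: object
    columns: tuple


@dataclass(frozen=True)
class WithProvenance:  # node introduced by the translation, never written by hand
    source: Table


def add_provenance(query):
    """Source-to-source translation: rewrite a query so that every result
    row carries a `_prov` field naming the table and row index it came from."""
    if isinstance(query, Table):
        return WithProvenance(query)
    if isinstance(query, Select):
        return Select(add_provenance(query.source), query.column, query.value)
    if isinstance(query, Project):
        # Projections must keep the provenance column alive.
        return Project(add_provenance(query.source), query.columns + ("_prov",))
    raise TypeError(f"unknown query node: {query!r}")


def evaluate(query, db):
    """Evaluate a query against an in-memory database (dict of row lists)."""
    if isinstance(query, Table):
        return [dict(row) for row in db[query.name]]
    if isinstance(query, WithProvenance):
        name = query.source.name
        return [{**row, "_prov": (name, i)} for i, row in enumerate(db[name])]
    if isinstance(query, Select):
        rows = evaluate(query.source, db)
        return [r for r in rows if r[query.column] == query.value]
    if isinstance(query, Project):
        rows = evaluate(query.source, db)
        return [{c: r[c] for c in query.columns} for r in rows]
    raise TypeError(f"unknown query node: {query!r}")


# Made-up sample data, for illustration only.
db = {
    "ligands": [
        {"name": "aspirin", "type": "synthetic"},
        {"name": "insulin", "type": "peptide"},
        {"name": "ibuprofen", "type": "synthetic"},
    ]
}

# The "already-written" query, unaware of provenance.
query = Project(Select(Table("ligands"), "type", "synthetic"), ("name",))

plain = evaluate(query, db)
tracked = evaluate(add_provenance(query), db)
print(plain)
print(tracked)
```

The original query is left unchanged. Applying `add_provenance` yields a new query whose results contain the same names, each paired with the table and row index it originated from. This is the sense in which a rewriting library could add a feature without wholesale changes to the application.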
- Expressing and optimising client languages. Work in this theme has focused on developing support for provenance in different programming languages. In work reported in PPDP 2016 and a followup SCP 2018 paper we showed how to support provenance in Links via source-to-source translation. We also showed how to adapt this approach to support provenance in the Database-Supported Haskell library (DSH) in a
- Defining modular curation techniques. The main activities regarding this theme have been developing new techniques for provenance and explanation in functional languages with effects (references, exceptions), reported in an ICFP 2017 paper, and developing support in Links for incremental relational lenses (reported in ICFP 2018). In addition, a PhD student (Fehrenbach) working on a related topic has completed a PhD thesis that includes a design for supporting user-definable provenance-tracking in Links, and a publication on these results is in preparation. Toward the end of the project we showed how to support language-integrated queries over temporal data (GPCE 2022), making it very easy to support data archiving and advanced queries over historical data. This has led to followup research proposals including one that has already been funded.
- Case studies. Most of the effort early in the project related to this theme went towards improving the Links implementation to the state of maturity needed for real applications. Subsequently we completed the planned Links reimplementation of the Guide to Pharmacology database, which yielded a number of valuable insights and led to publication in IJDC, one of the main journals on data curation. We also identified a need for data curation and temporal data management in metrology with NPL, the UK's national metrology institute, and have since had further discussions with PTB, Germany's counterpart to NPL. This work has been presented at an international metrology conference and published in one of its main journals. It has also led to a successful followup Royal Society industry fellowship.
- Metaprogramming applications. Most of the work in this theme concentrated on using metaprogramming and other advanced functional programming techniques to automatically search for and synthesize "semantics" for programming languages, in the form of desugaring rules from a source language to a core language. This work was presented at OOPSLA 2021 and subsequently in Bartha's submitted PhD thesis.
- The first workable approach to synthesizing rules describing the behavior of a programming language from examples, using an implementation as a black box (OOPSLA 2021)
- A new approach to language-integrated query that supports mixed set and bag collections, and offers the potential to support many other SQL features (ESOP 2021)
- A workalike reimplementation of a major scientific database in Links, showing significant benefits of language-integrated query (IJDC 2021)
- Extension of Links to support temporal databases and queries (GPCE 2022)