
A programming language bridging theory and practice for scientific data curation

Periodic Reporting for period 3 - Skye (A programming language bridging theory and practice for scientific data curation)

Reporting period: 2019-09-01 to 2021-02-28

Science is increasingly data-driven. Scientific research funders now routinely mandate open publication of publicly-funded research data. Safely reusing such data currently requires labour-intensive curation. Provenance recording the history and derivation of the data is critical to reaping the benefits and avoiding the pitfalls of data sharing. There are hundreds of curated scientific databases in biomedicine that need fine-grained provenance; one important example is the IUPHAR/BPS Guide to Pharmacology database (GtoPdb), a pharmacological database developed in Edinburgh.

Currently there are no reusable methodologies or practical tools that support provenance for curated databases, forcing each project to start from scratch. Research on provenance for scientific databases is still at an early stage, and prototypes have so far proven challenging to deploy or evaluate in the field. Also, most techniques to date focus on provenance within a single database, but this is only part of the problem: real solutions will have to integrate database provenance with the multiple tiers of web applications, and no-one has begun to address this challenge.

The Skye project will build support for curation into the programming language itself, building on recent research on the Links Web programming language, including advances in language-integrated query, and on provenance and data curation. Links is a strongly-typed language that provides state-of-the-art support for language-integrated query and Web programming. This project will build on Links and other recent language designs for heterogeneous meta-programming to develop a new language, called Skye, that can express modular, reusable curation and provenance techniques. To keep focus on the real needs of scientific databases, Skye will be evaluated in the context of GtoPdb and other scientific database projects. Bridging the gap between curation research and the practices of scientific database curators will catalyse a virtuous cycle that will increase the pace of breakthrough results from data-driven science.

Skye will draw on the best ideas developed in cutting-edge research on language-integrated query, Web programming, and heterogeneous meta-programming. Skye will provide dialects, or first-class client language definitions, along with translations that map programs written in one dialect to another, or (as a special case) that perform source-to-source translation on a single dialect, for optimisation or to add functionality such as provenance tracking. These translations will be available as libraries that can change the behaviour of already-written applications by rewriting code, so scientific database developers using Skye will be able to reuse these features instead of having to reimplement them from scratch or make wholesale changes to existing applications.
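The idea of a translation that adds provenance tracking by rewriting code can be illustrated with a minimal sketch. It is written in Python rather than Links or Skye, and it is not the project's implementation: the node types (`Table`, `Where`, `Project`, `Annotate`), the functions `add_provenance` and `evaluate`, and the toy `compounds` table are all hypothetical names invented for this example. The sketch assumes a tiny query dialect represented as a syntax tree, and a translation into a slightly larger dialect in which every field carries the table, column and row index it was copied from.

```python
from dataclasses import dataclass


# Source dialect: three query forms (all names here are hypothetical).
@dataclass(frozen=True)
class Table:
    name: str


@dataclass(frozen=True)
class Where:
    source: object
    path: tuple      # path into a row, e.g. ("kind",)
    equals: object


@dataclass(frozen=True)
class Project:
    source: object
    columns: tuple


# Extra node that only the target (provenance-aware) dialect uses.
@dataclass(frozen=True)
class Annotate:
    table: Table


def add_provenance(q):
    """Rewrite a plain query into one whose fields carry their origin."""
    if isinstance(q, Table):
        return Annotate(q)
    if isinstance(q, Where):
        # Fields become {"data": ..., "prov": ...}, so compare the data part.
        return Where(add_provenance(q.source), q.path + ("data",), q.equals)
    if isinstance(q, Project):
        return Project(add_provenance(q.source), q.columns)
    raise TypeError(f"unsupported query node: {q!r}")


def evaluate(q, db):
    """Interpret a query over an in-memory database (dict of row lists)."""
    if isinstance(q, Table):
        return [dict(row) for row in db[q.name]]
    if isinstance(q, Annotate):
        name = q.table.name
        return [
            {col: {"data": val, "prov": (name, col, i)} for col, val in row.items()}
            for i, row in enumerate(db[name])
        ]
    if isinstance(q, Where):
        def lookup(row):
            cur = row
            for key in q.path:
                cur = cur[key]
            return cur
        return [row for row in evaluate(q.source, db) if lookup(row) == q.equals]
    if isinstance(q, Project):
        return [{c: row[c] for c in q.columns} for row in evaluate(q.source, db)]
    raise TypeError(f"unsupported query node: {q!r}")


# Toy data, invented for this example.
db = {
    "compounds": [
        {"name": "A", "kind": "small"},
        {"name": "B", "kind": "large"},
        {"name": "C", "kind": "small"},
    ]
}

query = Project(Where(Table("compounds"), ("kind",), "small"), ("name",))
plain = evaluate(query, db)
tracked = evaluate(add_provenance(query), db)
```

The original `query` is left unchanged; the provenance-aware behaviour comes entirely from applying `add_provenance` before evaluation, which is the sense in which such a translation could be packaged as a reusable library rather than written into each application.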

The Skye project supports a group of PhD students and postdoctoral researchers under the leadership of Dr. Cheney to pursue research on programming language design for integrating Web programming and databases, in aid of scientific data management. Topics for research include:

- Language design: How can we program with first-class client languages (dialects) and translations flexibly and safely?
- Expressing and optimising client languages: How can existing client languages be embedded as dialects and translated to efficient client language code?
- Defining modular curation techniques: How can existing (or new) curation techniques be defined using type-safe translations among dialects?
- Case studies: What are the benefits and costs of using Skye to develop curated scientific databases?

Over the first 2.5 years of the project, progress has been made on each of the four themes above:

- Language design. The main activities regarding this theme have been experiments in reorganizing the existing Links implementation of client languages for databases, for example to disentangle database drivers from the rest of the implementation. In addition, one PhD student has developed a typed intermediate language for Links and focused on the problem of defining extensible datatypes and functions in typed functional programming languages, which is a key ingredient that would enable more first-class extensibility in a range of settings.

- Expressing and optimising client languages. Work in this theme has focused on developing support for provenance in different programming languages. In work reported at PPDP 2016 and in a follow-up SCP 2018 paper, we showed how to support provenance in Links via source-to-source translation. In a 2018 paper, we also showed how to adapt this approach to support provenance in the Database-Supported Haskell library (DSH). Most recently, we have generalized the approach in Links to support user-defined provenance translations via type-directed generic programming features that we now plan to add to Links. We have also begun to simplify and generalize the approach to language-integrated queries in Links, for example to support set and bag semantics simultaneously and to support incomplete information (nulls).

- Defining modular curation techniques. The main activities regarding this theme have been developing new techniques for provenance and explanation in functional languages with effects (references, exceptions), reported in an ICFP 2017 paper, and developing support in Links for incremental relational lenses (reported in ICFP 2018). In addition, a PhD student (Fehrenbach) working on a related topic has completed a PhD thesis that includes a design for supporting user-definable provenance-tracking in Links, and a publication on these results is in preparation.

- Case studies. Most of the effort related to this theme has gone towards improving the Links implementation to a state of maturity needed for real applications. Otherwise, this theme was slowed somewhat by the unavailability of personnel, which we have now remedied. We are currently working on a Links implementation of the GtoPdb system and, in the process, identifying limitations of Links that need to be overcome.

The first half of the project has, by design, focused on building on the existing Links implementation (and on experiments in other languages such as Haskell) to get a feel for what will work, and to decide whether to design an entirely new language during the rest of the project or whether it will suffice to extend Links with a few additional features. Moreover, the first year or so of the project necessarily involved recruiting and upskilling new project staff and PhD students. At this point we are well-positioned to pursue an energetic research agenda integrating the results obtained so far, including the following goals:

- Design and implement an improved core language for Links, making use of prototype features for extensible datatypes and functions and incorporating support for type-directed generic programming
- Modularize the translation stages mapping existing Links embedded languages to core Links
- Extend the Links source language to provide access to the extensibility and generic programming features planned for the core language
- Develop an extensible approach to type inference to facilitate embedding domain-specific languages
- Develop and evaluate support for archiving and provenance tracking in the GtoPdb case study and other case studies