CORDIS - EU research results

R4R: Reproducible Data Analyses for All

Periodic Reporting for period 1 - R4R (R4R: Reproducible Data Analyses for All)

Reporting period: 2024-01-01 to 2025-06-30

Unevaluated science is not worth funding. Gone are the days when a scientific breakthrough could be based on scribbles made on a
few loose sheets of paper reviewed by a single attentive reader. Most disciplines rely on experimental data that is collected, analyzed,
and presented using powerful computational tools. The scientific adventure hinges on our ability to openly and widely share and
reproduce such results.

The goal of this proof of concept (PoC) is to bring to market a tool, R4R, that lets non-programmer scientists make their archival work
easily reproducible, and to offer it to them under an inexpensive licence. Affordable reproducibility is key to independent evaluation of
previously published results. We focus on the reproducibility of data analysis pipelines written in R with R Markdown or Jupyter.
Creating a reproducible environment is hard, labor-intensive, and error-prone, and it requires expertise that data analysts often lack.
We propose to use dynamic program analysis techniques to track the dependencies, data inputs, and sources of non-determinism that
must be captured for reproducibility. R4R will synthesize this metadata into self-contained, portable, fully reproducible environments
based on Docker images.

We have developed r4r, a tool for automated environment reconstruction based on dynamic program analysis. It builds a self-contained, shareable, user-inspectable, reusable artifact in the form of a Docker image. r4r can make an R notebook reproducible, and it comes with a command-line interface, a graphical user interface, and a plugin for RStudio.
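
To make the idea of turning recorded metadata into a Docker-based environment more concrete, here is a minimal sketch in Python. It is not r4r's implementation or output format: the `TraceResult` structure, the `render_dockerfile` function, the file names, and the version numbers are all hypothetical and chosen only for illustration. The sketch assumes that a traced run has already yielded the R version, the R packages with their versions, the system libraries, and the data files that a notebook used, and it pins those in a Dockerfile built on a `rocker/r-ver` base image using `remotes::install_version`.

```python
from dataclasses import dataclass, field


@dataclass
class TraceResult:
    """Hypothetical summary of what one traced notebook run touched."""
    r_version: str
    packages: dict[str, str]  # R package name -> version observed at run time
    system_libs: list[str] = field(default_factory=list)  # Debian package names
    data_files: list[str] = field(default_factory=list)   # paths relative to the project
    notebook: str = "analysis.Rmd"


def render_dockerfile(trace: TraceResult) -> str:
    """Render a Dockerfile that pins everything recorded in the trace."""
    # Assumption: a versioned R base image from the rocker project.
    lines = [f"FROM rocker/r-ver:{trace.r_version}"]

    if trace.system_libs:
        libs = " ".join(sorted(trace.system_libs))
        lines.append(
            "RUN apt-get update && apt-get install -y --no-install-recommends "
            f"{libs} && rm -rf /var/lib/apt/lists/*"
        )

    # remotes::install_version installs one specific version of a CRAN package.
    lines.append("RUN R -e \"install.packages('remotes')\"")
    for name, version in sorted(trace.packages.items()):
        lines.append(
            f"RUN R -e \"remotes::install_version('{name}', version = '{version}')\""
        )

    # Copy only the files the run actually read, so the image stays self-contained.
    lines.append("WORKDIR /work")
    for path in sorted([trace.notebook, *trace.data_files]):
        lines.append(f"COPY {path} /work/{path}")

    # Assumes rmarkdown and pandoc are among the recorded dependencies.
    lines.append(f"CMD [\"R\", \"-e\", \"rmarkdown::render('{trace.notebook}')\"]")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = TraceResult(
        r_version="4.3.2",
        packages={"rmarkdown": "2.25", "dplyr": "1.1.4"},
        system_libs=["pandoc"],
        data_files=["data/cases.csv"],
    )
    print(render_dockerfile(example))
```

Sorting the packages and files keeps the generated Dockerfile deterministic, so two traces with the same content produce byte-identical text that is easy to inspect and compare.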

We have also collected a large number of R notebooks from GitHub repositories and from the data science competition platform Kaggle, to evaluate how the tool performs on real-world notebooks.
Our tool, R4R, will make it easier for scientists who use the R language and R notebooks to create reproducible data pipelines. R is particularly widespread in bioinformatics and epidemiology, and is also used in economics, public policy, and finance. Publishers, conference organizers, and funding agencies increasingly ask researchers to make available both their data and the data pipelines used in their research, and our tool will help researchers meet those requirements.
Companies that operate in regulated environments with strict certification requirements for their products, such as the pharmaceutical industry, which also relies heavily on R, will benefit from R4R as well.
Checking the reproducibility of R Markdown notebooks
The R4R architecture