Periodic Reporting for period 1 - R4R (R4R: Reproducible Data Analyses for All)
Reporting period: 2024-01-01 to 2025-06-30
few loose sheets of paper reviewed by a single attentive reader. Most disciplines rely on experimental data that is collected, analyzed,
and presented using powerful computational tools. The scientific adventure hinges on our ability to openly and widely share and
reproduce such results.
The goal of this PoC is to bring to market a tool, R4R, that lets non-programmer scientists make their archival work easily reproducible, and to offer it
to them under an inexpensive licence. Affordable reproducibility is key to the independent evaluation of previously published results.
We focus on reproducibility of data analysis pipelines written in R with RMarkdown or Jupyter. Creating a reproducible
environment is hard, labor-intensive and error-prone, and requires expertise that data analysts lack. We propose to use dynamic
program analysis techniques to track dependencies, data inputs, and other sources of non-determinism needed for reproducibility.
R4R will synthesize metadata to generate self-contained, portable, fully reproducible environments, based on Docker images.
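The last step described above, turning tracked dependencies into a Docker-based environment, can be illustrated with a minimal sketch. This is not R4R's actual implementation or output format: the function name `render_dockerfile`, the input dictionary of package versions, and the choice of the `rocker/r-ver` base image with `remotes::install_version` are assumptions made for illustration only. The sketch also assumes the dependency list has already been produced by some tracing step, which it does not implement.

```python
def render_dockerfile(r_version, packages, notebook):
    """Render Dockerfile text that pins an R version and package versions.

    Hypothetical helper: `packages` maps R package names to the versions
    observed while the notebook ran (the tracing itself is not shown).
    """
    lines = [
        # Assumed base image: a versioned R image from the Rocker project.
        f"FROM rocker/r-ver:{r_version}",
        # `remotes` provides install_version() for pinning package versions.
        "RUN R -e \"install.packages('remotes')\"",
    ]
    # Sort by package name so the generated file is deterministic.
    for name, version in sorted(packages.items()):
        lines.append(
            f"RUN R -e \"remotes::install_version('{name}', version='{version}')\""
        )
    # Copy the notebook into the image next to its pinned environment.
    lines += [
        "WORKDIR /work",
        f"COPY {notebook} /work/{notebook}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    traced = {"ggplot2": "3.4.4", "dplyr": "1.1.4"}  # illustrative values
    print(render_dockerfile("4.3.2", traced, "analysis.Rmd"))
```

A real environment would also need to capture system libraries, data inputs, and sources of non-determinism such as random seeds, which this sketch leaves out.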
We have also collected a large number of R notebooks from GitHub repositories and from the data science competition platform Kaggle to evaluate how the tool performs on real-world notebooks.
Companies that operate in regulated environments with strict certification requirements for their products, such as the pharmaceutical industry, which makes heavy use of R, will also benefit from the R4R tool.