The Epistemology of Data-Intensive Science

Final Report Summary - DATA_SCIENCE (The Epistemology of Data-Intensive Science)

DATA_SCIENCE investigated how research data are produced, processed, disseminated and re-used across a variety of research situations. We examined data practices within six realms of research: (1) re-use of data about model organisms to enable cross-species discovery; (2) collection and standardisation of crop phenotypic data to enable research on food security; (3) visualisation of plant data in maps and data models to enable their interpretation; (4) curation of genomic data to support the diagnosis and treatment of cancer; (5) integration of environmental and health data to study the spread of disease; and (6) storage of sensitive health data to facilitate research in medicine and public health. To this end, we formulated a conceptual and methodological framework for the qualitative analysis of “data journeys”, i.e. of the conditions under which data can be mobilised and re-used across contexts, thus expanding their value as evidence for different research situations. This methodology influenced the emergent field of data studies and has been widely cited by scholars researching data value and use.

These studies of the daily practices, concerns and needs of researchers using data provided crucial insights into how data can travel and be made “open”. We analysed: the role of security and ethical concerns in the strategies used to integrate data; the ways in which labels, models and visualisation tools used by databases affect the interpretation of data and their use as evidence; the obstacles encountered in mobilising data and the significance of situations in which data are missing, absent or inaccessible; the ways in which research communities and institutions can be organised to take advantage of large datasets and related technologies; and the implications of these findings for contemporary debates over the “reproducibility crisis” and the difficulties in evaluating the quality and reliability of data posted online. We thus produced an overarching understanding of how research data can be managed and re-used, which has informed the philosophy, history and social studies of science as well as scientific and policy decision-making concerning data infrastructures. During the project, the PI worked with the European Commission, the European Open Science Cloud, the Research Data Alliance, the Royal Society, the British Academy, national governments and scholarly societies towards developing data infrastructures and implementing Open Science guidelines. She served on the steering committees of many of the data infrastructures studied by the project.

Ultimately, the project succeeded in producing a novel philosophy of data-intensive science, which places data at the centre of scientific inquiry and explains (1) the emergence and impact of data science and “big data” and (2) the implications of these developments for contemporary research. At the core of this view is a relational account of data, which shows how the value of data as evidence depends on the circumstances of their use. This view stands in contrast to the “representational” view of data, which has so far permeated views on inquiry and which cannot explain the successes of data science, big data and open data.

This philosophical account has been presented in four books, including an award-winning monograph and a substantive edited volume documenting data journeys across fields, as well as 35 scientific papers (including invited features in Nature, eLife and the Harvard Data Science Review), 14 book chapters, 17 public outreach articles, 5 open data collections, one pilot database, 16 policy reports and position statements, and 4 authoritative reference works (e.g. the Stanford Encyclopedia of Philosophy entry on “Big Data and Scientific Research”). Through 36 keynote lectures, 96 invited lectures and 51 contributed talks, project results were presented to colleagues in library and information sciences, biology, informatics, data science, statistics, computer engineering, biomedicine, oncology and bibliometrics, thus ensuring dialogue with a wide variety of audiences.