
A federated FAIR platform enabling large-scale analysis of high-value cohort data connecting Europe and Canada in personalized health

Periodic Reporting for period 1 - EUCAN-Connect (A federated FAIR platform enabling large-scale analysis of high-value cohort data connecting Europe and Canada in personalized health)

Reporting period: 2019-01-01 to 2020-06-30

EUCAN-Connect is motivated by the promise of personalized prevention and healthcare using rich phenotypic, environmental and molecular (omics) profiles of every individual. To fulfill this promise, researchers need access to data covering the full range of lifestyle, demography, laboratory measures, omics and clinical parameters for many individuals. Collecting these essential data takes many years and costs large sums of money. Fortunately, Europe and Canada have many datasets built up from a strong tradition of population-based prospective cohort studies (there are 668 organisations with 2467 collections in BBMRI-ERIC (EU), and 182 in the Maelstrom Catalogue (CA)).

However, the present barrier to capitalizing on this richness is that these sensitive data are locked in local repositories, lack universally compatible data standards and are difficult to share because of national/local privacy protection and data security requirements. Yet integrated analysis is essential to reach the statistical power needed to elucidate the complex relationships between genetic traits, environment and diseases and to benefit from the content, temporal and geographic diversity in these cohorts.

Therefore, the overall objective of EUCAN-Connect is to enable large-scale integrated multi-cohort data analysis for personalized prevention and healthcare. It pursues this by enhancing long-term collaboration between European and Canadian cohorts and research networks; by maturing and standardising federated cohort metadata sharing, data deposition and access, data curation/harmonization, and exchange procedures; and by facilitating pooled analysis of high-value cohort data on environmental factors and omics measures that affect health over the human life course.
The project started by synergizing the efforts of existing European and Canadian cohort networks (ReACH, LifeCycle, InterConnect, RECAP, BioSHaRE, and BBMRI-LPC), integrating powerful existing approaches and solutions from Maelstrom, BBMRI, DataSHIELD, Obiba, MOLGENIS and others, and engaging many research infrastructures, including EOSC, EGA, ELIXIR, CORBEL, IHEC, IHMC, GA4GH, P3G, and EMIF. Since the start of the project, two new cohort exposome networks have joined the EUCAN-Connect community: ATHLETE and LongITools.
The core concept of EUCAN-Connect is to make cohort data available in federated analysis networks, i.e. enabling data access and analysis without physically sharing data, so that no privacy-protected data can be accessed. Crucially, EUCAN-Connect adheres to the FAIR principles: data should be Findable, Accessible, Interoperable and Reusable. All of this is of little value unless it is widely used, adopted and sustained in long-term collaborations. Progress on each objective is listed below:
Make FINDABLE: enable cohort data discovery down to data item and subject levels. EUCAN-Connect is bridging existing major efforts from BBMRI and Maelstrom, as well as specific catalogues from cohort networks. We have compared all existing models, collaboratively defined a first reference data model, and started development of the federation of catalogues.

Make ACCESSIBLE: deliver a low-maintenance open data access and process architecture. We expanded the DataSHIELD platform for federated analysis; developed a new method to ease installation and data loading by the cohorts via a Docker-based system called Coral; created a separate data API, implemented in MOLGENIS ‘Armadillo’, that allows other software to be connected; and expanded DataSHIELD to allow access to large ‘resources’, opening exciting prospects for *omics data analysis via DataSHIELD.

Make INTEROPERABLE: accelerate data harmonisation, retrospectively mapping cohort data to standard variables to enable pooled analysis. We focussed on harmonising this process between cohort networks, i.e. how to best 1) explore the study-specific data and samples available; 2) evaluate harmonization potential across studies; 3) process study-specific data under a common (i.e. harmonized) format; 4) estimate the quality of the harmonized data generated; and 5) generate the information required to achieve data analysis and properly interpret results including supportive tools.

Make REUSABLE: developing DataSHIELD bioinformatics toolboxes and federated analysis methodologies. This includes successful release of DataSHIELD v5.0 and v6.0; implementation of continuous testing for all functions; systems for interacting with, training and supporting DataSHIELD users (website and forum); and two community meetings and multiple workshops.
Make COLLABORATIONS: promote uptake by the research community at large. We created demonstrator projects and engaged existing analysis projects in LifeCycle, RECAP, ReACH and InterConnect. In particular, we focus on 1) longitudinal life course analyses from early life onwards; 2) (epi)genomic origins, microbiome and virome adaptations; 3) early-life exposome-related risk factors; and 4) personalized prevention strategies, related to cardio-metabolic, respiratory, musculoskeletal and developmental health and disease.
Make SUSTAINABLE: ethical and legal governance and extending capabilities beyond the reach of this project. We started developing an appropriate ethical and legal governance framework; made substantive progress in evidencing expectations in EUCAN-Connect’s stakeholder community; and established a governance advisory group and an ELSI expertise forum to help guide EUCAN-Connect and establish its long-term sustainability.
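The federated analysis idea behind the ACCESSIBLE and REUSABLE objectives can be illustrated with a minimal Python sketch. It is not DataSHIELD code and does not use the DataSHIELD, Armadillo or Coral APIs; the names `Aggregate`, `local_aggregate`, `pooled_mean_variance` and the threshold `MIN_SUBJECTS` are hypothetical and chosen only for illustration. Each cohort site computes non-disclosive summary statistics locally, and only those summaries are combined centrally.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Hypothetical disclosure threshold for this sketch (not a DataSHIELD setting).
MIN_SUBJECTS = 5


@dataclass(frozen=True)
class Aggregate:
    """Summary statistics that are allowed to leave a cohort site."""
    n: int
    total: float
    total_sq: float


def local_aggregate(values: Sequence[float]) -> Aggregate:
    """Runs at the cohort site; individual-level values never leave it."""
    if len(values) < MIN_SUBJECTS:
        raise ValueError("too few subjects to release an aggregate")
    return Aggregate(len(values), sum(values), sum(v * v for v in values))


def pooled_mean_variance(parts: Sequence[Aggregate]) -> Tuple[float, float]:
    """Runs at the analyst side; combines per-cohort aggregates only."""
    n = sum(p.n for p in parts)
    total = sum(p.total for p in parts)
    total_sq = sum(p.total_sq for p in parts)
    mean = total / n
    # Sample variance from pooled sums: (sum(x^2) - n * mean^2) / (n - 1)
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


# Usage: two made-up cohorts; the analyst only ever sees the aggregates.
cohort_a = local_aggregate([1.0, 2.0, 3.0, 4.0, 5.0])
cohort_b = local_aggregate([6.0, 7.0, 8.0, 9.0, 10.0])
mean, variance = pooled_mean_variance([cohort_a, cohort_b])
```

The pooled result equals what would be obtained from the combined individual-level data, which is the property that makes this style of analysis useful when data cannot be physically shared.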
Key advances of EUCAN-Connect include: a sustainable and long-lived European-Canadian cohort meta-network; a one-stop shop to search across existing cohorts and networks worldwide; mainstreaming of data harmonization protocols; mainstreaming of federated bioinformatics analysis toolboxes; incorporating omics, in particular epigenetics and microbiome; harmonizing novel markers of life stressors across European and Canadian cohorts; building a large support base across a broad range of stakeholder communities; and enabling large-scale analysis of cohorts to develop/fine-map personalised risk prediction models. Expected impacts are uncommonly large analyses across tens to hundreds of cohorts (11 cohorts as partners, n=298,645, and at least 175 cohorts in partner networks, n=2,494,885) to research and develop strategies for identifying groups and individuals at risk through stratification and personalized prediction models based on differences in markers for (early-)life stressors. Thus, EUCAN-Connect will provide opportunities to translate research findings into policy recommendations that can address key health and social-care challenges and improve the lives of European and Canadian citizens.