CORDIS - EU research results

Common Infrastructure for National Cohorts in Europe, Canada, and Africa

Periodic Reporting for period 3 - CINECA (Common Infrastructure for National Cohorts in Europe, Canada, and Africa)

Reporting period: 2022-01-01 to 2023-06-30

In the past forty years, large human sample cohorts have emerged from research and national healthcare initiatives. The volume of cohort data keeps growing as personalised medicine programs add to the data generated by research projects, marking a shift from research-funded to healthcare-funded genomics. Access to these cohorts is crucial for positive health impacts, but much of the data needed for research is highly sensitive: data generated in a healthcare context is subject to more stringent information governance than research data and has to adhere to national data regulations and laws.
In genomics, data federation models facilitate the integration of distributed data, addressing the challenges of data heterogeneity, privacy and scalability. CINECA envisions a federated cloud infrastructure enabling global access to genomics and biomolecular data. CINECA partners represent scientific excellence, and include ten diverse cohorts and scientific projects such as the European Genome-phenome Archive (EGA), CanDIG, and H3Africa, constituting together a virtual cohort of 1.4M individuals from population, longitudinal and disease studies.

Key objectives of the CINECA project:
1- Deliver transcontinental security requirements for data access
2- Provide solutions to ELSI requirements where data cannot move outside a legal jurisdiction
3- Provide federated access to genomic data on demand
4- Deliver access to datasets of the scale and completeness needed to address analytical challenges
5- Provide harmonised metadata, based on open global standards, driving variant and sample discovery in a transcontinental virtual cohort of 1.4 million individuals
6- Wide adoption in international personalised medicine projects

The CINECA project has enhanced existing European, Canadian, and African infrastructure for federated cohort discovery and access. This has been evidenced by the creation of a set of federated research and clinical applications, which leverage the enhancements to cohort discovery and access, and are implemented using the latest community standards and workflows to serve as exemplars for future cohort interoperability.
In reporting period 3 (RP3), we made significant progress towards achieving the project objectives as follows:

- Deliver transcontinental security requirements for data access: CINECA WP2 supported ELIXIR and EGA in implementing GA4GH Passport for Authentication and Authorisation Infrastructure (AAI), and was involved in its version update (v1.2), which was used in a pilot implementation for ELIXIR AAI and later for the European LifeScience AAI.

- Provide solutions to ELSI requirements where data cannot move outside a legal jurisdiction: CINECA contributed to the development and adoption of GA4GH standards, including the Data Use Ontology (WP3), htsget (WP4), and GA4GH Passport and AAI (WP2). The technical demonstrators in RP3 by WP4 and WP5 showcased how the federated analysis model can be practically applied to real-world research and clinical applications.

- Provide federated access to genomic data on demand: In RP3, CINECA showed how components developed by WP1-3 could be integrated into end-to-end federated analysis frameworks created by WP4 and WP5. For example, the eQTL use case showed how datasets with harmonised metadata (WP3) can be discovered via GA4GH Beacon (WP1), accessed using LifeScience Login AAI (WP2), and analysed using portable, modular workflows written in Nextflow.

- Deliver access to datasets of the scale and completeness needed to address analytical challenges: WP1 has built a federated platform for cross cohort Discovery and Extended queries, which enables researchers to discover relevant data for their research. We collaborated with the standard-setting organisation GA4GH, as well as with ELIXIR networks in order to build other implementations of Discovery products such as Beacon, ensuring that our work will form part of the global network for the discovery of human cohort data resources.

- Provide harmonised metadata, based on open global standards, driving variant and sample discovery in a transcontinental virtual cohort of 1.4 million individuals: In RP3, GECKO (GEnomics Cohorts Knowledge Ontology) was used to create a set of synthetic cohorts which form the basis for the federated research and clinical application demonstrators from WP4 and WP5.

- Wide adoption in international personalised medicine projects: The International Hundred-K Cohorts Consortium (IHCC), which aims to provide access to hundreds of harmonised cohort data dictionaries across continents, has adopted GECKO as a metadata model for cohort data harmonisation. In addition, a collaboration with the Beyond 1 Million Genomes (B1MG) and Genomic Data Infrastructure (GDI) projects has resulted in a new synthetic dataset for rare diseases, which can be used to develop new tools for accessing and analysing data of interest to the rare disease community.
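
The discovery and access steps described for the eQTL use case can be illustrated with a minimal Python sketch. It is not CINECA code and does not call the GA4GH Beacon or LifeScience AAI APIs: the `Cohort` class, the `discover` and `can_access` functions, the cohort names, the metadata terms, and the user name are all hypothetical stand-ins. The sketch only assumes that discovery returns existence and counts rather than individual-level records, and that access is decided separately for each cohort.

```python
from dataclasses import dataclass, field


@dataclass
class Cohort:
    """Hypothetical stand-in for a cohort described with harmonised metadata."""
    name: str
    jurisdiction: str
    metadata: dict
    n_individuals: int
    approved_users: set = field(default_factory=set)


def discover(cohorts, **criteria):
    """Beacon-style discovery (assumed behaviour): report which cohorts match
    and how many individuals they hold, never individual-level records."""
    return [
        {"cohort": c.name, "exists": True, "count": c.n_individuals}
        for c in cohorts
        if all(c.metadata.get(key) == value for key, value in criteria.items())
    ]


def can_access(cohort, user):
    """Passport-style check (assumed behaviour): each cohort decides access
    on its own, so a discovery hit does not imply permission."""
    return user in cohort.approved_users


# Entirely synthetic example cohorts.
cohorts = [
    Cohort("cohort_eu", "EU", {"assay": "RNA-seq", "tissue": "blood"}, 1200, {"alice"}),
    Cohort("cohort_ca", "CA", {"assay": "RNA-seq", "tissue": "blood"}, 800),
    Cohort("cohort_za", "ZA", {"assay": "WGS", "tissue": "blood"}, 500, {"alice"}),
]

hits = discover(cohorts, assay="RNA-seq", tissue="blood")
by_name = {c.name: c for c in cohorts}
accessible = [h["cohort"] for h in hits if can_access(by_name[h["cohort"]], "alice")]

print(hits)
print(accessible)
```

Keeping discovery and authorisation as separate steps mirrors the division of work described above: a researcher can learn that relevant data exists in a jurisdiction before any access request is made.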

The CINECA dissemination strategy has been updated continually throughout the project. During RP3, CINECA delivered fully remote training and dissemination activities alongside more traditional in-person training following the COVID-19 pandemic. CINECA also engaged with various stakeholders on topics of common interest, including GA4GH (genomics standards development and implementation), IHCC (using GECKO as cohort model), ELIXIR, and EUCAN working groups.
In RP3, CINECA has made the following achievements beyond the current state of the art:

- Federated research use-cases: Federated research addresses the challenge of analysing large genomic datasets from diverse sources without moving the data due to volume and security constraints. WP4 developed a federated analysis framework, exemplified by eQTL and Polygenic Risk Score use cases. This framework demonstrates how harmonised metadata (WP3) can be discovered via GA4GH Beacon (WP1), accessed with LifeScience Login AAI (WP2), and analysed using portable, modular workflows (WP4) for cross-cohort research.

- Federated clinical use-cases: The CINECA project has leveraged the secure data federation infrastructure developed in WPs 1-4 to deliver personalised health applications, such as federated data analytics for a European biobank, a FAIR-compliant federated biomarker discovery service, FAIR- and GDPR-compliant diagnostic services, and evidence-based curation support for care planning.

- LMIC benefits-sharing: The CINECA work plan foresaw long-term outcomes to benefit CINECA's African partners, and these have provided important benefits for UCT. Examples include the deployment of an H3Africa Beacon and analyses evaluating the integration of AAI infrastructures such as GA4GH Passports. The efforts made during this project will extend beyond CINECA to enhance data management capabilities, guarantee secure access to resources, promote collaboration among researchers, and support genomics research endeavours across Africa.

- Impact beyond CINECA: We collaborated with international cross-cohort projects such as the IHCC, Maelstrom resource, the European Joint Programme on Rare Diseases, BBMRI-ERIC, and the Dementia Portal UK to develop the GECKO metadata mapping model, in order to expand its use beyond the CINECA project and ensure its interoperability with existing cohort resources. All of the cohort metadata mappings produced by CINECA remain openly available beyond the end of the project.
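
The idea behind the federated Polygenic Risk Score use case, analysing data where it resides and sharing only aggregates, can be sketched as follows. This is an illustrative assumption about how such an analysis might be organised, not the WP4 workflow itself. The variant identifiers, effect weights, and genotypes are invented, and the score uses the standard weighted sum of allele dosages.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1 or 2) over the scored variants."""
    return sum(weights[v] * d for v, d in dosages.items() if v in weights)


def local_summary(site_genotypes, weights):
    """Runs inside one site: individual scores stay local, and only
    the count, sum and sum of squares are returned."""
    scores = [polygenic_risk_score(g, weights) for g in site_genotypes]
    return {
        "n": len(scores),
        "sum": sum(scores),
        "sum_sq": sum(s * s for s in scores),
    }


def combine(summaries):
    """Runs at the coordinating site: pools per-site aggregates into an
    overall mean and (population) variance."""
    n = sum(s["n"] for s in summaries)
    mean = sum(s["sum"] for s in summaries) / n
    variance = sum(s["sum_sq"] for s in summaries) / n - mean ** 2
    return {"n": n, "mean": mean, "variance": variance}


# Hypothetical effect weights and synthetic genotypes for two sites.
weights = {"rs1": 0.5, "rs2": -0.25}
site_a = [{"rs1": 2, "rs2": 0}, {"rs1": 1, "rs2": 2}]
site_b = [{"rs1": 0, "rs2": 2}]

pooled = combine([local_summary(site_a, weights), local_summary(site_b, weights)])
print(pooled)
```

Because each site returns only three numbers, the pooled mean and variance are the same as if all scores had been collected centrally, while no individual-level record crosses a jurisdictional boundary.
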
CINECA project associated cohorts