CORDIS - EU research results

Integrated human data repositories for infectious disease-related international cohorts to foster personalized medicine approaches to infectious disease research


Data dictionary for harmonised dataset of key clinical-epidemiological variables from EC-funded COVID-19 (cohort) studies that will share data through the COVID-19 Datahubs

Task 8.2: Reconciliation of clinical-epidemiological data. Lead: UKHD; other beneficiaries and partners involved: UCD, RI-MUHC, Maelstrom.

Within Task 8.2 we will harness the methodology and synergies developed in ReCoDID WP3 to apply best practice for the harmonisation of clinical-epidemiological data and advance the state of the art for the harmonisation of participant-level clinical-epidemiological data in the research response to epidemics. Harmonisation efforts will focus on the EC-funded COVID-19 cohort projects that are collecting primary clinical-epidemiological data, but the methodology will be applicable to a broad spectrum of COVID-19-focused human subject studies. ReCoDID WP3 focuses on applying best practice when available, and developing new methods when best practice is insufficient or has not been specified, for cross-cohort prospective and retrospective harmonisation and analysis. In WP3 we have focused on prospective arbovirus data sets available to the investigators in order to test the methodology. Task 8.2 will leverage this expertise to rapidly extend the findings from harmonisation and methodological best practice developed in the context of arbovirus (Zika and dengue virus) focused cohorts to COVID-19-focused longitudinal studies. Conducting the actual work of reconciliation of data dictionaries and data harmonisation will involve a substantial additional workload for the existing teams. To conduct the work as rapidly as possible while maintaining adherence to the highest standards for the documentation of harmonisation and the quality of the harmonised dataset, we will work closely with the larger inter-consortium harmonisation working group that was convened by the ReCoDID coordinator to ensure cross-fertilisation of ideas from different types of experts and EC Horizon 2020-funded projects with diverse foci.

As mentioned, we will collaborate closely with a partner from Canada, Maelstrom Research, associated with McGill University, to which we were introduced through the EUCAN cross-consortia working group. The Maelstrom team brings essential experience to guide the process of applying best practice to the documentation of complex harmonisation decisions. Our pre-existing work to launch and manage the cross-consortia harmonisation working group and our close ties to Maelstrom will enable us to rapidly assemble the diverse expertise and manage the cross-national collaborative workforce necessary to achieve the goals of this project.

Well-characterised metadata is a fundamental component of FAIRification of existing data. Detailed study metadata will be collected from studies that have agreed to share participant-level data to the COVID-19 data hubs. The metadata will capture specific information on COVID-19 diagnostics and how the diagnostics varied over time, and the detailed metadata survey will be informed by the laboratory and field data collection protocols from studies that will contribute data to the ReCoDID platform and COVID-19 Data Hubs. The metadata survey will produce a detailed dataset that is specific to studies that share participant-level data to the ReCoDID platform and COVID-19 Data Hubs. Agreed outputs of harmonisation at the level of cohort metadata will be captured into the cohort browser as additional searchable annotations to allow discovery across cohorts.
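As a rough illustration of what reconciling data dictionaries with documented harmonisation decisions can look like in practice, the sketch below maps cohort-specific variables onto one harmonised variable and stores the rationale alongside each rule. The cohort names, variable names, coding schemes and the 38.0 °C threshold are all hypothetical and are not taken from the project's data dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HarmonisationRule:
    """Maps one cohort-specific variable onto a harmonised target variable."""
    cohort: str
    source_variable: str
    target_variable: str
    transform: Callable[[object], Optional[object]]
    rationale: str  # the documented harmonisation decision


def harmonise(record: dict, cohort: str, rules: list) -> dict:
    """Apply every rule defined for `cohort` to one participant record."""
    out = {}
    for rule in rules:
        if rule.cohort != cohort or rule.source_variable not in record:
            continue
        out[rule.target_variable] = rule.transform(record[rule.source_variable])
    return out


# Hypothetical example: two cohorts that record fever differently.
RULES = [
    HarmonisationRule(
        "cohort_a", "temp_c", "fever",
        lambda t: None if t is None else t >= 38.0,  # assumed threshold
        "Fever defined as measured temperature >= 38.0 C.",
    ),
    HarmonisationRule(
        "cohort_b", "fever_yn", "fever",
        lambda v: {"Y": True, "N": False}.get(v),  # unknown codes become missing
        "Self-reported fever; unrecognised codes map to missing.",
    ),
]

print(harmonise({"temp_c": 38.4}, "cohort_a", RULES))   # {'fever': True}
print(harmonise({"fever_yn": "N"}, "cohort_b", RULES))  # {'fever': False}
```

Keeping the rationale next to the transformation is one simple way to make each harmonisation decision auditable; a real data dictionary would carry considerably more metadata per variable.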

Optimised mapping of seroprevalence data according to time, targeted populations and categorised assays

Task 8.4: Harmonisation of seroprevalence data for COVID-19 and generation of a seroprevalence map of Europe. Lead: AMU; other partners involved: UCD, UKHD, EMBL.

Seroprevalence studies represent a major epidemiological and public health tool. They allow evaluating the attack rate in general or specific populations and estimating the immune status of these populations, and thus their vulnerability to further spread of the outbreak. In particular, data from such surveys are needed over time to track the immune status of populations and to determine if and when the prevalence of antibody-positive individuals is reaching a point at which herd immunity can be anticipated. Before reaching herd immunity, the data can also be used to calibrate mathematical models and inform public health officials about transmission dynamics. The methodology of assays has not been harmonised, nor completely optimised. Different proteins (mainly the envelope and the nucleoprotein, the latter being more conserved among coronaviruses) and different classes of antibodies (IgG, IgA, IgM) are targeted by ELISA tests. Neutralisation assays infrequently use the reference PRNT technique, which is poorly adapted to large series and cannot be automated. When virus neutralisation tests (VNT) are used in a 96-well format, the amount of virus varies among investigators (most frequently in the range of 50-100 TCID50 per assay), which produces different neutralising titres. In addition, different pseudo-neutralisation assays have been implemented. This is convenient since the assay can be performed outside a BSL-3 laboratory. However, the density of envelope proteins on the surface of pseudoviruses is usually much lower than that observed in wild-type virions. Consequently, low amounts of antibodies can neutralise such pseudoviruses, and pseudo-neutralisation techniques produce much higher neutralising titres than traditional techniques.

Other important sources of heterogeneity between serological surveys are the lack of comparative data between the multitude of both commercial (N 230 on the FIND website) and in-house assays, and the differences between studies with regard to the design and the choice of the target populations. Thus, seroprevalence studies may provide a constellation of unrelated pictures in the absence of harmonisation efforts. It is unrealistic at this stage to propose a full standardisation of assays. Rather, what is currently required is (i) the implementation of a comparator panel that would link the different seroprevalence surveys and would allow proposing a sound dynamic map of seroprevalence across Europe; (ii) an expert analysis of epidemiological and biological study designs to provide skills and scientific content for optimising data sharing and data comparability.

Specific subtasks:
Subtask 8.4.1: Situational analysis, to establish (in conjunction with the WHO Solidarity II serology consortium and from other sources, if publicly available) a mapping of serosurvey efforts in Europe, including the technical details of testing and study design.
Subtask 8.4.2: Toolbox for sound comparison of independently produced seroprevalence data. This includes a literature analysis to optimise the characterisation of testing methods (i.e. sensitivity, specificity, antigens, cross-reactivity, etc.), reconciling algorithms for heterogeneous seroprevalence data, and a roadmap for comparative analysis.
Subtask 8.4.3: Networking for implementation of comparator panels for improvement of standardisation. We will contact potential partners (Eur. Blood Alliance, Eur. Virus Archive, WHO), using synergies with existing projects (see below), to promote the implementation of comparator panels to improve the comparability of related assays according to antigen and assay type.

Synergies: The partners are members of the WHO Solidarity II serology consortium, which encompasses a large number of ongoing and planned serosurveys in Europe and globally. At
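One well-known way in which assay sensitivity and specificity enter the comparison of seroprevalence estimates is the Rogan-Gladen correction, which converts the apparent (test-positive) proportion into an estimate of true prevalence. The sketch below is only an illustration of that standard formula; it is not stated in the source as the reconciling algorithm the task will use, and the survey figures are invented.

```python
def rogan_gladen(apparent_prevalence: float,
                 sensitivity: float,
                 specificity: float) -> float:
    """Adjust an apparent seroprevalence for imperfect assay performance.

    true_prev = (apparent + Sp - 1) / (Se + Sp - 1), clipped to [0, 1].
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("assay must perform better than chance (Se + Sp > 1)")
    adjusted = (apparent_prevalence + specificity - 1.0) / denom
    return min(1.0, max(0.0, adjusted))


# Hypothetical surveys: (apparent prevalence, sensitivity, specificity).
surveys = {
    "survey_elisa": (0.060, 0.90, 0.980),
    "survey_rapid": (0.050, 0.80, 0.995),
}
for name, (p, se, sp) in surveys.items():
    print(name, round(rogan_gladen(p, se, sp), 4))
```

The correction only addresses test accuracy; differences in antigen target, sampling design and target population still have to be reconciled separately.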

Report on stakeholder conference 1

Task 6.2.1: Stakeholder Conference 1: Cohort initiatives (M15). Here we will continue the work carried out under task 2.1 in WP2, where the input from different cohort initiatives is elicited via a structured online process. The rationale of this conference is to convene major cohort initiatives after having a structured feedback process about the aims of this project, and to involve the cohort investigators in the next step of discussion about the aims of the project. The stakeholder panel plus the cohort initiatives will be invited to this conference.

Statistical approaches to data integration

Task 4.7: Leveraging high-dimensional data for improved predictions and participant-level inference in the context of a restricted N.

The retrieval and linkage of multiple CE and HDL data from infectious disease cohorts enables researchers to obtain detailed information on the environment and lifestyle for thousands of individuals. The resulting high-dimensional data sets may then comprise different types of molecular measurements, such as DNA, RNA and protein. In this task we will explore how HDL data can meaningfully be combined across cohort studies and be used for risk prediction purposes. A key concern is the danger of overfitting, which occurs when statistical models overemphasize the associations present in the available cohort data and fail to provide accurate predictions for new individuals. It is, for instance, often possible to develop a model that fits the data perfectly even in the absence of any true association. Rigorous penalization and validation are therefore crucial to ensure that developed prediction models are externally valid. Additional concerns arise when pooling OMICS data from multiple cohorts. First, because the statistical power to detect predictive associations tends to increase, risk prediction models may become too complex and require a huge number of measurements to provide a personalized risk prediction. This may lead to unnecessary costs. Second, the presence of between-study heterogeneity across cohort studies may substantially affect the external validity of model predictions. For instance, recent studies have demonstrated that the performance of risk prediction models may vary according to the characteristics of gene expression data sets, and that these characteristics can vary within specific diseased populations. Heterogeneity in risk predictions may also appear when important causal pathways are ignored, or when measurements are of different quality across datasets. If heterogeneity in risk predictions is ignored, prediction models may have limited applicability and require substantial revision before they can be used in clinical practice. In this task we will evaluate and extend statistical methods for dimension reduction and penalization in a meta-analysis context, in order to enable the development of generalizable risk prediction models from sparse and heterogeneous samples. To this end, we will build upon group LASSO and grouped-regularized ridge regression, and implement new penalty measures to reduce between-study variation of prediction error.
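To make the idea of penalization plus validation across studies concrete, the following sketch holds out each study in turn, fits a penalised model on the remaining studies, and reports the prediction error in the held-out study (often called internal-external cross-validation). It deliberately swaps in a one-predictor ridge regression for the group LASSO and grouped-regularized ridge methods named above, and the three studies and their values are invented.

```python
def fit_ridge_1d(xs, ys, lam):
    """Ridge fit for a single predictor; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / (sxx + lam)  # lam = 0 reduces to ordinary least squares
    return my - slope * mx, slope


def mse(model, xs, ys):
    """Mean squared prediction error of `model` on (xs, ys)."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def internal_external_cv(studies, lam):
    """Hold out each study once; fit on the rest; return MSE per held-out study."""
    errors = {}
    for held_out in studies:
        train_x = [x for name, (xs, _) in studies.items() if name != held_out for x in xs]
        train_y = [y for name, (_, ys) in studies.items() if name != held_out for y in ys]
        model = fit_ridge_1d(train_x, train_y, lam)
        errors[held_out] = mse(model, *studies[held_out])
    return errors


# Hypothetical studies: study C follows a visibly different relationship.
STUDIES = {
    "A": ([0, 1, 2, 3], [0.1, 1.1, 1.9, 3.2]),
    "B": ([0, 1, 2, 3], [0.0, 0.9, 2.1, 2.9]),
    "C": ([0, 1, 2, 3], [1.0, 1.4, 2.2, 2.5]),
}
for lam in (0.0, 5.0):
    print(lam, internal_external_cv(STUDIES, lam))
```

The spread of the per-study errors is a crude indicator of between-study variation in prediction error, which is the quantity the proposed penalty measures aim to reduce.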

Plan for the use and dissemination of results

Task 7.3: Data Management Plan and plan for dissemination of results. We will also generate a plan for the use and dissemination of results, which will be available at the end of the project and will build on the dissemination via the website throughout the project (see task 7.4).

Final stakeholder conference and report

Task 6.2.3: Final stakeholder conference and dissemination of results. The final stakeholder conference will be organized together with the final consortium meeting and will also include the dissemination of the results (see also tasks 7.2 and 7.3).

Searchable and navigable human data hubs

Task 4.1: Develop and implement a standardized design for data hubs for controlled access human data.

We have put in place in the COMPARE project data hubs that serve as sharing points for pathogen-derived HDL data, supporting rich search and navigation functions, with options for pre-publication data sharing amongst collaborating groups. In order to support cohort studies around infectious disease, however, we must extend the data hub concept to support omics data of human origin, providing the same search, navigation and controlled sharing amongst members of collaborating groups, but respecting the far deeper security requirements associated with these data. Just as we have used the European Nucleotide Archive (ENA) as the foundation for the pathogen data hubs, we will leverage the European Genome-phenome Archive (EGA) as the foundation for human data hubs: the ENA provides permanent archiving of data intended (albeit after a period of pre-publication confidentiality) for open public access, while the EGA supports controlled access to data according to data access agreements put in place at the time of research subject recruitment and ethical planning processes. Data hubs will support the sharing of both primary data, as reported by those users generating the data, and the outputs of computational workflows for processing (such as quality control, human read alignment to reference and pathogen isolate assembly), as fed into the system by autonomous processes (see task 3.3). To complement functions already in place for pathogen data hubs, we will adapt existing interfaces (web and programmatic) for data reporting, search and navigation functions.

Work in this task will include: extension of the model to support controlled access EGA-based data hubs in addition to existing ENA-based data hubs; systems for rapid configuration of controlled access data hubs to fit the urgency of infectious diseases; security and authentication systems; data upload tools; and presentation of data hubs to secure cloud compute.

Report on stakeholder conference 2

Task 6.2.2: Stakeholder Conference 2: Specific issues when combining low CE and HDL data (M36). A selected group of cohort investigators from the 1st stakeholder conference (to be identified during the 1st stakeholder meeting) will participate in this meeting. The focus will be the human-pathogen divide and how to develop strategies to overcome a frequently encountered issue: participant-level data and high-dimensional data are stored, curated and searched in different platforms, often with poor infrastructure for combining them to construct relevant clinical phenotypes that can be interrogated from a high-dimensional data analysis perspective.

Statistical guidance and approaches for dealing with heterogeneity, missing data and measurement error in pooled cohort data sets

Task 3.3: Reconciling measurements of individual cohort participants across heterogeneous data sets.

Key problems when combining multiple data sources arise when cohort studies adopt different variable definitions or measurement methods, when data are prone to measurement error, or when studies are affected by missing data. For this reason, this work package will develop a statistical framework to simultaneously account for all of the aforementioned sources of uncertainty and bias. This framework will integrate state-of-the-art methods for dealing with missing data and measurement error, and extend them for application in heterogeneous data sets. Further, new multivariate meta-analysis methods will be developed to reconcile situations where standardization of certain variables is no longer feasible. These methods will adopt advanced penalization schemes to facilitate their applicability in sparse and high-dimensional data sets. Finally, we will integrate input from scientific experts (i.e. immunologists, virologists, statisticians and teams on the ground) to ensure that the underlying data generation processes are properly accounted for. The proposed framework will adopt a Bayesian estimation paradigm to simultaneously propagate all relevant sources of uncertainty and to adapt model complexity as new participant-level data, covariates and/or studies become available. Data curation and statistical methods will work in concert to ensure that both the model complexity and findings are based on the most recent data/evidence.
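For orientation only, the sketch below shows the simplest relative of the methods described here: a univariate random-effects meta-analysis using the DerSimonian-Laird moment estimator of between-study variance. It is a frequentist stand-in, not the Bayesian multivariate framework the task proposes, and the cohort estimates are invented.

```python
def dersimonian_laird(estimates, variances):
    """Random-effects pooling; returns (pooled estimate, standard error, tau^2)."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    # Cochran's Q measures dispersion of study estimates around the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    k = len(estimates)
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Re-weight each study by its within-study plus between-study variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se = (1.0 / sum(w_re)) ** 0.5
    return pooled, se, tau2


# Hypothetical cohort-level effect estimates and their variances.
pooled, se, tau2 = dersimonian_laird([0.30, 0.55, 0.10, 0.42], [0.02, 0.03, 0.01, 0.05])
print(round(pooled, 3), round(se, 3), round(tau2, 4))
```

A Bayesian hierarchical model plays the same role of separating within-study from between-study variation, while also propagating uncertainty about the heterogeneity itself.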

Report on the 1st round online survey and interviews related to perceived benefits and risks of sharing among cohort investigators

Task 2.1: Elaboration of steps needed for cohorts to participate in collaborative, decentralized platform. Perceived risk versus benefits of sharing CE and HDL data in cloud-based, federated repository.

This task aims to set the scene for the establishment of collaborative and decentralized data sharing platforms. While sharing CE and HDL data is supported and sometimes required by major funding agencies, journals, and other stakeholders, cohort study staff and leadership oftentimes do not see data sharing as a net benefit for themselves and their research agenda. Data sharing concerns related to intellectual property, ownership, authorship, and regulatory barriers need to be fully understood and addressed prospectively. Therefore, the focus of this task is to understand the perspective of the potential participants: individual cohort study staff and investigators. We will develop an assessment tool to elicit input and responses from a broad audience of cohort investigators, ranging from those who are motivated to share all of their data to those who are unwilling to share CE and/or HDL data. We will focus our assessment on understanding: (i) the perceived risks vs. benefits of sharing data in a cloud-based, federated repository; (ii) the perceived need for automated workflows that provide regular analyses of CE and HDL data and how these can be incorporated into the shared platform; (iii) the perceived need for collaborative analyses that leverage human HDL data for personalized medicine and how these can be facilitated through the shared platform; (iv) the perceived need for research capacity building in management of CE and HDL data and analysis; (v) the perceived need and strategies for sensitizing cohort staff to the utility of new forms of high-dimensional data for personalized medicine approaches (i.e. microbiome data). We will use an online survey to collect responses and allow for repeated input during the course of the research project.

This will be complemented with in-person meetings and especially with one stakeholder conference (see WP6). This task will be managed back-to-back with Task 5.1, dedicated to implementation of a Decentralized Organization. The two tasks will partially overlap, but it is critical to have these issues addressed both at a global unifying level and in the context of the local implementation of an adapted governance, to overcome obstacles arising at the local level. To help in this integration process, the same participants will participate in both tasks.

Report on connected data hubs available for deployment

Task 3.2: Connecting standards across the human and pathogen divide (crosstalk with WP4). This task aims to establish the necessary links between the clinical researchers, HDL data providers, and data consumers that provide additional analysis. In order to link relevant CE data from the cohorts with data and analytical outcomes from HDL data in the data hubs (WP4), an agreed set of metadata needs to be sharable. This may include details that are not routinely logged in the CE database (WP3) or with the biobank (WP5), for instance exact timing of the samples and storage conditions, as these may affect analytical outcomes. Similarly, agreed standards are needed for the HDL data and analytical outputs in order to make data sharable. The task includes discussions with both clinical and laboratory scientists for some of the main types of analysis that will be done on combined CE and HDL data, using case studies and pilot datasets, to gain a deep and shared understanding of the types of metadata that should optimally be collected and the options for sharing them through WP3, 4 or 5, in line with the governance developed in WP2.

Methodological guidelines and guidance to employ cutting-edge quasi-experiments using pooled cohort data

Task 3.4: Causal inference in pooled cohorts.

Proper understanding of causal mechanisms is crucial to deliver a personalized treatment strategy in individuals. Adoptions and further developments of methods that have originated in economics, political science and psychology have recently substantially expanded the methodological toolkit for causal inference in epidemiology and biostatistics. These methods include both non-experimental innovations, which lack the ability to control for unobserved confounding but improve our ability to control for observed confounders (including G-computation, marginal structural models (MSM), and novel variants of propensity score matching), as well as quasi-experimental approaches, which have the ability to control for unobserved confounding because they use quasi-random exposure assignments identified in cohort data. Some quasi-experimental methods have the ability to completely control for unobserved confounding, such as regression discontinuity, interrupted time series, and instrumental variable analysis, while others can only partially control for unobserved confounding, such as fixed-effects and difference-in-differences analysis.

In this task we will evaluate, adapt and extend existing methods for causal inference for use in pooled cohort data. These methods, when integrated in the development of risk prediction models, can directly be used to improve the estimation of personalized treatment response (see also Task 4.7). Based on this work, we will develop guidelines and guidance documents for causal inference using pooled cohort data. On the one hand, we will evaluate and adopt non-experimental methods, such as G-computation and MSM, for this purpose, and we will develop approaches to include causal inference methods in so-called one-stage meta-analysis models, where all cohorts are analyzed simultaneously rather than separately.

On the other hand, we will develop guidance on identifying quasi-experimental opportunities in pooled cohort data. While fixed-effects opportunities are often easy to identify and generate, because the units of observation in longitudinal data collection (here mostly individuals) are immediately discernible in pooled cohort data, difference-in-differences, instrumental variable, and regression discontinuity opportunities typically require integration of information external to the pooled cohort data. To increase the use of quasi-experimental approaches in pooled cohort data, we will establish conceptual and methodological guidance for identification of analytical opportunities and rigorous application, similar to a series that we recently edited in the Journal of Clinical Epidemiology but with specific view to pooled cohort data analysis. Pooled cohort analyses offer particularly promising, but also challenging, opportunities for quasi-experiments. Instrumental variables and regression discontinuity thresholds may vary across geography and time, requiring approaches that go beyond the current standard analyses to ensure poolability of analyses and results. A further set of innovations will be possible through the use of prospectively planned quasi-experiments. Often it is not possible to randomize an exposure of interest for legal, ethical or political reasons, for instance because an intervention is strongly believed to be effective. However, in these cases it will often be possible to induce an instrumental variable by assigning an exposure that significantly affects that intervention of interest but is highly unlikely to affect the outcome of interest directly. We will further improve the ability to carry out regression discontinuity analysis through prospective planning and data collection. This innovation allows for cost-efficient regression discontinuity, because data collection on specific exposure and outcome can target observations that lie close to the regression discontinuity threshold comp
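The induced-instrument idea described above can be illustrated with the simplest instrumental variable estimator, the Wald ratio: the effect of the instrument on the outcome divided by its effect on uptake of the intervention. This is a textbook sketch under the usual instrumental variable assumptions (relevance, exclusion, no confounding of the instrument), not the pooled-cohort method the task will develop; the data are invented so that the outcome equals 1 + 2 × intervention.

```python
def mean(values):
    return sum(values) / len(values)


def wald_iv_estimate(z, d, y):
    """Wald ratio for a binary instrument.

    z: instrument assignment (0/1), d: uptake of the intervention, y: outcome.
    """
    y1 = mean([yi for zi, yi in zip(z, y) if zi == 1])
    y0 = mean([yi for zi, yi in zip(z, y) if zi == 0])
    d1 = mean([di for zi, di in zip(z, d) if zi == 1])
    d0 = mean([di for zi, di in zip(z, d) if zi == 0])
    first_stage = d1 - d0  # how strongly the instrument shifts uptake
    if first_stage == 0:
        raise ValueError("instrument does not shift the intervention")
    return (y1 - y0) / first_stage


# Hypothetical data: encouragement (z) raises uptake from 25% to 75%.
z = [1, 1, 1, 1, 0, 0, 0, 0]
d = [1, 1, 1, 0, 1, 0, 0, 0]
y = [3, 3, 3, 1, 3, 1, 1, 1]
print(wald_iv_estimate(z, d, y))  # 2.0
```

In pooled cohort data the first-stage strength may differ by site and period, which is one reason the text calls for approaches beyond the standard single-study analysis.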

Report from cognitive interview study to explore cohort participants' understanding of broad consent-related language communicated during the informed consent process

Task 2.2. Ethical and cultural concerns regarding data and sample sharing are central to developing values-based, ethical and appropriate informed consent practices. Informed consent is of central importance to the ethical conduct of research (principle of respect for persons) and is important for enrollment and continuity of participation in cohort studies [1, 5]. A commonly used and widely accepted measure of consent for data sharing is broad consent, wherein consent is obtained at the same time for the primary study as well as for sharing and future use of the data [6]. The wording used for broad consent may explicitly state or imply that participants will not be re-contacted when sharing their data [6]. While several qualitative studies in low- and middle-income country (LMIC) settings have been conducted on the importance of consent for data sharing and the preferred methods for obtaining consent [1, 3-5, 7], very few have been conducted to assess participants' understanding of broad consent [2, 6]. One study conducted in Thailand among clinical trial participants from whom broad consent was obtained showed that they did not clearly understand data sharing and had difficulty recalling the study-related information that had been shared with them previously [2]. Broad consent for data sharing adds a layer of complexity to the consent process, as it involves explaining concepts that are often abstract to the participants [2]. We will conduct cognitive interviews with research participants who have consented to their or their children's participation in longstanding dengue and Zika virus focused cohorts in three countries, to explore their understanding of, and agreement with, the broad consent-related language in each cohort's respective informed consent form (ICF). This research responds to the challenges faced by researchers and ethics review committees, who must ensure the voluntary and informed participation in the research that they conduct or review.

Report on standardized data use agreement

Task 2.3: Ethical and legal framework and governance for cloud-based, federated storage and synthesis of laboratory and field data from infectious disease-related cohorts.

Equitable and acceptable governance that includes addressing ethical and legal issues at the cohort level is an essential precondition for sharing field and OMICS data. With cohort and stakeholder input, we will develop a data governance framework that is responsive to the barriers and enablers of data sharing. The cloud-based platform will allow for the development of cohort-specific, tiered permissions for different types of data in accordance with local and international standards for human subject research. The platform will use existing user authentication and data protection systems to ensure the appropriate use and the integrity of data contributed to the platform. Data transfer between cohorts and the cloud-based platform will be conducted through secure data transfer protocol following encryption of field and OMICS data. We will develop a standardized data use agreement that can be tailored for individual cohorts, and a system that can be deployed across cohorts to keep track of the dates and scope of existing data use agreements, ongoing and completed analyses, and related publications. Cohorts will always be able to access their own data and may be afforded priority access to cross-cohort data. In the federated system, requests for data access from multiple sites will be reviewed by a data access committee that includes experts from different infectious disease fields who can assess the scientific, statistical, and ethical soundness of proposed research, as further described in Task 5.1. Task 2.3 will provide a unifying framework for the functioning of the panel and the scientific analysis of requests, whilst Task 5.1 will ensure the inclusion and organization of local stakeholders to expedite the process and prevent bottlenecks. The data access committee will include representation from cohort staff and the open science community and will be informed by feedback from the participant advisory panel. The data access committee will review research proposals in a timely fashion using established criteria that include a commitment to strong ethical principles, and will maintain a publicly accessible list of ongoing research projects that includes their submission, revision, and approval dates to ensure transparency and accountability to both cohort studies and the larger scientific community. Projects that use cross-cohort field or OMICS data will be required to cite the repository and contributing cohort studies in any publication, and will be encouraged to include cohort study staff or principal investigators in analyses, interpretation of results, and related publications.

The governance model will draw from a similar initiative for the HDL data in the COMPARE model. In COMPARE, data generators decide what data are uploaded to a data hub (typically the HDL data with some CE metadata), and can choose who can access the data and for which analyses. Conditions for sharing in a data hub are codified through a consortium agreement that specifies a code of conduct for data use, including details on involvement of data providers and possible benefit sharing, such as publications or other uses of the data and analyses. New requests for data access are sent to the group of data hub owners. Requests for access can also come from groups that analyze rather than provide data, termed data consumers. In the data hub model, the expectation is that analysis outputs are shared in the data hubs and can be accessed and explored by all who have access. Data hubs are also configured with some automated analysis workflows, visualization software, and query tools. Hubs are provided with the understanding that the intent is to work towards release of de-identified data in as much as is possible. Data will be made available upon request through a Portal.

A separate agreement
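The tracking system for data use agreements mentioned above could, in its simplest form, be a register of records holding each agreement's dates, scope, analyses and publications. The sketch below is purely illustrative: the field names, data-type labels and example entries are assumptions and do not describe the platform's actual data model.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DataUseAgreement:
    """One agreement between a cohort and a requester (all fields hypothetical)."""
    cohort: str
    requester: str
    scope: frozenset  # data types covered, e.g. {"CE", "OMICS"}
    start: date
    end: date
    analyses: list = field(default_factory=list)
    publications: list = field(default_factory=list)

    def permits(self, data_type: str, on: date) -> bool:
        """True if the agreement covers `data_type` and is in force on `on`."""
        return data_type in self.scope and self.start <= on <= self.end


def active_agreements(register, on):
    """Agreements in force on a given date."""
    return [a for a in register if a.start <= on <= a.end]


REGISTER = [
    DataUseAgreement("cohort_a", "team_x", frozenset({"CE"}),
                     date(2021, 1, 1), date(2022, 12, 31)),
    DataUseAgreement("cohort_b", "team_x", frozenset({"CE", "OMICS"}),
                     date(2020, 6, 1), date(2021, 5, 31)),
]
REGISTER[0].analyses.append("pooled seroprevalence analysis (ongoing)")

print(REGISTER[0].permits("OMICS", date(2021, 7, 1)))       # False
print(len(active_agreements(REGISTER, date(2021, 7, 1))))   # 1
```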

Discoverable COVID-19 cohort and biobank sample data and metadata uploaded in the ‘cohort browser’

Task 8.1: COVID-19 cohort metadata in the cohort browser. Lead: EMBL; other beneficiaries and partners involved: UKHD, UCD, RI-MUHC, Maelstrom.

While completing Task 8.1, we will focus on collecting metadata from the existing COVID-19 cohorts that have been initiated, and those that will emerge over the next two years, by asking their coordinators to participate in the metadata sharing initiative. Mapping existing studies with human populations will be a first step towards engaging in individual-level data sharing, with a focus on clinical-epidemiological data as well as OMICS data types. We will catalogue and make discoverable non-sensitive metadata from COVID-19 cohorts through dissemination of an electronic survey to investigators identified through funders (e.g. EC), personal contacts, and clinical study registries (e.g. clinicaltrials.gov), and through extraction of publicly available metadata for studies that do not respond to the survey. We plan to advance towards reporting systems through which users will be able to declare and update new cohorts and study groups, COVID-19-specific repositories of clinical-epidemiological data, and biorepositories. The source databases that are able to systematically declare their holdings will be clearly reported when they become available from the Cohort Cloud (see WP4, task 4.5). These publishable data types (i.e. metadata, documentation of analyses, and related results) will be presented in the Cohort Browser that is under development in WP5 (task 5.2). For serological surveys, the biological and epidemiological sources of methodological heterogeneity need to be reconciled in order to generate a meaningful, harmonised SARS-CoV-2 European seroprevalence database that would allow the production of dynamic maps.

Wherever possible, we will collaborate with existing initiatives in order to avoid duplication of effort. For example, regarding the collection of metadata for COVID-19 research projects, we will synergise with ECRIN, an EC-funded platform on clinical trials. We will also work closely with existing (e.g. covid19hg.org) or planned initiatives (e.g. https://covid19crc.org) to prevent the proliferation of disparate COVID-19 data sharing platforms focused on the collection and harmonisation of COVID-19 clinical-epidemiological and omics data types.

Report on standards for harmonization / reconciliation of cohort data

Task 3.1: Standards for harmonization or reconciliation of CE cohort data.
We will develop standardized guidelines for variable definitions, data dictionaries, and protocols, which are relevant for prospective as well as retrospective harmonization of data. The rationale for this task is to go beyond "prospective" harmonization between cohorts, where researchers agree on standardized methods BEFORE the initiation of their cohort projects. We also plan to develop a roadmap for how heterogeneous data sets from different existing cohort projects with longstanding data collection can be pooled for specific analysis questions.
2.1.1 Prospective harmonization
To facilitate prospective cross-cohort synthesis, we will further develop standardized procedures for data capture and curation, including clinical and demographic variable standards and data dictionaries. We will extend this towards data entry and cleaning, sample collection, and laboratory external quality assessment. The use of standard definitions for data capture and management needs to be documented with the respective meta-data. We will base our work on existing CDISC/CDASH standards and develop additional standards when necessary. The investigators are involved in the ongoing harmonized protocols for Zika research and the Individual Participant Data Meta-Analysis (IPD-MA). The experience with these efforts will be integrated into this task.
2.1.2 Retrospective reconciliation
We will develop a roadmap towards synthesizing data across existing cohorts with heterogeneous protocols, variable dictionaries, and associated meta-data. Even if reconciliation will not be possible in all cases, the roadmap will include the definition of rules for minimum data sets that can be pooled for specific questions.
The assessment of the expected uncertainties when combining heterogeneous data and meta-data will include statistical reasoning, but the focus will be on the reconciliation of content: analyzing the clinical, epidemiological, and laboratory definitions used in the individual cohort studies this consortium has access to. One of the strategies will be to create secondary variables that transport the heterogeneity, because their definitions are broader than those of the initial primary variables. We will leverage our experience with Bayesian hierarchical models and missing data models to build a framework for this transportability. Specifically, an appropriate and data-determined amount of "borrowing of strength" will be used in having narrowly-defined variables (i.e., with cohort-specific definitions) contribute to knowledge of broadly-defined variables.
We will also reflect on the scientific questions most likely to emerge in outbreak situations (where the original focus of the consortium lies) and suggest specific reconciliation scenarios, for instance with respect to the need for seroprevalence and burden estimations. However, the benefit of retrospective reconciliation extends beyond outbreak scenarios, and the work will be extended to a set of concrete scientific questions selected by the participants. For these questions, we will analyze the influence of the heterogeneities of the primary variables on the outcome of interest.
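The idea of a broader secondary variable can be illustrated with a minimal Python sketch. Everything in it is a hypothetical assumption made for illustration: the two cohorts, their fever definitions, the field names (`temp_c`, `fever_self_report`), and the 38.0 °C threshold do not come from the project. It shows only the mapping step, not the Bayesian hierarchical modelling.

```python
from typing import Optional

# Hypothetical cohort-specific (primary) definitions of fever.
def fever_cohort_a(temp_c: Optional[float]) -> Optional[bool]:
    """Cohort A (assumed): measured temperature, fever if >= 38.0 C."""
    return None if temp_c is None else temp_c >= 38.0

def fever_cohort_b(self_report: Optional[str]) -> Optional[bool]:
    """Cohort B (assumed): self-reported fever recorded as 'yes' or 'no'."""
    if self_report is None:
        return None
    return self_report.strip().lower() == "yes"

def harmonise(record: dict) -> dict:
    """Map a cohort record to the broader secondary variable 'any_fever'.

    The source definition is kept alongside the value so that later
    analyses can still account for the heterogeneity.
    """
    if record["cohort"] == "A":
        value = fever_cohort_a(record.get("temp_c"))
        source = "measured_temp_ge_38C"
    elif record["cohort"] == "B":
        value = fever_cohort_b(record.get("fever_self_report"))
        source = "self_report"
    else:
        raise ValueError(f"no mapping defined for cohort {record['cohort']!r}")
    return {"cohort": record["cohort"], "any_fever": value,
            "source_definition": source}

records = [
    {"cohort": "A", "temp_c": 38.4},
    {"cohort": "A", "temp_c": 36.9},
    {"cohort": "B", "fever_self_report": "Yes"},
    {"cohort": "B", "fever_self_report": None},  # missing stays missing
]
harmonised = [harmonise(r) for r in records]
```

Keeping `source_definition` next to the pooled value is one possible design choice: it lets a downstream model treat the cohort-specific definition as a covariate or grouping level instead of discarding it.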

Report on recommended language for broad informed consent in prospective studies, and best practice for waiver of consent by ethics committees for retrospective data sharing

Task 2.4: Report on recommended language for broad informed consent in prospective studies, and best practice for waiver of consent by ethics committees for retrospective data sharing.
Description and proposed research and workflow: we will carry out a review of the legal, regulatory, and institutional environments of selected countries in the context of infectious disease preparedness and response plans. This review will inform the development of templates, broadly construed, focused on the language and key terms for:
- language used in broad consent for future uses and data sharing;
- requests for Research Ethics Board (REB) waivers of specific informed consent for retrospective use of data for access and sharing.
Implementation and methodology: legal, regulatory, and institutional evaluations; bibliography review on regulatory and procedural standardization; examination of existing documents and their language/terms to propose better templates; and empirical research to inform the development of templates when possible.
Results expected: we will generate a report that proposes appropriate sample language for obtaining informed consent for future prospective data sharing. The McMaster team will develop a paper that proposes best practices for broad informed consent and waiver of consent to enable data sharing, in partnership with WP2 members and ReCoDID project partner institutions in LMICs, for instance the University of Ceara (UECE) and Fiocruz Foundation in Brazil. This paper will leverage empirical research done by WP2 members and the experience of WP1 and WP2 in assisting ReCoDID project partner institutions to obtain waivers of consent to share data with the ReCoDID project and REB approvals in Colombia, Nicaragua, and Brazil.

Cohort-associated connected data hubs - adaptation and fine-tuning

Task 4.5: User-led secure computational environment and collaborative analysis platform.
We will put in place a computational environment (the cohort cloud) to support user-led computational analysis processes with regard to both CE and HDL cohort data (see Figure 5). This environment will enable secure access, with appropriate access control, to data held in the cohort data hubs and will complement the existing autonomous compute system provided in task 4.4. The cohort cloud will be architected specifically for infectious disease workloads and will provide the ability to work within and across ID cohort study data shared in cohort data hubs. For combined CE and HDL data, the system will provide users with on-demand HPC computing power and will use an infrastructure-control framework that allows users to manage and deploy a Tenancy (a user's own private cloud working environment), while prohibiting them from modifying the data or the compute framework around it, or from applying changes that affect the security and integrity of the overall cloud. The cohort cloud will save logs of all actions taken against all files, and of all executions of software against the data, as transactions with logical validation, distributed consensus (e.g. on whether an actor should be allowed access to the data), and digital signatures. The principles and rules behind the process are similar to the verification work that has been published and widely accepted in Blockchain technology.
For CE data alone, computing needs are reduced, and access with reduced internet speed will still enable the cohort researchers to run analysis scripts. The benefit of the combined environment is that the security and access settings are maintained.
The cohort cloud environment will make available comprehensive computational tools and applications of relevance to human and pathogen data analysis. These will be drawn from sources including the ELIXIR Registry of Tools and Data Services (https://bio.tools), which holds information on over 10,000 tools from across the field of Life Sciences curated by domain experts, and the Bioinformatics Toolbox assembled and curated under COMPARE.
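The logging principle described above, where each action is recorded as a verifiable transaction, can be sketched as a hash-chained audit log. This is a minimal illustration under stated assumptions, not the cohort cloud's implementation: the entry fields, the demo key, and the function names are hypothetical; an HMAC with a shared key stands in for true digital signatures; and distributed consensus is not modelled at all.

```python
import hashlib
import hmac
import json

# Assumption: a shared demo key. A real system would use per-actor
# asymmetric key pairs to obtain genuine digital signatures.
SECRET_KEY = b"demo-key-not-for-production"
GENESIS_HASH = "0" * 64
BODY_FIELDS = ("actor", "action", "target", "prev_hash")

def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the entry body."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def _tag(entry_hash: str) -> str:
    """Keyed tag over the entry hash (stand-in for a signature)."""
    return hmac.new(SECRET_KEY, entry_hash.encode("utf-8"),
                    hashlib.sha256).hexdigest()

def append_entry(log: list, actor: str, action: str, target: str) -> dict:
    """Append an action, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    body = {"actor": actor, "action": action,
            "target": target, "prev_hash": prev_hash}
    entry_hash = _digest(body)
    entry = {**body, "hash": entry_hash, "tag": _tag(entry_hash)}
    log.append(entry)
    return entry

def verify_log(log: list) -> bool:
    """Return True only if every entry is intact and correctly chained."""
    prev_hash = GENESIS_HASH
    for entry in log:
        body = {field: entry[field] for field in BODY_FIELDS}
        if entry["prev_hash"] != prev_hash:
            return False
        if _digest(body) != entry["hash"]:
            return False
        if not hmac.compare_digest(_tag(entry["hash"]), entry["tag"]):
            return False
        prev_hash = entry["hash"]
    return True

audit_log: list = []
append_entry(audit_log, "analyst-1", "read", "cohort_a/visits.csv")
append_entry(audit_log, "analyst-1", "execute", "scripts/model.py")
```

Because each entry stores the hash of its predecessor, changing any earlier record invalidates every later one, which is the property that makes such logs tamper-evident.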

Cohort-associated connected data hubs linking ‘OMICS and conventional cohort data

Task 4.2: Design and implement a connected data hubs infrastructure to enable cross-cohort OMICS research.
We will build infrastructure to support integration, at the level of cohorts, of data from pathogen and human sources. Provided with tiered security and access control systems, these connected data hubs will provide a single modality for discovery of and access to HDL data relating to an infectious disease cohort, with connectivity to conventional cohort data. We will offer web and programmatic interfaces across connected data hubs that allow search and retrieval, and will build support for data types of relevance to infectious disease studies, such as immune profiling and epigenomics. This task will include:
- an integrated data portal and API operating across human and pathogen data hubs;
- training materials and delivery;
- support for users of the connected data hubs; and
- support for project pilots and demonstrators.

Report on linkages between study-specific data hubs for omics-data types hosted by EMBL and study clinical-epidemiological data and metadata

Task 8.3: Linking clinical-epidemiological data with the EMBL data hubs for OMICS high-dimensional data for COVID-19 cohorts. Lead: EMBL; other beneficiaries and partners involved: UKHD, UCD, RI-MUHC, Maelstrom.
Task 8.3 focuses on integrating COVID-19 clinical-epidemiological data for EC-funded studies with viral and host high-density omics on the Data Hubs. These will leverage the existing COMPARE and SARS-CoV-2 Data Hub infrastructure for pathogens, host Data Hubs from WP4 task 4.1, and connected Data Hubs from WP4 task 4.2. With adaptation for both sensitive human research subject sources (tissues and primary cell lines) and openly sharable sources (most transformed cell lines and non-human hosts), we expect a number of connected Data Hubs across multiple studies.
Our work aims to provide clear linkages between clinical-epidemiological and omics data types. Cross-study or study-Open Science Community sharing of omics and clinical-epidemiological data is subject to very different PEARL barriers. Permissions and data storage for these data types may differ both within and across studies, depending on study team preferences and subject to national laws and national or local ethics review committee guidance. We will use existing infrastructure and leverage long-term investments in EMBL and Maelstrom to facilitate these linkages.
Prior work by the COMPARE Consortium demonstrates that the decentralised data hub structure proposed here facilitates data sharing by allowing different levels of sharing that depend on country, study type, data type, data recipient, etc., which builds data generators' confidence in the platform. The cloud-based federated platform leverages the significant compute resources needed for standalone analyses of omics data types and for other types of analyses (e.g. group Lasso) that leverage both clinical-epidemiological and omics data types for more accurate and precise individual-level predictions of the effectiveness of different treatments or the risk factors associated with severe disease outcomes or death.
As stated earlier, analyses and key findings will be considered an essential element of the portal's metadata. Data generators will work together with the EC to develop an agreed-upon approach to applications for data use for restricted data types. The process for applying to access the data, and the criteria and timeline for review, will be clearly documented to facilitate the use of harmonised data.

Launch of unified infectious disease-related cohort data portal

Task 5.2: Unified infectious disease-related cohort data portal.
We will implement a unified data portal providing a single point of access for researchers to comprehensive infectious disease cohorts, from the Consortium and beyond, combining detailed descriptions and direct access to datasets held in the underlying data repositories, the European Nucleotide Archive (ENA) and the European Genome-phenome Archive (EGA), and in the respective connected cohort data hubs built upon these repositories. Following an initial back-fill of the portal, new datasets from infectious disease-related cohorts will be automatically identified in each archive or data hub, catalogued in the portal's back-end database, and rapidly displayed through both the intuitive web portal interface and a well-documented programmatic API. This integration of data from multiple archive locations has been successfully implemented in previous large projects, such as the European Virus Archive or, more recently, HipSci, which provide extensive metadata, clear visualisation of the available datasets from a range of archive locations, and direct access to underlying data in each archive, including support for batch processing and information on applying for access to managed datasets. For this task we will specifically reuse technical components from the HipSci data portal, developed at EMBL-EBI.

Website of the project

Task 7.4: External and internal communication / Website of the project and dissemination via Twitter and Facebook.
An interactive website will be created and will be available for communication and dissemination of results. This will include extranet and intranet resources. Within the intranet, functionalities will include the possibility to find all the documentation needed for efficient management of the project, alongside the possibility to share useful documents with the members of the other WPs. This platform, accessible only to project members, will provide a single starting point to access internal and external resources such as partners' contacts and mailing lists, administrative documents (templates, budget plan, calendar, reporting tools), internal procedures, communication tools (logos, brochures, etc.), meeting minutes, and useful links. Dissemination of the final results of the project will be detailed in a dissemination plan, which will also lead into the final consortium meeting, where we will combine the final stakeholder conference with a dissemination of the results for the scientific public. With regard to external communication, the public website will host the main content related to the project, such as information about the consortium and partners involved, main objectives, latest news, and scientific advances in the field. The project's outputs will also be disseminated within the scientific community and to other external stakeholders through social media, such as Twitter and Facebook.

Data management plan

Task 7.3: Data Management Plan and plan for dissemination of results.
The data management plan is part of the initiative for open science and, as a mandatory deliverable, will be coordinated from WP7 in close collaboration with WP2-5, because the data generated in the project will emerge more in the form of a product (the searchable platform for CE and HDL data, with a federated data repository in the background).

Create operational decentralised Local Selection Panels in participating clinical centres

Task 5.1: Decentralised organisation for the coordinated management of data and biological resources.
Subtask 5.1.1: The main steps managed at the local level in clinical research programmes include obtaining clearance from local ethics committees and local regulators, enrolling patients, and collecting then managing both biological samples and clinical-epidemiological information from enrolees. Accordingly, optimising local resources implies a convergence between biological resources and clinical data repositories, which should be managed back-to-back.
We propose to implement a decentralised organisation for the management of data and biological resources that will take advantage of some of the concepts, tools, and processes previously developed for EVA and will adapt them to the specific case of sharing data and biological resources. We will generate a catalogue of biological resources and categorised data (see next section) as a part of an integrated process aiming at promoting sharing of data and biological resources. This will allow sustainably associating biobanking and clinical information. Requests for accessing data and/or biological material will be received via the internet portal and examined by a selection panel including representatives of local stakeholders of the research programme concerned and of the overarching cataloguing organisation. The selection panel will also include scientists who can assess the scientific and statistical soundness of requests. The composition of the local component of this selection committee will be pragmatically adapted to the local situation to include representatives of the relevant stakeholders, such as biobanks, clinical researchers, and possibly the local ethics committee. The role of the panel is to examine the application and either make a direct decision or, if necessary, provide recommendations or further requests to decision makers (e.g. the scientific committee of the biobank, or the local ethics committee if concerned). A standardised process and the presence of representatives of local decision makers should expedite this process. The selection panel will review research proposals in a timely fashion using established criteria and will maintain a publicly accessible list of ongoing research projects that includes their submission, revision, and approval dates, to ensure accountability to both cohort studies and the larger scientific community.
Criteria examined will include the academic or commercial nature of the demand, its scientific relevance and degree of priority for using the available resources, the categories of data that would be relevantly and safely transmitted to the applicant, the regulatory and ethical considerations related to material transfer, the respect of reciprocity in data sharing, etc. Where additional biological analyses are requested, the panel will also examine which analyses could be performed locally and funded by the applicant, and which would require transfer of the samples in compliance with the Nagoya Protocol. It will also examine how new data generated by the applicant could be produced in a standardised format and further enrich the local database. A document signed by the legal representatives of both parties should govern the process, clarify duties, and protect rights and ownership.
Subtask 5.1.2 includes creating, implementing, and improving the complete procedures and reference documents, and creating a Local Selection Panel (LSP) for the sites included in the study. It also implies promoting inter-site standardisation whilst taking into account the specificity of each local context.
Subtask 5.1.3: Legal aspects of harmonized biobanking and exchange of biological material.
Policies and Acts that guide biobanking and information sharing are rapidly evolving and represent significant constraints for both originators and applicants sharing data and biological material. It is important to provide to the project's partners a safe


Current trends in the application of causal inference methods to pooled longitudinal observational infectious disease studies-A protocol for a methodological systematic review.

Author(s): Heather Hufstedler; Ellicott C. Matthay; Sabahat Rahman; Valentijn M.T. de Jong; Harlan Campbell; Paul Gustafson; Thomas P. A. Debray; Thomas Jaenisch; Lauren Maxwell; Till Bärnighausen
Published in: PLoS ONE, Issue 1, 2021, ISSN 1932-6203
Publisher: Public Library of Science
DOI: 10.1371/journal.pone.0250778

A blank check or a global public good? A qualitative study of how ethics review committee members in Colombia weigh the risks and benefits of broad consent for data and sample sharing during a pandemic

Author(s): María Consuelo Miranda Montoya; Jackeline Bravo Chamorro; Luz Marina Leegstra; Deyanira Duque Ortiz; Lauren Maxwell
Published in: PLOS Global Public Health, 2022, ISSN 2767-3375
Publisher: PLOS Global Public Health
DOI: 10.1371/journal.pgph.0000364

The European Nucleotide Archive in 2021.

Author(s): Carla Cummins; Alisha Ahamed; Raheela Aslam; Josephine Burgin; Rajkumar Devraj; Ossama Edbali; Dipayan Gupta; Peter W. Harrison; Muhammad Haseeb; Sam Holt; Talal Ibrahim; Eugene Ivanov; Suran Jayathilaka; Vishnukumar Balavenkataraman Kadhirvelu; Simon Kay; Manish Kumar; Ankur Lathi; Rasko Leinonen; Fábio Madeira; Nandana Madhusoodanan; Milena Mansurova; Colman O’Cathail; Matt Pearce; Stephane P
Published in: Nucleic Acids Research, Issue 2, 2022, ISSN 0305-1048
Publisher: Oxford University Press
DOI: 10.1093/nar/gkab1051

Dealing with missing data using the Heckman selection model: methods primer for epidemiologists.

Author(s): Muñoz J, Hufstedler H, Gustafson P, Bärnighausen T, De Jong VMT, Debray TPA.
Published in: International Journal of Epidemiology, 2023, ISSN 1464-3685
Publisher: Oxford University Press
DOI: 10.1093/ije/dyac237

Continual updating and monitoring of clinical prediction models: time for dynamic prediction systems?

Author(s): David A. Jenkins; Glen P. Martin; Matthew Sperrin; Richard D Riley; Thomas P. A. Debray; Gary S. Collins; Niels Peek
Published in: Diagnostic and Prognostic Research, Vol 5, Issue 1, Pp 1-7, 2021, ISSN 2397-7523
Publisher: Diagnostic and Prognostic Research
DOI: 10.1186/s41512-020-00090-3

"How about me giving blood for the COVID vaccine and not being able to get vaccinated?" A cognitive interview study on understanding of and agreement with broad consent for future use of data and samples in Colombia and Nicaragua

Author(s): Lauren Maxwell; Jackeline Bravo Chamorro; Luz Marina Leegstra; Harold Suazo Laguna; María Consuelo Miranda Montoya
Published in: PLOS Global Public Health, Issue 1, 2023, ISSN 2767-3375
Publisher: PLOS Global Public Health
DOI: 10.1371/journal.pgph.0001253

The COVID-19 Data Portal: accelerating SARS-CoV-2 and COVID-19 research through rapid open access data sharing

Author(s): Harrison PW, Lopez R, Rahman N, Allen SG, Aslam R, Buso N, Cummins C, Fathy Y, Felix E, Glont M, Jayathilaka S, Kadam S, Kumar M, Lauer KB, Malhotra G, Mosaku A, Edbali O, Park YM, Parton A, Pearce M, Estrada Pena JF, Rossetto J, Russell C, Selvakumar S, Sitjà XP, Sokolov A, Thorne R, Ventouratou M, Walter P, Yordanova G, Zadissa A, Cochrane G, Blomberg N, Apweiler R.
Published in: Nucleic Acids Research, 2021, ISSN 0305-1048
Publisher: Oxford University Press
DOI: 10.1093/nar/gkab417

The European Nucleotide Archive in 2020.

Author(s): Peter W. Harrison; Alisha Ahamed; Raheela Aslam; Blaise T. F. Alako; Josephine Burgin; Nicola Buso; Mélanie Courtot; Jun Fan; Dipayan Gupta; Muhammad Haseeb; Sam Holt; Talal Ibrahim; Eugene Ivanov; Suran Jayathilaka; Vishnukumar Balavenkataraman Kadhirvelu; Manish Kumar; Rodrigo Lopez; Simon Kay; Rasko Leinonen; Xin Liu; Colman O’Cathail; Amir Pakseresht; Youngmi Park; Stephane Pesant; Nadim Ra
Published in: Nucleic Acids Research, Issue 2, 2020, ISSN 0305-1048
Publisher: Oxford University Press
DOI: 10.1093/nar/gkaa1028

Pre-pregnancy and pregnancy cohorts: a scoping review protocol

Author(s): Lauren Maxwell; Regina Gilyan; Sayali Arvind Chavan; Marwah Al-Zumair; Shaila Akter; Thomas Jaenisch
Published in: F1000Research 2021, Issue 1, 2021, ISSN 2046-1402
Publisher: F1000 Research Ltd.
DOI: 10.12688/f1000research.55501.1

Current trends in the application of causal inference methods to pooled longitudinal non-randomised data: a protocol for a methodological systematic review

Author(s): Edmund Yeboah, Nicole Sibilla Mauer, Heather Hufstedler, Sinclair Carr, Ellicott C Matthay, Lauren Maxwell, Sabahat Rahman, Thomas Debray, Valentijn M T de Jong, Harlan Campbell, Paul Gustafson, Thomas Jänisch, Till Bärnighausen
Published in: BMJ Open, 2021, ISSN 2044-6055
Publisher: BMJ Publishing Group
DOI: 10.1136/bmjopen-2021-052969

Clinical prediction models for mortality in patients with covid-19: external validation and individual participant data meta-analysis

Author(s): de Jong VMT, Rousset RZ, Antonio-Villa NE, Buenen AG, Calster BV, Bello-Chavolla OY, et al.
Published in: BMJ, 2022, ISSN 1756-1833
Publisher: BMJ
DOI: 10.1136/bmj-2021-069881

Transparent reporting of multivariable prediction models developed or validated using clustered data: TRIPOD-Cluster checklist

Author(s): Debray TPA, Collins GS, Riley RD, Snell KIE, Van Calster B, Reitsma JB, Moons KGM.
Published in: BMJ, 2023, ISSN 1756-1833
Publisher: BMJ
DOI: 10.1136/bmj-2022-071058

The European Bioinformatics Institute: empowering cooperation in response to a global health crisis

Author(s): Cantelli G, Cochrane G, Brooksbank C, McDonagh E, Flicek P, McEntyre J, Birney E, Apweiler R.
Published in: Nucleic Acids Research, 2021, ISSN 0305-1048
Publisher: Oxford University Press
DOI: 10.1093/nar/gkaa1077

Individual participant data meta‐analysis of intervention studies with time‐to‐event outcomes: A review of the methodology and an applied example

Author(s): Valentijn M.T. de Jong, Karel G.M. Moons, Richard D. Riley, Catrin Tudur Smith, Anthony G. Marson, Marinus J.C. Eijkemans, Thomas P.A. Debray
Published in: Research Synthesis Methods, Issue 11/2, 2020, Page(s) 148-168, ISSN 1759-2879
Publisher: John Wiley & Sons
DOI: 10.1002/jrsm.1384

The European Nucleotide Archive in 2019.

Author(s): Amid, Clara; Alako, Blaise T F; Balavenkataraman Kadhirvelu, Vishnukumar; Burdett, Tony; Burgin, Josephine; Fan, Jun; Harrison, Peter W; Holt, Sam; Hussein, Abdulrahman; Ivanov, Eugene; Jayathilaka, Suran; Kay, Simon; Keane, Thomas; Leinonen, Rasko; Liu, Xin; Martinez-Villacorta, Josue; Milano, Annalisa; Pakseresht, Amir; Rahman, Nadim; Rajan, Jeena; Reddy, Kethi; Richards, Edward; Smirnov, Dmitr
Published in: Nucleic Acids Research, Issue 2, 2020, ISSN 0305-1048
Publisher: Oxford University Press
DOI: 10.1093/nar/gkz1063

Propensity-based standardization to enhance the validation and interpretation of prediction model discrimination for a target population

Author(s): de Jong VMT, Hoogland J, Moons KGM, Riley RD, Nguyen TL, Debray TPA
Published in: Statistics in Medicine, 2023, ISSN 1097-0258
Publisher: Wiley
DOI: 10.1002/sim.9817

Adjusting for misclassification of an exposure in an individual participant data meta-analysis.

Author(s): de Jong VMT, Campbell H, Maxwell L, Jaenisch T, Gustafson P, Debray TPA.
Published in: Research Synthesis Methods, 2023, ISSN 1759-2887
Publisher: Wiley
DOI: 10.1002/jrsm.1606

Systematic Review Reveals Lack of Causal Methodology Applied to Pooled Longitudinal Observational Infectious Disease Studies.

Author(s): Hufstedler, H., Rahman, S., Danzer, A. M., Goymann, H., de Jong, V., Campbell, H., Gustafson, P., Debray, T., Jaenisch, T., Maxwell, L., Matthay, E. C., & Bärnighausen, T.
Published in: Journal of clinical epidemiology, 2022, ISSN 0895-4356
Publisher: Elsevier BV
DOI: 10.1016/j.jclinepi.2022.01.008

wAUC: Calculate a weighted AUC or c-statistic, exact or by resampling

Author(s): de Jong VMT
Published in: Github, 2022
Publisher: NA

Multiple imputation of incomplete multilevel data using Heckman selection models.

Author(s): Muñoz J, Egger M, Efthimiou O, Audigier V, De Jong VMT, Debray TPA.
Published in: Arxiv, 2023
Publisher: NA

Measurement error in meta-analysis (MEMA)—A Bayesian framework for continuous outcome data subject to non-differential measurement error.

Author(s): Campbell, H, de Jong, VMT, Maxwell, L, Jaenisch, T, Debray, TPA, Gustafson, P.
Published in: Research Synthesis Methods, 2022
Publisher: Wiley Online Library
DOI: 10.1002/jrsm.1515

Bayesian adjustment for preferential testing in estimating the COVID-19 infection fatality rate

Author(s): Campbell H; de Valpine P; Maxwell L; de Jong VMT; Debray TPA; Jänisch T; Gustafson P
Published in: arXiv preprint, Issue 1, 2021
Publisher: arXiv preprint

Bayesian adjustment for preferential testing in estimating infection fatality rates, as motivated by the COVID-19 pandemic.

Author(s): Campbell H, de Valpine P, Maxwell L, de Jong VMT, Debray TPA, Jaenisch T, et al.
Published in: The Annals of Applied Statistics, 2022, ISSN 1932-6157
Publisher: Institute of Mathematical Statistics
DOI: 10.1214/21-aoas1499
