CORDIS - EU research results

Data-Driven Genomic Computing

Periodic Reporting for period 4 - GeCo (Data-Driven Genomic Computing)

Reporting period: 2021-03-01 to 2021-08-31

Genomic Computing (GeCo) is a new data-driven basic science for the management of sequence data. GeCo research is based on a simple driving principle: data should express high-level properties of DNA regions and samples, and high-level data management languages should express biological questions with simple, powerful, orthogonal abstractions. The essence of this research is to rediscover the simplicity of driving principles in data-driven computing. Following these principles, the GeCo project has produced important outcomes:

1. Developing and exploiting a new core model for genomic processed data.

2. Developing new abstractions for querying and processing genomic data, by means of a declarative and abstract query language rich in high-level operations, with the objective of enabling a formulation of biological questions that is both more powerful and simpler than the state of the art.

3. Bringing genomic computing to the cloud, within highly parallel, high performance environments; by using new domain-specific optimization techniques, computational complexity is pushed to the underlying computing environment, producing optimal execution which is decoupled from declarative specifications.

4. Providing an integrated repository of open data, available for secondary data use. During the development of the repository, we addressed the design of a unified conceptual model and of an adaptable data integration pipeline, and then solved source-specific data transformation problems caused by the sources' very peculiar data formats, providing several foundational models and methods. The current repository of open data is publicly available at PoliMi, with a replica at CINECA.
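
The first two outcomes can be pictured with a minimal Python sketch. It is not the GDM schema or GMQL syntax; the names `Region`, `Sample`, `select` and `count_overlaps` are hypothetical and the data is invented. It only illustrates the idea of samples that carry both metadata and genomic regions, queried through a few high-level operations.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """A genomic interval with half-open coordinates [start, end)."""
    chrom: str
    start: int
    end: int

    def overlaps(self, other: "Region") -> bool:
        # Two intervals overlap when they share a chromosome and intersect.
        return (self.chrom == other.chrom
                and self.start < other.end
                and other.start < self.end)


@dataclass
class Sample:
    """One sample: free-form metadata plus the regions it contains."""
    metadata: dict
    regions: list = field(default_factory=list)


def select(samples, **conditions):
    """Keep the samples whose metadata matches every key=value condition."""
    return [s for s in samples
            if all(s.metadata.get(k) == v for k, v in conditions.items())]


def count_overlaps(reference, experiment):
    """For each reference region, count the overlapping experiment regions."""
    return {ref: sum(ref.overlaps(r) for r in experiment.regions)
            for ref in reference.regions}


# Invented toy data: two annotated intervals and two experimental samples.
genes = Sample({"type": "annotation"},
               [Region("chr1", 100, 200), Region("chr1", 500, 600)])
peaks = [
    Sample({"assay": "ChIP-seq", "tissue": "liver"},
           [Region("chr1", 150, 160), Region("chr1", 190, 210),
            Region("chr2", 150, 160)]),
    Sample({"assay": "ChIP-seq", "tissue": "brain"},
           [Region("chr1", 550, 560)]),
]

liver = select(peaks, tissue="liver")      # metadata-driven selection
counts = count_overlaps(genes, liver[0])   # region-driven computation
for region, n in counts.items():
    print(region, n)
```

The point of the sketch is the separation between metadata predicates, which choose samples, and region operations, which compute over intervals.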

During GeCo, we contributed to basic science not only in computer science but also from an interdisciplinary point of view (targeting advances in biology and medical science), as we participated, through multidisciplinary collaborations, in studies addressing biological or clinical problems. This interdisciplinary work inspired a new line of research aimed at developing GeCoAgent, a fully integrated, user-centred web platform that empowers end users to work with GeCo technology through user-friendly interfaces: essentially, dialogic interfaces that drive data extraction and analysis, supported by a multi-modal dashboard that presents results in a user-friendly way.

With the pandemic outbreak, efforts shifted. Having learned how to integrate, collect and search human genome data, we started developing a coordinated collection of repositories and tools for viral sequences. We developed a Viral Conceptual Model (VCM) for virus sequences, then ViruSurf, a database that integrates data from the most used sources for depositing viral sequences (GenBank, COG-UK, GISAID); we also implemented VirusViz, a search interface and a visual user interface, both accessible at public links, EpiSurf for integration of viral sequences with IEDB (Immune Epitope Database), and ViruClust for aggregated data analysis across viral lineages and in space and time.

The major achievements of GeCo confirm the project planning. The most significant results include the full deployment of GMQL and the associated GDM model, together with the open data repository GenoSurf. Collectively, these results form a significant suite of technological platforms supporting biologists in the tertiary analysis of big genomic datasets. The GMQL prototype is open for public use through the current deployments at the CINECA supercomputing center and at the Broad Institute (Cambridge, MA); the repository of open genomic data, used with GMQL, is available at PoliMi and CINECA and includes 67 datasets and about 240K files. After the COVID-19 pandemic outbreak, new research activities were opened for studying viral genomes, leading to the development of repositories and search interfaces for hosting viral sequences. We host the largest integrated and curated database of SARS-CoV-2 sequences/mutations in the world, and a number of tools for genome data analysis, with applications to the discovery of variant effects on the transmissivity and infectivity of the virus and on vaccine escape.

Results of the GeCo project include an impressive number of systems: the GMQL language and system, with access interfaces from Python and R and deployments at DEIB, Cineca and the Broad Institute (see: http://www.bioinformatics.deib.polimi.it/geco/?try); the GenoSurf search system, with API and documentation (also available at http://www.bioinformatics.deib.polimi.it/geco/?try); the viral tools ViruSurf, VirusViz, EpiSurf, and ViruClust, some with GISAID-specific versions (see: http://www.bioinformatics.deib.polimi.it/geco/?try_virus).

These results were published in more than 100 articles, including two articles in Nucleic Acids Research (IF 16.9), one article in Genome Biology (IF 13.5), one article in Nature Communications (IF 12.1), three articles in Briefings in Bioinformatics (IF 9.4), five articles in Bioinformatics (IF 5.6), and eight articles in Transactions published by IEEE or ACM.

The Workshop "Challenges in Data-Driven Genomic Computing" was held in Como, Villa del Grumello, on March 6-8, 2019 (see: http://www.bioinformatics.deib.polimi.it/geco/?workshop). The Workshop was attended by scholars from all over the world (Harvard, UPenn, NUS Singapore, UNIL, EPFL) and health scientists from hospitals, including Istituto Nazionale Tumori and Istituto Mario Negri.

Research was disseminated in about fifty keynotes presented by Stefano Ceri in various prestigious venues, including Dana-Farber Cancer Institute (part of the Harvard School of Medicine), Northeastern University in Boston, EPFL, ETH Zurich, TU-Berlin, Helsinki University, Duke-NUS Singapore, and many others (see: http://www.bioinformatics.deib.polimi.it/geco/?events).

1. Providing the required abstractions and technological solutions for improving the cooperation of research or clinical networks (i.e. the members of the same research project or international consortium) through federated database solutions, in which each center keeps ownership of its data, and queries move to remote nodes and are executed locally, thus distributing genomic processing to the data.

2. Providing unified access to the new repositories of processed NGS data which are being created by worldwide consortia. Unified access requires breaking barriers which depend both on data semantics and data access, so this work requires both ontological integration and new interaction protocols. Currently, metadata-driven access is supported at each individual repository through specific interfaces; this was greatly improved by providing a single interface and search methods.

3. Providing a replicable method for genomic data management; the method includes data modeling, design, integration, enrichment, publication and search; it has been successfully applied not only to human genomics but also to the viral genome.

4. Providing a novel methodology for closing the gap between end-users (clinicians and biologists) and technology, based on a clear workflow of iterative steps of data extraction and data analysis and on the adoption of user-friendly multimodal technology, orchestrated by a dialogic interface (chatbot) and providing synchronized data visualization panels.
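
The federated approach in the first item, where queries travel to the data and each center keeps its records, can be illustrated with a small Python sketch. It does not reflect the protocol actually used in GeCo: the `Node` class, the `federated_run` function and the toy records are all hypothetical, and in-process objects stand in for remote centers.

```python
from collections import Counter


class Node:
    """A hypothetical center that owns its data and never exports raw records."""

    def __init__(self, name, records):
        self.name = name
        self._records = records  # stays local to this node

    def execute(self, query):
        """Run the shipped query locally and return only its result."""
        return query(self._records)


def count_records_by_gene(records):
    """Example query: an aggregate that returns counts, not individual rows."""
    return Counter(r["gene"] for r in records)


def federated_run(nodes, query):
    """Send the same query to every node and merge the partial results."""
    total = Counter()
    for node in nodes:
        total += node.execute(query)
    return total


# Invented toy data held by two centers.
nodes = [
    Node("center_a", [{"gene": "TP53"}, {"gene": "BRCA1"}, {"gene": "TP53"}]),
    Node("center_b", [{"gene": "TP53"}, {"gene": "EGFR"}]),
]
result = federated_run(nodes, count_records_by_gene)
print(dict(result))
```

Only the per-node aggregates cross the boundary of each center; the coordinator never sees the underlying records.
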
GeCo Group, March 2019