Periodic Reporting for period 4 - GeCo (Data-Driven Genomic Computing)
Reporting period: 2021-03-01 to 2021-08-31
1. Developing and exploiting a new core model for genomic processed data.
2. Developing new abstractions for querying and processing genomic data, by means of a declarative and abstract query language rich in high-level operations, with the objective of enabling a formulation of biological questions that is both more powerful and simpler than the state of the art.
3. Bringing genomic computing to the cloud, within highly parallel, high-performance environments; by using new domain-specific optimization techniques, computational complexity is pushed to the underlying computing environment, producing optimized execution that is decoupled from the declarative specifications.
4. Providing an integrated repository of open data, available for secondary data use. During the development of the repository, we designed a unified conceptual model and an adaptable data integration pipeline, and then solved the source-specific data transformation problems caused by the sources' very peculiar data formats, providing several foundational models and methods. The current repository of open data is publicly available at PoliMi, with a replica at CINECA.
During GeCo, we contributed to basic science not only in computer science but also from an interdisciplinary point of view (targeting advances in biology and medical science), as we participated, through multidisciplinary collaborations, in studies addressing biological and clinical problems. This interdisciplinary work inspired a new line of research aimed at the development of GeCoAgent, a fully integrated, user-centred web platform that empowers end users to use GeCo technology through user-friendly interfaces: essentially, dialogic interfaces that drive data extraction and analysis, supported by a multi-modal dashboard that presents the results.
With the outbreak of the pandemic, our efforts shifted. As we had learned how to perform data integration, collection and search for the human genome, we started the development of a coordinated collection of repositories and tools for viral sequences. We developed a Viral Conceptual Model (VCM) for virus sequences, then ViruSurf, a database that integrates data from the most used sources for depositing viral sequences (GenBank, COG-UK, GISAID). We also implemented VirusViz, a search interface and a visual user interface, both accessible at public links; EpiSurf, for the integration of viral sequences with IEDB (Immune Epitope Database); and ViruClust, for aggregated data analysis across viral lineages and in space and time.
Results of the GeCo project include several systems: the GMQL language and system, with access interfaces from Python and R and deployments at DEIB, Cineca and the Broad Institute (see: http://www.bioinformatics.deib.polimi.it/geco/?try); the GenoSurf search system, with API and documentation (also available at http://www.bioinformatics.deib.polimi.it/geco/?try); and the viral tools ViruSurf, VirusViz, EpiSurf, and ViruClust, some with GISAID-specific versions (see: http://www.bioinformatics.deib.polimi.it/geco/?try_virus).
These results were published in more than 100 articles, including two articles in Nucleic Acids Research (IF 16.9), one article in Genome Biology (IF 13.5), one article in Nature Communications (IF 12.1), three articles in Briefings in Bioinformatics (IF 9.4), five articles in Bioinformatics (IF 5.6), and eight articles in Transactions published by IEEE or ACM.
The Workshop "Challenges in Data-Driven Genomic Computing" was held in Como, Villa del Grumello, on March 6-8, 2019 (see: http://www.bioinformatics.deib.polimi.it/geco/?workshop). The Workshop was attended by scholars from around the world (Harvard, UPenn, NUS Singapore, UNIL, EPFL) and by health scientists from hospitals and research institutes, including Istituto Nazionale Tumori and Istituto Mario Negri.
Research was disseminated in about fifty keynotes presented by Stefano Ceri at various prestigious venues, including Dana-Farber Cancer Institute (part of the Harvard School of Medicine), Northeastern University in Boston, EPFL, ETH Zurich, TU-Berlin, Helsinki University, Duke-NUS Singapore, and many others (see: http://www.bioinformatics.deib.polimi.it/geco/?events).
2. Providing unified access to the new repositories of processed NGS data which are being created by worldwide consortia. Unified access requires breaking barriers which depend both on data semantics and data access, so this work requires both ontological integration and new interaction protocols. Metadata-driven access is otherwise supported only at each individual repository, through repository-specific interfaces; GeCo greatly improved on this by providing a single interface and search methods.
3. Providing a replicable method for genomic data management; the method includes data modeling, design, integration, enrichment, publication and search; it has been successfully applied not only to human genomics but also to the viral genome.
4. Providing a novel methodology for closing the gap between end-users (clinicians and biologists) and technology, based on a clear workflow of iterative steps of data extraction and data analysis and on the adoption of user-friendly multimodal technology, orchestrated by a dialogic interface (chatbox) and providing synchronized data visualization panels.