Genomic Computing (GeCo) is a new data-driven basic science for the management of sequence data. GeCo research is based on a simple driving principle: data should express high-level properties of DNA regions and samples, and high-level data management languages should express biological questions with simple, powerful, orthogonal abstractions. The essence of this research is to rediscover the simplicity of driving principles in data-driven computing. Following these principles, the GeCo project has produced four main outcomes:
1. Developing and exploiting a new core model for processed genomic data.
2. Developing new abstractions for querying and processing genomic data, by means of a declarative and abstract query language rich in high-level operations, with the objective of enabling a more powerful and at the same time simpler formulation of biological questions than the state of the art allows.
3. Bringing genomic computing to the cloud, within highly parallel, high-performance environments. New domain-specific optimization techniques push computational complexity to the underlying computing environment, producing optimal execution that is decoupled from the declarative specifications.
4. Providing an integrated repository of open data for secondary data use. During the development of the repository, we designed a unified conceptual model and an adaptable data integration pipeline, and then solved source-specific data transformation problems caused by the sources' idiosyncratic data formats, providing several foundational models and methods. The current repository is publicly available at PoliMi, with a replica at CINECA.
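The first two outcomes above can be illustrated with a toy sketch in Python. It is not the GeCo data model or its query language: the names `Region`, `Sample`, `select` and `count_overlaps`, the coordinate convention and the example values are all hypothetical, chosen only to show how samples that pair genomic regions with descriptive metadata can be queried through a few small, composable operations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A genomic interval. Half-open coordinates [start, stop) are an assumption."""
    chrom: str
    start: int
    stop: int
    strand: str = "*"


@dataclass
class Sample:
    """A sample pairs a set of regions with free-form descriptive metadata."""
    sample_id: str
    metadata: dict
    regions: list


def select(dataset, **conditions):
    """Keep the samples whose metadata matches every given key/value pair."""
    return [
        s for s in dataset
        if all(s.metadata.get(k) == v for k, v in conditions.items())
    ]


def overlaps(a, b):
    """True when two regions share at least one position on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.stop and b.start < a.stop


def count_overlaps(reference, experiment):
    """For each reference region, count the overlapping experiment regions."""
    return {
        r: sum(overlaps(r, e) for e in experiment.regions)
        for r in reference.regions
    }


# Hypothetical example data, invented for illustration only.
dataset = [
    Sample("s1", {"assay": "ChIP-seq", "cell": "K562"},
           [Region("chr1", 100, 200), Region("chr1", 150, 300),
            Region("chr2", 10, 50)]),
    Sample("s2", {"assay": "RNA-seq", "cell": "K562"},
           [Region("chr1", 500, 600)]),
]
genes = Sample("genes", {"type": "annotation"},
               [Region("chr1", 180, 400), Region("chr2", 60, 90)])

chip = select(dataset, assay="ChIP-seq")
counts = count_overlaps(genes, chip[0])
```

In this sketch the question "how many ChIP-seq regions fall on each annotated gene?" is stated as a metadata selection followed by a region-level operation, without saying how the computation is carried out. That separation is the kind of decoupling between declarative specification and execution that the third outcome refers to.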
During GeCo, we have contributed to basic science not only in computer science but also from an interdisciplinary point of view (targeting advances in biology and medical science), as we participate, through multidisciplinary collaborations, in studies that address biological or clinical problems. This interdisciplinary work inspired a new line of research aimed at developing GeCoAgent, a fully integrated, user-centred web platform that enables end users to work with GeCo technology through user-friendly interfaces: essentially, dialogic interfaces that drive data extraction and analysis, supported by a multi-modal dashboard that presents the results clearly.
With the outbreak of the pandemic, our efforts shifted. Having learned how to perform data integration, collection and search for the human genome, we started developing a coordinated collection of repositories and tools for viral sequences. We developed a Viral Conceptual Model (VCM) for virus sequences, then ViruSurf, a database that integrates data from the most widely used sources for depositing viral sequences (GenBank, CogUK, GISAID). We also implemented VirusViz, a search interface and a visual user interface, both accessible at public links; EpiSurf, for integrating viral sequences with IEDB (Immune Epitope Database); and ViruClust, for aggregated data analysis across viral lineages and in space and time.