Community Research and Development Information Service - CORDIS

ERC

GeCo Report Summary

Project ID: 693174
Funded under: H2020-EU.1.1.

Periodic Reporting for period 1 - GeCo (Data-Driven Genomic Computing)

Reporting period: 2016-09-01 to 2018-02-28

Summary of the context and overall objectives of the project

Genomic Computing (GeCo) is a new data-driven basic science for the management of sequence data. In GeCo, data express high-level properties of DNA regions and samples, and high-level data management languages express biological questions with simple, powerful, orthogonal abstractions. Along these principles, the GeCo project is building the following outcomes:

1. Developing a new core model for genomic processed data. There is a need for a simple data model that encompasses the diversity of data formats developed in the past, centred on the notion of a sample, which includes both genomic information (organized as regions of DNA or RNA) and metadata (generic properties of the sample, including biological and clinical properties and provenance); such a model makes data comparable across heterogeneous experiments. This contribution was developed between the application time and the start of the project, and appeared in the journal Methods, Dec. 2016, "Modeling and interoperability of heterogeneous genomic big data for integrative processing and querying". The conceptual model of metadata was presented at the International Conference on the Entity-Relationship Approach, Nov. 2017, "Conceptual modeling for genomics: Building an integrated repository of open data".
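The core idea of the model — a sample that pairs region data with free-form metadata — can be sketched in a few lines of Python. All names and fields below are illustrative assumptions for exposition, not the project's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    # A genomic region: coordinates plus format-specific attributes.
    chrom: str
    start: int                  # 0-based, inclusive
    stop: int                   # exclusive
    strand: str = "*"
    attributes: dict = field(default_factory=dict)   # e.g. {"score": 7.5}

@dataclass
class Sample:
    # A sample in the style of the model: region data plus metadata,
    # where metadata maps an attribute name to one or more values.
    regions: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

# A ChIP-seq peak sample (illustrative values).
s = Sample(
    regions=[Region("chr1", 1000, 1500, "+", {"score": 7.5})],
    metadata={"assay": ["ChIP-seq"], "antibody_target": ["CTCF"]},
)
print(len(s.regions), s.metadata["assay"][0])
```

Because every dataset — whatever its original file format — is reduced to this one shape, samples from heterogeneous experiments can be filtered and compared through their metadata alone.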

2. Developing new abstractions for querying and processing genomic data, by means of a declarative, abstract query language rich in high-level operations, with the objective of enabling a more powerful and at the same time simpler formulation of biological questions than the state of the art. Our first contribution, cited in the ERC proposal as reference [B2-1], appeared in the journal Bioinformatics, Feb. 2015, "GenoMetric Query Language: a novel approach to large-scale genomic data management". The language is under continuous development and is reported in a new publication, currently under submission. Client-side tools for visualizing genomic data and metadata were published in BMC Bioinformatics, 2017, "Explorative visual analytics on interval-based genomic data and their metadata". Indexing methods for supporting region-based computations on genomic datasets were published in the journal Information Sciences, 2017, "Indexing Next-Generation Sequencing data".
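A distinctive operation of such a language is the distance-based ("genometric") join: pairing regions of a reference dataset with regions of an experiment dataset that lie within a given distance. The Python sketch below illustrates only this pairing idea; the actual language's JOIN semantics are richer (e.g. minimum-distance and directional clauses), and the function name and region encoding are assumptions of this example:

```python
def genometric_join(reference, experiment, max_dist):
    """Pair each reference region with every experiment region on the
    same chromosome whose gap (distance between closest ends) is at
    most max_dist; the gap is zero or negative when regions overlap."""
    pairs = []
    for r in reference:                 # regions as (chrom, start, stop)
        for e in experiment:
            if r[0] != e[0]:            # different chromosome: skip
                continue
            gap = max(r[1], e[1]) - min(r[2], e[2])
            if gap <= max_dist:
                pairs.append((r, e))
    return pairs

ref = [("chr1", 100, 200)]
exp = [("chr1", 250, 300), ("chr1", 900, 950), ("chr2", 100, 200)]
# Only the first experiment region is within 100 bases of the reference.
print(genometric_join(ref, exp, 100))
```

The declarative language hides this nested-loop logic entirely: the user states only the distance predicate, and the system chooses an efficient evaluation strategy.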

3. Bringing genomic computing to the cloud, within highly parallel, high-performance environments. By using new domain-specific optimization techniques, computational complexity is pushed to the underlying computing environment, producing optimized execution that is decoupled from declarative specifications. We target open-source cloud computing environments with wide developer communities, so that our domain-specific work leverages the general progress of cloud computing. The main results appeared in IEEE Transactions on Computers, 2016, "Framework for Supporting Genomic Operations"; additional results appeared at the International Conference on High Performance Computing & Simulation (HPCS), June 2017, "Scalable Genomic Data Management System on the Cloud", and methods for parallel computing over genomic datasets were published at the Conference on Algorithms and Systems for MapReduce and Beyond, 2017, "Bi-Dimensional Binning for Big Genomic Datasets". Work on genomic computing was also the basis for evaluating different cloud frameworks, with one publication delivered between the submission and the start of the project, at the IEEE Big Data Conference 2016, "Evaluating Cloud Frameworks on Genomic Applications", and a second delivered during the project, at the International Conference on Web Engineering 2017, "Evaluating Big Data Genomic Applications on SciDB and Spark".
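The key enabler of parallel region computation is binning: partitioning the genome into fixed-width bins and replicating each region into every bin it overlaps, so that each bin can be processed independently by a different worker. The sketch below shows only this one-dimensional replicate-and-partition idea (the cited paper's bi-dimensional binning refines it further); the function name and parameters are assumptions of this example:

```python
def bin_regions(regions, bin_size):
    """Replicate each region into every fixed-width bin it overlaps,
    keyed by (chrom, bin_index). Each bin then holds all regions that
    could participate in an overlap join within that bin, so bins can
    be joined independently and in parallel."""
    bins = {}
    for chrom, start, stop in regions:        # half-open [start, stop)
        first = start // bin_size
        last = (stop - 1) // bin_size         # last bin the region touches
        for b in range(first, last + 1):
            bins.setdefault((chrom, b), []).append((chrom, start, stop))
    return bins

# The first region spans three bins of width 100 and is replicated into each.
bins = bin_regions([("chr1", 50, 260), ("chr1", 400, 450)], 100)
print(sorted(bins))
```

In a MapReduce or Spark setting the `(chrom, bin_index)` key becomes the shuffle key; a standard refinement is to report a joined pair only from the bin containing its leftmost overlap point, so replication does not produce duplicate results.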

4. Providing an integrated repository of open data, available for secondary data use, in accordance with our obligations on ethical issues as discussed in Section B6 of the Description of Action. Jointly with collaborators from Uninettuno, we delivered a method for converting and importing genomic datasets into our GeCo repository.

Work performed from the beginning of the project to the end of the period covered by the report and main results achieved so far

The major achievements of GeCo during the first 18-month period include the full deployment of GMQL Version 2.0 (R1.2) and the associated GDM model (R1.2). This prototype is open for public use through the current deployment at the CINECA supercomputing site; the system is currently supported by a Web-based interface and a Python library, and similar efforts are directed towards R and Galaxy compatibility. Although the repository of open data is targeted for the second reporting period (R4.1), a significant portion of that repository, including about 20 datasets and 15,000 samples, has been made available in accordance with the ethics requirements listed in Section B6 of the Description of the Action (these data are published by the TCGA, Roadmap Epigenomics and ENCODE consortia and openly licensed for secondary use). The action implementation has followed the workplan, with some activities initiated ahead of schedule.

Progress beyond the state of the art and expected potential impact (including the socio-economic impact and the wider societal implications of the project so far)

1. Providing the required abstractions and technological solutions for improving cooperation among research and clinical organizations (i.e., the members of the same research project or international consortium) through federated database solutions, in which each centre keeps data ownership while queries move to remote nodes for local execution, thus distributing genomic processing to the data.

2. Providing unified access to the new repositories of processed NGS data that are being created by worldwide consortia. Unified access requires breaking barriers of both data semantics and data access, so this work requires both ontological integration and new interaction protocols. Currently, metadata-driven access is supported at each individual repository through specific interfaces; this must be generalized and augmented with search methods. We also aim to provide user-friendly search interfaces on top of integrated repositories.

3. Promoting the evolution of knowledge sources into an Internet of Genomes, i.e. an ecosystem of interconnected repositories made available to the scientific community. The dream is to offer single points of access to genomic knowledge available worldwide, by leveraging new services such as metadata indexing and domain-specific crawlers, towards the vision of Google-like systems supporting keyword-based and region-based queries for finding genome data of interest, backed by large storage systems and techniques such as indexing and crawling.