CORDIS - EU research results

ALgorithms for PAngenome Computational Analysis

Periodic Reporting for period 1 - ALPACA (ALgorithms for PAngenome Computational Analysis)

Reporting period: 2021-01-01 to 2022-12-31

Graph-based data structures, rather than sequence-based ones, have decisive benefits with respect to storage, primary analysis, comparison and knowledge extraction when dealing with large, biologically coherent collections of genomes ("pan-genomes"). Prominent examples include the systematic exploration of the genetic foundations of microbial resistance, the identification of rare diseases, and the study of the complexity of cancer, both on the individual level and on the level of cancer types and subtypes. With genome data rapidly amassing, the urgent need for a shift in computational paradigms, from ordinary sequence-based to graph-based representations of genome collections, is no longer deniable. Beyond the general acknowledgment of the movement, from which "computational pan-genomics" emerged as a computer-science-centric area of genomics research, high-impact journals are publishing special genome graph collections, thereby recognizing the importance of computer science.

The main objective of the project is to lead the paradigm shift from sequence- to graph-based representations of genomes. We will provide new graph-based representations of evolutionarily related collections of genomes, together with the computational operations that realize their practical benefits. This is instrumental for leveraging the potential of big genome data and for preventing serious congestion of resources. We will obtain decisive improvements in terms of 1) redundancy reduction and data compression, 2) the convenient highlighting of commonalities and differences, 3) visualization, and 4) comprehensive annotation. To amplify the benefits of those improvements, we will provide software implementations whose quality is competitive with that of sequence-based software packages in terms of computational complexity. The schematic in Figure 1 highlights a large set of operations for which efficient and reliable computational frameworks and algorithms are necessary.

In summary, the ITN will raise a new class of researchers who master the complexity of the era of computational pan-genomics and who are thus required to bring an innovative, unique set of skills: they must be both highly interdisciplinary and multi-specialized, able to address problems ranging from fundamental algorithms and data structures to software development, big data management and analysis, statistics and machine learning, all closely entangled with genomics and genetics, and bioinformatics in general, and able to bring together the academic and industrial sectors engaged in related business. This explains why interdisciplinary and intersectoral training of Early Stage Researchers (ESRs) is a compelling necessity for the future development of computational pan-genomics.
One of the main goals of the ALPACA ITN project, as defined in the Grant Agreement, is to raise a new generation of researchers who are experts in the field of computational pan-genomics and have a unique set of skills: the ability to address problems in highly interdisciplinary domains ranging from fundamental algorithms and data structures to software development, big data management and analysis, statistics and machine learning, all closely involved with genomics and genetics, and bioinformatics in general.

As one of the initial milestones of the project, we successfully recruited the best-qualified Early Stage Researchers for all 14 positions across 13 institutes and companies. The recruitment was completed by the end of December 2021, as planned in the Grant Agreement. For the recruitment, we followed the principles of the European Charter for Researchers and the Code of Conduct for the Recruitment of Researchers. We ensured that the recruitment procedure was open, transparent, and inclusive by publicly advertising all positions through various channels.

All ESRs have been enrolled in a PhD program at their host institution or in a doctoral school their hosts are affiliated with. Each ESR, together with their supervisor and co-supervisors, started to compile an Individual Research Project (IRP) as an outline for their PhD studies, including scientific goals, possible methods, and planned secondments. Along with the IRPs, Individual Career Development Plans (ICDPs) have been devised for each ESR, covering relevant individual course work and activities related to personal and professional development. Most of the IRPs and ICDPs were ready by the end of December 2021.

Since the start of the ALPACA project, in pursuit of its main objective, we have achieved decisive scientific advances in the field of computational pan-genomics. The results have so far been disseminated in 35 scientific articles, published in prominent journals and/or proceedings of top conferences, and in 2 invited talks. Among these publications, 15 introduce new software tools. These publications advance the underlying theory and/or introduce new software tools or data formats, both by expanding and improving existing concepts and by opening up entirely novel directions. The results span from constructing, updating, indexing and compressing pan-genome graphs to comparing pan-genome graphs and transforming them into input amenable to machine-learning-based applications. In Deliverable D6.8, Collective Scientific Advances Report 1, we go through the finer-grained objectives and Tasks defined in each Work Package and report the progress of each individual research project and the results achieved so far. Together with training, communication and dissemination activities, we are establishing ALPACA as a brand in the research community: a network of innovative researchers involved in the computational pan-genomics movement.

ALPACA has organised 3 training activities according to the original plan proposed in the Grant Agreement. These events comprise two Annual Workshops, one in each year, and a Summer School during the reporting period. In addition to the in-person training agenda, ALPACA holds a Virtual Seminar series featuring distinguished scientists and researchers in the field. The seminars take place online via Zoom and are open to the public. Recordings of the seminars are also published on our YouTube channel.

To enhance synergy among the individual research tasks and to broaden the individual experience of the ESRs, ALPACA is implementing the secondment plans, and ESRs have started visiting organisations in the academic and non-academic sectors in the second year of their PhD studies. We expect some minor changes to the secondment plans, compared to those stated in the Grant Agreement, based on individual backgrounds and skill sets and on the visiting opportunities available at the moment.
We believe that the scientific advances developed so far during the project already extend the state of the art. We aim to expand these advances further in order to accomplish the remaining Tasks defined in the scientific and non-scientific work packages and to publish more results by the end of the project. ALPACA will also become more involved in disseminating results, particularly to the general public, through different platforms, and will further expand inter- and intra-consortium collaborations. In addition, ALPACA plans to continue organising training events over the course of the project.
Figure 1: Illustration of operations to be supported by a pan-genome data structure