Bioinformatics and Information Retrieval Data Structures Analysis and Design

Periodic Reporting for period 2 - BIRDS (Bioinformatics and Information Retrieval Data Structures Analysis and Design)

Reporting period: 2018-01-01 to 2019-12-31

Nowadays computers are used to process huge amounts of information. For example, a major search engine processes tens of thousands of searches per second, and more than 4 terabytes of Internet traffic is generated every minute. This volume of information is expected to grow tenfold over the next 10 years. We are just beginning to be capable of managing this huge amount of information. However, we are not prepared to address the big challenge posed by the explosion of data coming from DNA sequencing.

Thanks to the falling cost of genome sequencing, biological data doubles every six months. Recent studies estimate that there will be up to 2 billion human genomes sequenced by 2025, an increase of five orders of magnitude in 10 years. The increase could be even greater; for instance, understanding cancer will require sequencing thousands of cells from the same tumor.

Processing the amount of information that will make personalized medicine a reality requires a new approach. A new class of data structures has recently been developed to address the new challenges in storing, processing, indexing, searching and navigating biological data. Similar tasks have also been tackled by researchers in the information retrieval community. Synergies between researchers from both fields can lead to new efficient approaches that improve the technology used for the analysis of genome-scale data. Only by being capable of processing large amounts of information will we be able to understand diseases or the development, behavior and evolution of species.

The overall goal of BIRDS was to establish a long-term international network involving leading researchers in bioinformatics and information retrieval from four different continents, to strengthen the partnership through the exchange of knowledge and expertise, and to develop integrated approaches that improve on current ones in both fields. The project achieved all the goals planned regarding research, training and networking activities. The consortium has produced relevant scientific contributions, created new collaborations and enhanced existing ones, and proposed and implemented well-attended training and networking activities.

Regarding the first research task (Algorithms for Sequence Analysis), the results ranged from elementary problems and underlying data structures that form the basis of many bioinformatics tools to more complex tasks. The former include finding median strings, characterizing the language of sequence motifs, proposing new methods for order-preserving pattern matching, and obtaining a better characterization of link-cut trees and splay trees. The latter include discovering signaling pathways by analyzing protein interaction graphs, developing better genome assembly algorithms, and even achieving a better understanding of how the brain works by studying the human brain connectome.

For the second research task (Compression and indexing techniques for repetitive data), the results obtained demonstrated that compact data structures can be applied to exploit repetitiveness in three important application scenarios where such repetitiveness arises: (1) bioinformatics, (2) spatio-temporal data, and (3) trajectory data. These results came from collaborations that combined the expertise of the different groups involved, and the project greatly facilitated such collaborations.

Finally, the results of the third research task (Data structures and algorithms for networks analysis) included a new compact representation of trips over networks; efficient representations of multidimensional data over hierarchical domains; succinct data structures for self-indexing ternary relations, which can be applied to RDF datasets representing information from biological fields; and compressed representations of dynamic binary relations, in particular of graphs and networks that change over time, which can be applied to dynamic graphs and networks from any field.

We have developed several prototypes that integrate the research results of the consortium. In particular, these prototypes focused on three different use cases: an assembly tool, a trajectory management tool and a network visualization tool. In the first two cases, new easy-to-use, open-source web tools have been developed and made available to the public. The third prototype, a network visualization and management tool, ultimately became an integration of research results into a previously existing tool. This software is also publicly available as a downloadable tool or as a free online service.

BIRDS has generated new algorithms and data structures, some of them in the form of products, which have an impact on the research community and can be transferred to the market as complete tools, libraries or frameworks for bioinformatics tasks or for other fields of IR. Some of them are:

- Compressed representations of repetitive sequences and repetitive collections. Compression and indexing of repetitive sequences provide the basis for basic search tools, as well as building blocks for running complex sequence analysis algorithms in reduced space. Repetitive sequences and collections arise not only in biological data, but also in many other areas of IR. These data structures and algorithms can be exploited as tools in document-based databases, digital libraries and other document repository software, leading to overall improvements in the space usage and query efficiency of software tools.

- Compressed representation of trajectories. Storing the positions and trajectories of cars, ships, people, etc. is a common requirement in many systems, which need to manage very large amounts of data. Efficient compression of trajectories is an active research field, but most of the solutions for efficient representation have not moved beyond the research community. The results obtained can be commercialized and used by airlines, for example, to learn how they can optimize routes, where they can save fuel, or which flights had problems on their routes. Information on vehicle trajectories can also be used to track ships, detect fishing in a prohibited area, determine which routes are more popular, or pinpoint those that can be improved.

- Compressed representation of multidimensional data. New algorithms and data structures for the self-indexed representation of multidimensional data provide the basis for advanced information retrieval over aggregated data. Several preliminary research results have already been obtained on this topic.

- Compressed representations of binary and ternary relations, graphs and networks. The representation of binary/ternary relations has many applications in the field of information retrieval, and can be applied in practice as a building block for network analysis and for the representation of sequences in general information retrieval. Specific compressed representations of relations, graphs and networks have been obtained as preliminary research results in the project, and improvements are expected.

- New algorithms for sequence alignment/assembly, and sequence analysis. Sequence analysis and specific tasks such as sequence alignment or genome assembly are essential for bioinformatics research. We have developed new solutions that improve space and/or time efficiency over state-of-the-art solutions, which we expect to have a great impact on industry.

Image captions: outreach event; kick-off meeting; summer school; communication of the project at the coordinator's day; Y3 annual meeting; summer course; networking event; project logo; dissemination of results; communication of the project; mid-term review meeting