Final Report Summary - ALMOND (Algorithms and Tools for Mining Biological Sequence Data)
The general goal of the ALMOND project was to carry out interdisciplinary research training in the area of bioinformatics algorithms, with a focus on modern computational methods for processing biosequence data. Bioinformatics is one of the flourishing research fields at the Computer Science Department of Ben-Gurion University (BGU) in Israel, which includes eight faculty members and several active, growing collaborations with biological partners. At the national scale, Israel is one of the leading countries in both bioinformatics research and sequence algorithms, with several world-class research groups led by distinguished scientists. The main rationale for this project was therefore to combine the expertise of the fellow with the local expertise and research environment, in order to extend the trainee's range of competences to new research areas and to increase the expertise of both the fellow and the host. The project was an opportune follow-up to a recently emerged French-Israeli collaboration that had already led to first results and opened up new research directions to explore.
More specifically, the goal of this research training was to define appropriate combinatorial models, efficient algorithms and software tools for several major problems in computational molecular biology. Special attention was given to developing efficient dynamic programming (DP) methods for different core bioinformatics problems, which is a particular area of expertise of the host group. The main scientific results of the project are summarized below.
- New efficient protein sequence comparison algorithms have been designed, taking into account additional biologically driven constraints. A new variant of the constrained protein sequence alignment problem has been studied, named Sequence Alignment with Regular Expression Path Constraint (SA-REPC), where the constraint specifies a path in the DP matrix, together with the possible sequences forming this path. Two efficient solutions to this problem have been obtained, one for general regular expressions and another for a specific class of regular expressions called "rigid patterns". The algorithms have been implemented in the CAlign software, which can be downloaded or queried from the host group's website http://www.cs.bgu.ac.il/~negevcb/CAlign. This work was done as part of a PhD thesis in the host group, co-supervised by the fellow.
- A novel approach to RNA sequence/structure comparison has been proposed for the case when the RNA sequences appear in a coding region. This situation occurs in many biologically important cases, such as in viruses or in bacteria (in particular, in relation to translational frameshifting). The proposed algorithm allows not only for a simple comparison of two RNA sequences, but also for a prediction of the most likely common ancestor of two orthologous RNAs. The method can therefore reconstruct hidden evolutionary relations between two RNAs when such a relation cannot be established using the usual comparison algorithms. The algorithm has been implemented and extensively tested on both simulated and real RNA sequences.
- Important contributions have been made to algorithm design for Next-Generation Sequencing (NGS) data. A new method for read mapping has been designed for the case when the maximal number of "errors" (SNPs or sequencing errors) in a read is bounded by a pre-defined constant. The method uses a "bidirectional" variant of succinct indexes (typically, the FM-index), where pattern search can alternate between the forward and backward directions. In this setting, an improved method has been proposed that outperforms existing algorithms, as has been shown both analytically and experimentally on real data.
- A new data structure for storing de Bruijn graphs has been designed. De Bruijn graphs are currently used by most practical genome assembly methods for NGS data, and their computational processing constitutes a major bottleneck in genome reconstruction. The proposed method achieved a 30% to 40% gain in memory space over the best existing method. The method has already been used in third-party software (Minia). The improvement was achieved through the technique of cascading Bloom filters, which turned out to have other important applications, in particular to compressing NGS data. A collaboration with researchers of Tel-Aviv University on this application has been established.
- A new, improved technique for computing overlap graphs has been proposed. An overlap graph represents (approximate) overlaps between reads of a given NGS dataset, and provides a basic construction for another approach to genome assembly, complementary to de Bruijn graphs. The proposed algorithm is based on new elaborate filtering techniques that make it possible to quickly select NGS reads potentially overlapping with a given read. The filters are a refined version of the so-called "suffix filters" proposed earlier for approximate string matching. It has been shown that our method brings an advantage over the best existing algorithms for computing overlap graphs.
- Several other bioinformatics problems have been studied during the training. One of them is the problem of phylogenetic fingerprinting, which arises, in particular, in predicting regulatory signals in DNA. A promising improvement has been proposed for solving this problem, which is still to be experimentally validated. Another subject is the reconstruction of viral quasi-species from NGS data. This important problem can be modeled in graph-theoretic terms and raises interesting computational questions.
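To make the read-mapping item above more concrete, the following is a minimal sketch of exact pattern counting by backward search on an FM-index. It is not the project's method: it searches in one direction only, allows no errors, and builds the index naively. The function names (build_fm_index, count_exact) and the "$" sentinel are assumptions made for this illustration.

```python
def build_fm_index(text, sentinel="$"):
    """Build the tables needed for backward search.

    Assumes `text` does not contain `sentinel`, and that the sentinel
    sorts before every other character (true for "$" versus A, C, G, T).
    """
    text += sentinel
    # Naive suffix array construction: fine for a demo, far too slow
    # and memory-hungry for real genomes.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first_row[c] = number of characters in text strictly smaller than c.
    first_row, total = {}, 0
    for ch in alphabet:
        first_row[ch] = total
        total += text.count(ch)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] for ch in alphabet}
    for b in bwt:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (b == ch))
    return first_row, occ, len(bwt)


def count_exact(pattern, first_row, occ, n):
    """Count exact occurrences of `pattern`, reading it right to left."""
    lo, hi = 0, n  # half-open interval of suffix array rows
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + occ[ch][lo]
        hi = first_row[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


first_row, occ, n = build_fm_index("ACGTACGTTACG")
print(count_exact("ACG", first_row, occ, n))  # 3
```

A bidirectional index additionally maintains the corresponding interval for the reversed text, so that a partial match can be extended at either end; that synchronization is omitted here.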
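The cascading Bloom filter idea mentioned in the de Bruijn graph item above can be illustrated with a small sketch. It assumes the usual setting in which graph traversal only ever queries true k-mers and their immediate neighbors, so only those "critical" false positives need to be corrected. The class and function names, the filter sizes, the number of hash functions and the number of levels are all assumptions chosen for illustration; this is not the implementation used in the project or in Minia.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # One salted hash per hash function (illustrative choice).
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def neighbors(kmer):
    """All k-mers that could be adjacent to `kmer` in the de Bruijn graph."""
    for base in "ACGT":
        yield kmer[1:] + base
        yield base + kmer[:-1]


def build_cascade(true_kmers, levels=3, bits_per_item=8, num_hashes=3):
    stored = set(true_kmers)
    # Non-existent k-mers that a traversal could query.
    other = {n for k in stored for n in neighbors(k)} - stored
    filters = []
    for _ in range(levels):
        bf = BloomFilter(max(8, bits_per_item * len(stored)), num_hashes)
        for item in stored:
            bf.add(item)
        filters.append(bf)
        # The next level stores the items of the opposite set that this
        # filter wrongly accepts; the roles of the two sets then swap.
        stored, other = {x for x in other if x in bf}, stored
    return filters, stored  # the last, small set is kept exactly


def contains(filters, exact, kmer):
    """Exact membership for true k-mers and their neighbors."""
    for level, bf in enumerate(filters):
        if kmer not in bf:
            # Rejected by an even level: absent. By an odd level: present.
            return level % 2 == 1
    return (kmer in exact) == (len(filters) % 2 == 0)


seq = "ACGTTGCATGGACCTAGGCTAACGTAGCTAGGATCCA"
kmers = {seq[i:i + 5] for i in range(len(seq) - 4)}
filters, exact = build_cascade(kmers)
print(all(contains(filters, exact, x) for x in kmers))  # True
```

Each level is built from fewer items than the level two steps before it, so the filters shrink quickly and the final exact set tends to be small. Queries outside the assumed domain (k-mers that are neither true nor neighbors of true k-mers) may still be answered incorrectly.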
Overall, the above results have been published in four journal papers and three papers in international conference proceedings that appeared during the period of the project. Among them, two journal papers (Journal of Computational Biology, Journal of Discrete Algorithms) and two conference presentations (Combinatorial Pattern Matching, String Processing and Information Retrieval) are joint work between the fellow and the host team. In addition, during the training, the fellow (co-)authored two more journal papers and four conference papers presenting results obtained within other collaborations. The results of the training have also been presented at several workshops and seminars, such as a Workshop on Algorithms for Genome Assembly (Bordeaux, May 2013) and the 8th Workshop on Compression, Text, and Algorithms (Jerusalem, October 2013). Some of the developed software can be found on the host group's website http://www.cs.bgu.ac.il/~negevcb.
In acknowledgement of the significance of the results obtained, the fellow has been invited to present them at several international workshops and schools, including a French-Russian workshop on Algorithms, complexity and applications (Moscow, June 2013), a Workshop on Combinatorial structures for sequence analysis in bioinformatics (Milan, November 2013), the 3rd French-Israeli Workshop on Foundations of Computer Science (Paris, May 2014), and the 7th International School "Computer Science Days" (Ekaterinburg, August 2014). During the fellowship period, the trainee was invited to serve on a large number of program committees of international conferences (IWOCA 2013, 2014, SPIRE 2014, CPM 2014, CSR 2014, 2015, WABI 2014, PSI 2014, SOFSEM 2013). He was a co-organizer of the SPIRE Workshop on Algorithmic Analysis of Biological Data (2012) and of the Dagstuhl Workshop on Combinatorics and Algorithmics of Strings (2014).
During the training, the fellow contributed to educational activities at the host university: he co-supervised one PhD student and one Master's student, and gave several plenary talks in the Department that drew a large audience of both computer scientists and biologists, including a large number of students. He also gave lectures to students of the Bachelor's and Master's programs. This meets the stated educational goals of the training.
Importantly, the training provided an opportunity for the fellow to interact with colleagues from other Israeli universities, namely Tel-Aviv University (TAU), Haifa University and Bar-Ilan University. He made several visits to TAU and Haifa University, and initiated new joint work with colleagues from TAU.
A number of studies undertaken during the training are subjects of ongoing or future work. To pursue these studies, a joint French-Israeli PICS proposal ("Projets Internationaux de Cooperation Scientifique") was submitted in June 2014 to the French CNRS and the Israeli Ministry of Sciences.
In summary, the objectives of the training have been clearly achieved: at BGU, the fellow acquired deep new knowledge of dynamic programming techniques, RNA bioinformatics, virus biology and bioinformatics, sequence algorithms and data structures. His research benefited from extensive interactions with researchers from BGU and other Israeli universities. Several high-level publications have already resulted from the training, and several others are in preparation. The results have a significant socio-economic impact, as some of the developed methods can be (and actually are) immediately applied to practical analyses of biological data. The fellow became acquainted with the dynamic Israeli educational system, which produces one of the highest startup creation rates in the world. He received training in teaching and student supervision. Finally, he created or strengthened his collaborations with the Israeli bioinformatics and sequence algorithms communities.