CORDIS - EU research results

Countering Confounding Heterogeneity in Phylogenetics through Non-Parametric Analyses of Quartet Split Patterns

Final Report Summary - COUNTERING LBA (Countering Confounding Heterogeneity in Phylogenetics through Non-Parametric Analyses of Quartet Split Patterns)

Phylogenetic tree reconstruction is a central task of modern biology. Phylogenetic trees are necessary for understanding many processes, such as character evolution, speciation, adaptation, the causes of evolutionary change, and the relation between genotype and phenotype. The assessment of phylogenetic relationships therefore provides a foundation for the interpretation of all comparative biological data. With substantial advances in sequencing technologies and computational power over the last 20 years, molecular data for phylogenetic tree inference have grown from single-gene analyses of a few taxa to phylogenomic analyses comprising hundreds of genes and taxa. However, systematic biases are likely to become more apparent, or even dominant, as data availability increases. In such cases, phylogenetic methods may be inconsistent because they cannot sufficiently account for the evolutionary complexity of genomic data. The consequence is strongly supported but incorrectly resolved phylogenetic relationships.

An important source of systematic bias, and probably the most frequently cited reason for the incorrect placement of taxa in phylogenetic reconstructions, is long branch attraction (LBA). LBA can be described as an inherent bias due to a combination of long and short evolutionary paths, in which random similarity based on convergent or parallel character changes leads to an artifactual phylogenetic grouping. Although LBA was first described more than 30 years ago, no method has yet been developed that can distinguish phylogenetic signal from pseudo-signal caused by systematic bias well enough to reliably identify sources of LBA in empirical data sets.

An alternative to phylogenetic reconstruction from complete data sets is the divide and conquer principle, which unites both classes of phylogenetic reconstruction by dividing the overall reconstruction problem into smaller subsets. Divide and conquer approaches anticipate that data subsets can be analysed more easily when treated separately.

The main objective of this project is to develop and evaluate new divide and conquer tree reconstruction algorithms that are based on specific split patterns (relationships supported by site patterns of nucleotides or amino acids) and that take the logic of phylogenetics into account in order to successfully overcome LBA.

To meet this objective, we developed a new quartet-based algorithm called PhyQuart, which combines Hennigian logic and Maximum Likelihood (ML) estimation. PhyQuart considers two alternative directions of character evolution along the internal branch of a quartet tree to discern between potentially apomorphic and plesiomorphic split-supporting site patterns, and uses ML to estimate the expected number of convergent split-supporting site patterns. This combination of Hennigian logic and ML estimation represents a completely new strategy for the evaluation of sequence data.
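The notion of a split-supporting site pattern can be made concrete with a minimal sketch. The code below is not PhyQuart: it only tallies, for four aligned nucleotide sequences, the sites whose pattern supports each of the three possible quartet splits, and it omits both the polarisation step (apomorphic versus plesiomorphic) and the ML-based estimate of convergent patterns described above. The function name and the split labels are hypothetical and chosen for illustration only.

```python
def split_pattern_counts(a, b, c, d):
    """Count sites supporting each of the three splits of the quartet A, B, C, D.

    Illustrative only: a raw tally of two-state site patterns,
    with no polarisation and no correction for convergence.
    """
    if not (len(a) == len(b) == len(c) == len(d)):
        raise ValueError("sequences must be aligned to equal length")
    counts = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    valid = set("ACGT")
    for w, x, y, z in zip(a.upper(), b.upper(), c.upper(), d.upper()):
        if not {w, x, y, z} <= valid:
            continue  # skip gaps and ambiguity codes
        if w == x and y == z and w != y:
            counts["AB|CD"] += 1  # pattern xxyy
        elif w == y and x == z and w != x:
            counts["AC|BD"] += 1  # pattern xyxy
        elif w == z and x == y and w != x:
            counts["AD|BC"] += 1  # pattern xyyx
    return counts


if __name__ == "__main__":
    print(split_pattern_counts("AAGTC", "AAGAC", "GCGTC", "GCAAC"))
```

Constant sites and sites in which only one sequence differs support none of the three splits and are therefore ignored by the tally.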

In extensive quartet simulations, including cases with strong branch length differences, we demonstrated the efficiency of the new approach in detecting phylogenetically informative and conflicting signals. Using 172,800 single-quartet simulations, we compared its performance with that of ML alone under a small degree of model misspecification. PhyQuart was successful in the majority of simulated cases, even when internal branches were kept very short. The simulations show that the reconstruction success of ML decreases with increasing branch length differences, even when model misspecification is only very minor, whereas the performance of PhyQuart is only slightly affected by more extreme branch length conditions.
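To illustrate why strong branch length differences are problematic, the following sketch simulates a single quartet and tallies raw split-supporting site patterns. It does not reproduce the project's simulation setup: the Jukes-Cantor (JC69) substitution model, the branch lengths, the number of sites, and all function names are assumptions made for this example. The true tree is ((A,B),(C,D)), with long terminal branches leading to A and C.

```python
import math
import random

BASES = "ACGT"


def jc69_change_prob(t):
    """Probability that a site shows a different base after branch length t (JC69)."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))


def evolve(state, t, rng):
    """Evolve one nucleotide along a branch of length t under JC69."""
    if rng.random() < jc69_change_prob(t):
        return rng.choice([b for b in BASES if b != state])
    return state


def simulate_quartet(n_sites, long_len, short_len, internal_len, seed=1):
    """Simulate the true tree ((A,B),(C,D)) and count raw split-supporting sites.

    A and C sit on branches of length long_len, B and D on short_len.
    All lengths are in expected substitutions per site (assumed values).
    """
    rng = random.Random(seed)
    counts = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for _ in range(n_sites):
        left = rng.choice(BASES)                 # internal node joining A and B
        right = evolve(left, internal_len, rng)  # internal node joining C and D
        a = evolve(left, long_len, rng)
        b = evolve(left, short_len, rng)
        c = evolve(right, long_len, rng)
        d = evolve(right, short_len, rng)
        if a == b and c == d and a != c:
            counts["AB|CD"] += 1
        elif a == c and b == d and a != b:
            counts["AC|BD"] += 1
        elif a == d and b == c and a != b:
            counts["AD|BC"] += 1
    return counts


if __name__ == "__main__":
    print("balanced:", simulate_quartet(5000, 0.05, 0.05, 0.05))
    print("long A/C:", simulate_quartet(5000, 1.5, 0.05, 0.05))
```

With balanced branches, the true split AB|CD collects most of the supporting sites. With long branches to A and C, convergent changes on the two long branches produce many sites that support the incorrect split AC|BD, which is the pseudo-signal that a method based on raw pattern counts cannot tell apart from genuine phylogenetic signal.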

The PhyQuart algorithm is implemented in a command-line-driven software script (PENGUIN) that runs on Windows, Mac OS and Linux operating systems and can easily be integrated into automated processing pipelines. PENGUIN writes information on split support for each possible quartet relationship between four taxa or clans to plain text files. Discrepancies in split support among the three possible quartet topologies of a set of four clans are also presented as split network and triangle graphs. A further vector network shows the distribution of best, second-best, and third-best resolved quartet trees. The software script, the corresponding manual, and example files can be downloaded from https://github.com/PatrickKueck/Penguin.
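One common way to display the relative support for three competing topologies in a triangle graph is to treat the three normalised support values as barycentric coordinates. The sketch below shows that general technique only. How PENGUIN actually computes its graphs is not stated here, and the function name, corner placement, and input values are assumptions.

```python
import math


def triangle_coordinates(support_1, support_2, support_3):
    """Map three non-negative support values to a point in a unit equilateral triangle.

    Assumed corner placement: topology 1 at (0, 0), topology 2 at (1, 0),
    topology 3 at (0.5, sqrt(3) / 2).
    """
    values = (support_1, support_2, support_3)
    total = sum(values)
    if min(values) < 0 or total <= 0:
        raise ValueError("support values must be non-negative and not all zero")
    _, w2, w3 = (v / total for v in values)
    x = w2 + 0.5 * w3
    y = (math.sqrt(3) / 2.0) * w3
    return x, y


if __name__ == "__main__":
    # Hypothetical support values for the three topologies of one quartet.
    print(triangle_coordinates(0.7, 0.2, 0.1))
```

A point near a corner indicates that one topology dominates, while a point near the centre indicates strongly conflicting support among the three topologies.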

In a further study, we show that PhyQuart allows the analysis of all quartets of taxa in larger trees, or of defined quartets of multiple-taxon clans, and therefore provides a new tool for evaluating contradicting signals, which can be used to assess the robustness of relationships within a more complex tree. In comprehensive simulation and empirical tests, PhyQuart identifies signal in the data where ML is misled by substitution rate heterogeneities, in some simulated cases even showing high support for the correct clan relationship where ML fails due to LBA. Our performance tests indicate that the stronger the contradicting signal that PhyQuart observes for possible clan relationships, the more suspect the reliability and branch support of a resolved tree or of a given a priori assumption. Regardless of whether clans are defined from an already reconstructed tree (a posteriori) or by a priori assumptions, the PENGUIN software allows the analysis of all quartets for multiple-taxon clans and can thus be used to assess the robustness of a given hypothesis or of relationships within a more complex tree.
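The two ways of forming quartets mentioned above can be sketched as follows. This is a minimal illustration of the enumeration only, not code taken from PENGUIN. The function names and taxon labels are hypothetical, and the clan-based variant assumes that one taxon is drawn from each of four non-overlapping clans.

```python
from itertools import combinations, product
from math import comb


def all_taxon_quartets(taxa):
    """Return every unordered set of four taxa drawn from a larger taxon set."""
    return list(combinations(sorted(set(taxa)), 4))


def clan_quartets(clan_a, clan_b, clan_c, clan_d):
    """Return every quartet that takes exactly one taxon from each of four clans."""
    clans = [set(clan_a), set(clan_b), set(clan_c), set(clan_d)]
    if sum(len(c) for c in clans) != len(set().union(*clans)):
        raise ValueError("clans must not share taxa")
    return list(product(*(sorted(c) for c in clans)))


if __name__ == "__main__":
    taxa = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"]
    quartets = all_taxon_quartets(taxa)
    print(len(quartets), "quartets, expected", comb(len(taxa), 4))
    print(len(clan_quartets(["t1", "t2"], ["t3"], ["t4", "t5", "t6"], ["t7", "t8"])))
```

The number of taxon quartets grows as n choose 4, so for large trees restricting the analysis to quartets of predefined clans keeps the number of quartet evaluations manageable.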

A new supertree algorithm based on the single-quartet split support values of the newly developed PhyQuart algorithm has been developed and implemented in a new software environment called 4BaSAl. The software is command-line driven and written in Julia. Starting from a triplet of sequences, 4BaSAl analyses a set of quartet trees, each of which is assumed, on the basis of the analysis, to be the best candidate for the true tree. Besides reconstructing complete trees, 4BaSAl can also be used as an evaluation tool to analyse the split robustness of internal branch relationships, or to identify and re-analyse only suspicious long-branched, and therefore potentially unreliable, taxon relationships of given topologies. 4BaSAl has been comprehensively tested on simulated 5-, 6-, and 8-taxon trees. According to first results, 4BaSAl is more efficient in finding correct long-branch relationships with the PhyQuart split algorithm than with ML. Additional extensive testing and publication of the approach and its performance results are planned for the second half of 2017.