CORDIS - EU research results

Development of tools for the analysis of molecular clock and their application for a case of Hepatitis C Virus epidemics with known infection history

Final Activity Report Summary - MOLECULARCLOCK TOOLS (Development of tools for the analysis of molecular clock and their application for a case of Hepatitis C Virus epidemics with known ... history)

The research concentrated on two areas: the analysis of the molecular evolution of Hepatitis C virus (HCV) and the development of statistical methods for testing phylogenetic trees.

We used sequences of the HCV E1E2 region to explore methods for estimating the timing of infection events in HCV epidemics. This research continued beyond the completion of the project. The method was applied to a very large sequence data set from almost 300 patients. Our analysis showed that the problems of rate heterogeneity among sites and the presence of selection could be resolved well enough to allow the use of the molecular clock at a short timescale, provided that the number of calibration sites was sufficient. This held especially for the relaxed clock model, which allowed different mutation rates on different branches of the tree. Using cross-validation, we further showed that the Bayesian method provided highly accurate estimates of infection dates. This method also offered the opportunity to apply molecular clock methods to epidemiological and forensic questions. The results demonstrated that methods that avoided the issue of possible positive selection, such as site stripping, did not provide an advantage. However, we also proved that the sites under selection did not carry temporal information.
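The cross-validation idea can be illustrated with a deliberately simplified stand-in. The sketch below assumes a strict clock fitted by least squares, not the Bayesian relaxed clock used in the project. Each patient with a known infection date is held out in turn, the rate is estimated from the remaining patients, and the held-out infection date is predicted from that rate. The function names and the example numbers are hypothetical.

```python
def estimate_rate(calibration):
    """Least-squares rate through the origin: distance = rate * elapsed_years.

    calibration: list of (elapsed_years, genetic_distance) pairs.
    """
    numerator = sum(t * d for t, d in calibration)
    denominator = sum(t * t for t, _ in calibration)
    return numerator / denominator


def leave_one_out_errors(patients):
    """Return predicted minus true infection year for each held-out patient.

    patients: list of (infection_year, sampling_year, genetic_distance).
    Assumption: a single strict-clock rate shared by all patients.
    """
    errors = []
    for i, (infection, sampling, distance) in enumerate(patients):
        # Calibrate on every patient except the one being predicted.
        training = [(s - f, d)
                    for j, (f, s, d) in enumerate(patients) if j != i]
        rate = estimate_rate(training)
        predicted_infection = sampling - distance / rate
        errors.append(predicted_infection - infection)
    return errors


# Made-up data that follow a clock of 0.002 substitutions/site/year exactly.
patients = [
    (1995.0, 2000.0, 0.010),
    (1990.0, 2000.0, 0.020),
    (1980.0, 2000.0, 0.040),
]
errors = leave_one_out_errors(patients)
print(errors)
```

With real data the errors would not vanish, and their spread would give a rough indication of how well infection dates can be recovered from the calibration set.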

In a parallel analysis, we compared the amino acid composition between samples taken at different time points from the same patient, and identified positions at which the composition differed significantly between the two time points. In total, 23 patients were analysed. The sequences analysed again corresponded to the E1E2 region, including hypervariable region 1 (HVR1) and HVR2. Interestingly, we identified a third region with features similar to those of HVR1 and HVR2, both previously described in the E2 protein, although its degree of variability was slightly lower. This could be explained by the reduced exposure of the antigenic site included in this new region, namely mAb7/16b, according to a structural model proposed for the E2 protein. The new region was termed HVR4.
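A per-position comparison of this kind might be sketched as follows. The report does not name the statistical test used, so the permutation test on a Pearson chi-square statistic shown here is an assumption, as are the function names, the significance threshold and the toy sequences. No multiple-testing correction is applied in this sketch.

```python
import random
from collections import Counter


def chi_square(sample_a, sample_b):
    """Pearson chi-square statistic for residue counts in two samples."""
    counts_a, counts_b = Counter(sample_a), Counter(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    total = n_a + n_b
    stat = 0.0
    for residue in set(counts_a) | set(counts_b):
        pooled = counts_a[residue] + counts_b[residue]
        for observed, n in ((counts_a[residue], n_a), (counts_b[residue], n_b)):
            expected = pooled * n / total
            stat += (observed - expected) ** 2 / expected
    return stat


def permutation_p_value(sample_a, sample_b, n_perm=2000, seed=0):
    """P-value obtained by shuffling residues between the two time points."""
    rng = random.Random(seed)
    observed = chi_square(sample_a, sample_b)
    pooled = list(sample_a) + list(sample_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled = chi_square(pooled[:len(sample_a)], pooled[len(sample_a):])
        if shuffled >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


def differing_positions(clones_t1, clones_t2, alpha=0.05):
    """Return 0-based alignment columns whose composition differs."""
    positions = []
    for pos in range(len(clones_t1[0])):
        column_1 = [seq[pos] for seq in clones_t1]
        column_2 = [seq[pos] for seq in clones_t2]
        if permutation_p_value(column_1, column_2) < alpha:
            positions.append(pos)
    return positions


# Toy aligned clones from one hypothetical patient at two time points.
time_point_1 = ["ETHV"] * 10
time_point_2 = ["ETRV"] * 10
changed = differing_positions(time_point_1, time_point_2)
print(changed)
```

Columns that are identical at both time points yield a p-value of 1 and are never reported. Only the column where the residue changes is flagged.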

Furthermore, several statistical procedures were proposed to test trees and construct confidence sets of topologies. Unfortunately, in some situations these tests gave contradictory results, and some had computational problems. For example, the expected likelihood weights test and the Swofford-Olsen-Waddell-Hillis (SOWH) test were computationally very intensive, while the generalised least-squares (LS) test required calculating the covariance matrix and its inverse, which might be impossible for data sets containing many closely related sequences. The weighted LS method differed from the generalised LS method in that it treated the distances between taxa as if they were independent, allowing a more efficient calculation of the test statistic.
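The difference between the two statistics can be made concrete with a minimal sketch, assuming the usual quadratic forms: the generalised LS statistic r'V^-1 r uses the full covariance matrix V of the pairwise distances, whereas the weighted LS statistic uses only its diagonal. The function names and the numbers are illustrative, not taken from the project, and the comparison of the statistic with a reference distribution is not shown.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        # Crude absolute tolerance, chosen only for this illustration.
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular or nearly so")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def gls_statistic(observed, fitted, cov):
    """Generalised LS: r' V^-1 r, which needs the full covariance matrix."""
    r = [o - f for o, f in zip(observed, fitted)]
    return sum(ri * yi for ri, yi in zip(r, solve(cov, r)))


def wls_statistic(observed, fitted, cov):
    """Weighted LS: treats distances as independent (diagonal of V only)."""
    return sum((o - f) ** 2 / cov[i][i]
               for i, (o, f) in enumerate(zip(observed, fitted)))


# Hypothetical pairwise distances: observed versus fitted on a candidate tree.
observed = [0.30, 0.45, 0.50]
fitted = [0.28, 0.47, 0.49]
independent = [[4e-4, 0.0, 0.0], [0.0, 4e-4, 0.0], [0.0, 0.0, 1e-4]]
# Two perfectly correlated distances, as near-identical sequences can produce.
duplicated = [[4e-4, 4e-4, 0.0], [4e-4, 4e-4, 0.0], [0.0, 0.0, 1e-4]]

print(wls_statistic(observed, fitted, independent))
print(gls_statistic(observed, fitted, independent))
print(wls_statistic(observed, fitted, duplicated))
```

When the distances really are independent, the two statistics coincide. With the `duplicated` matrix, `gls_statistic` raises a `ValueError` because the matrix cannot be inverted, while `wls_statistic` still returns a value, which is the practical appeal of the weighted approach.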

Moreover, using biological sequences, we compared the performance of the weighted LS method with the generalised LS method and with other methods of topology testing that depended on the character data. The first data set we considered was that of mammalian mitochondrial protein sequences. We then investigated the size of confidence sets of trees using eight-taxon data sets constructed from the data sets deposited in the European Molecular Biology Laboratory (EMBL) Align database. We also analysed a data set of a large number of short viral sequences, in which testing alternative phylogenies was vital for including patients in, or excluding them from, a hepatitis C outbreak.

We showed that the weighted LS method could provide a computationally efficient approximation to the generalised LS statistic. It was particularly useful in exploratory analysis of the size of the confidence sets of trees when assessing the phylogenetic signal in the data, and in cases where other methods were not available. We provided an example of such a data set, namely DNA-DNA hybridisation data obtained from four species of sand dollars, with a sea biscuit as an outgroup.

The weighted LS method developed as part of the project was applied in several analyses of sequence evolution in a wide spectrum of organisms, ranging from viruses to fish, and thus far resulted in four manuscripts. One was accepted by the time of the project completion, two were submitted and another two were in preparation. In addition, two papers on HCV and two papers on evolution were prepared.