Machine learning approaches to epigenomic research

Final Report Summary - EPIGENE INFORMATICS (Machine learning approaches to epigenomic research)

To understand how cells work, molecular biologists study chemical sequences from DNA, RNA, and proteins. In the last decade the field has seen a true revolution through the advent of high-throughput sequencing methods: at an ever-increasing pace, unprecedented amounts of measurements are generated in labs around the world and released into the public domain. While many experimental problems have been solved, molecular biologists are becoming awash with data. The trouble with these numerous and large data sets is that they are often hard to interpret, and biological insight is not easily accessible. The current bottleneck lies in the computational exploration of the data sets and the harvesting of the concealed biological information.
In this project, our objective was to use and adapt modern machine learning methodologies in order to extract hidden patterns from diverse sequencing data sets. One of our starting points was to find a well-defined way to compare complex measurements between two conditions, e.g. different cell types, or a healthy and a diseased state. In mathematical terms this is treated as statistical hypothesis testing: the challenge is not only to find differences between two measurements, but to find differences that can be attributed to the difference in experimental conditions. Measurement noise and natural variation between individual samples therefore need to be taken into account in the form of replicated experiments. A general framework for statistical testing on sequencing data is therefore pivotal to answering a diverse set of fundamental biological questions.
Any multi-cellular organism consists of many cell types that all contain identical copies of the individual's genomic DNA sequence. Nevertheless, different cell types can exhibit widely varying morphologies and functions. What distinguishes, for instance, a human liver cell from a heart cell is the set of genes that are active in one cell but not the other, and also the quantitative level of this activity in the respective cells. To some extent, the variation in gene expression is explainable by differences in the combinatorial binding of transcription factors. In higher organisms, however, the DNA is wound around so-called histone proteins and packaged tightly into structural units called nucleosomes. The control of gene expression therefore occurs in the dynamic context of granting or denying certain proteins access to the DNA to carry out active processes like transcription. This is accomplished by epigenetic mechanisms, which include chemical modifications of the DNA sequence itself, such as DNA methylation, and chemical modifications of the histone proteins, for example H3K4me3 (tri-methylation of lysine 4 of histone H3).
Importantly, epigenomic processes appear to be responsible for two seemingly opposing cellular properties, plasticity and stability: they enable differentiation during early embryonic development on the one hand, and reliable perpetuation of cell identity over many cell cycles on the other. Epigenomic effects are therefore fundamental to normal development, and erroneous patterns can cause abnormal activation or silencing of genes. As a consequence, many diseases have been associated with altered epigenetic states. For example, aberrant histone modifications have been linked to the development and progression of a variety of human cancers. This new understanding of cancer, which was traditionally seen as a genetic disease, has already led to the development of novel therapeutic approaches targeting the epigenetic machinery.
Despite these advances, the rules of the "epigenetic code" are still barely understood. This is partly due to the high complexity of epigenomic phenomena, which necessitates highly data-intensive experiments. With routinely used next-generation sequencing methods, these experiments have become feasible, opening up research opportunities that were unthinkable only a decade ago. With ChIP-Seq (chromatin immunoprecipitation followed by sequencing), it is now possible to examine a multitude of transcription factor binding sites as well as histone modifications across the complete genome and to compare them between different cell types. Similarly, BS-Seq (bisulfite sequencing) allows interrogating the methylation status of the genome at base-pair resolution.
With this project we contribute to the field by developing two new methods for the quantitative comparison of ChIP-Seq and BS-Seq profiles [1,2]. Both methods use a multivariate non-parametric approach that takes biological replicates into account to test for significant differences. Because they quantify shape changes in signal profiles, they overcome challenges imposed by the highly structured nature of the data and the small number of replicates. Our methods are robust, broadly applicable, and freely available to other researchers.
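As an illustration of what a kernel-based, non-parametric comparison of profile shapes can look like, the following Python sketch computes a standard kernel two-sample statistic, the squared maximum mean discrepancy (MMD), between sets of read positions, and contrasts between-condition with within-condition values across replicates. It is a minimal sketch under stated assumptions, not the MMDiff or M3D implementation: the Gaussian kernel, the 50 bp bandwidth, the simulated read positions, and all function names are hypothetical choices made for this example.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth):
    """Gaussian kernel matrix between two 1-D arrays of read positions."""
    diff = x[:, None] - y[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))

def mmd_squared(x, y, bandwidth=50.0):
    """Biased estimate of the squared MMD between two sets of positions.

    The bandwidth (in base pairs) is an assumed value for illustration.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean())

def simulate_reads(rng, centers, spread, n_reads):
    """Hypothetical toy data: read positions scattered around peak centers."""
    picks = rng.choice(centers, size=n_reads)
    return picks + rng.normal(0.0, spread, size=n_reads)

rng = np.random.default_rng(0)
# Condition A: one broad peak. Condition B: two narrower peaks.
# Both conditions have the same number of reads and two replicates each.
cond_a = [simulate_reads(rng, [500.0], 60.0, 200) for _ in range(2)]
cond_b = [simulate_reads(rng, [400.0, 600.0], 40.0, 200) for _ in range(2)]

# Replicate-to-replicate variation within each condition.
within = np.mean([mmd_squared(cond_a[0], cond_a[1]),
                  mmd_squared(cond_b[0], cond_b[1])])
# Differences between conditions, averaged over all replicate pairs.
between = np.mean([mmd_squared(a, b) for a in cond_a for b in cond_b])

print(f"within-condition MMD^2:  {within:.4f}")
print(f"between-condition MMD^2: {between:.4f}")
```

Because both simulated conditions contain the same number of reads, a large between-condition value reflects a change in profile shape rather than in total signal, while the within-condition value gives a rough scale for replicate-to-replicate variation.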
We have applied our method, MMDiff, to a ChIP-Seq data set examining the establishment of the important epigenomic mark H3K4me3 by an ‘epigenomic writer’ protein called Cfp1 [1]. We identified more than 1600 potential target regions of Cfp1, and by using an additional Pol II binding ChIP-Seq data set we show that detected changes in H3K4me3 correlate with changes in gene expression. We also used a number of computational analyses that allow us to link the detected molecular changes to the observed phenotype of Cfp1 depletion, demonstrating that MMDiff is capable of identifying biologically relevant shape changes in histone methylation patterns.
Similarly, we show that our second method, M3D, is able to detect higher-order changes in DNA methylation profiles [2]. The applied test statistic explicitly accounts for differences in coverage levels between samples, thus handling in a principled way a major confounder in the analysis of methylation data. We performed empirical tests on a number of data sets, which demonstrate that M3D has increased power compared to other methods, and is more robust with respect to coverage and replication levels.
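To see why coverage is a confounder in the first place, the following Python sketch simulates per-site methylation estimates at two coverage levels under a simple binomial read model. It does not reproduce the M3D test statistic: the binomial model, the true methylation level of 0.7, the coverage values, and the function names are all assumptions made for this example.

```python
import numpy as np

def methylation_level(methylated_reads, coverage):
    """Fraction of reads at a cytosine that support methylation."""
    return methylated_reads / coverage

def binomial_standard_error(level, coverage):
    """Standard error of the estimated level under a binomial read model."""
    return np.sqrt(level * (1.0 - level) / coverage)

rng = np.random.default_rng(1)
true_level = 0.7   # assumed true methylation level, identical in both samples
n_sites = 2000     # number of simulated cytosines

# Same underlying methylation, different sequencing depth.
low = methylation_level(rng.binomial(5, true_level, n_sites), 5)
high = methylation_level(rng.binomial(50, true_level, n_sites), 50)

spread_low = low.std()
spread_high = high.std()
print(f"coverage  5: empirical spread {spread_low:.3f}, "
      f"binomial SE {binomial_standard_error(true_level, 5):.3f}")
print(f"coverage 50: empirical spread {spread_high:.3f}, "
      f"binomial SE {binomial_standard_error(true_level, 50):.3f}")
```

Although both simulated samples share the same true methylation level, the low-coverage estimates scatter far more widely, so a comparison that ignores coverage can mistake sampling noise for a methylation change.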
Our results demonstrate the potential of non-parametric kernel methods to lead to novel biological insights from the analysis of ChIP-Seq and BS-Seq data.


[1] Schweikert, G., Cseke, B., Clouaire, T., Bird, A., and Sanguinetti, G. (2013). MMDiff: quantitative testing for shape changes in ChIP-Seq data sets. BMC Genomics, 14(1), 826.


[2] Mayo, T. R., Schweikert, G., and Sanguinetti, G. (2014). M3D: a kernel-based test for spatially correlated changes in methylation profiles. Bioinformatics, 31(6), 809-816.