Final Report Summary - SMAC (Statistical machine learning for complex biological data)
The SMAC project aimed to develop new machine learning approaches for learning from complex biological data, and to apply them to several important questions in biology and medicine. We obtained significant results on both the methodological side and the application side.
On the methodological side, our general approach was to formulate various problems as statistical inference and pattern recognition problems, and to solve them with regularized empirical risk minimization procedures. For example, we proposed a new machine learning formulation for learning from positive and unlabeled data, and demonstrated its state-of-the-art performance. We developed a new method, called TIGRESS, to estimate gene regulatory networks from gene expression data using sparse regression and randomization; it ranked 2nd at a DREAM challenge on gene network inference. We also developed a new approach that combines automatic segmentation and machine learning with human expert annotation to accurately identify change-points in DNA copy number profiles. We further focused on new data generated by high-throughput sequencing techniques. This work includes a new method (and an R package called flipflop) to predict and quantify isoforms from RNA-seq data by solving a sparse regression problem over an exponential number of features in polynomial time; a new method (and a Python program called PASTIS) to reconstruct the 3D structure of the genome from Hi-C data using a new statistical formulation of the problem; and a new statistical model (and an R package called ZINBWaVe) to analyse single-cell RNA-seq data. New methods for the integration of heterogeneous data represent another important methodological contribution of this project, including generic methodologies to combine empirical risk minimization methods with non-smooth penalties, as well as specific methods to integrate DNA mutations in cancer with gene networks, or gene expression levels with DNA methylation.
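As an illustration of what "sparse regression and randomization" can look like in practice, the following minimal Python sketch scores candidate regulators of one target gene by how often a Lasso selects them across random subsamples with randomly rescaled features. It is not the TIGRESS implementation: the Lasso solved by coordinate descent is a stand-in technique chosen for brevity, and all function names, the toy data and the parameter values (lam, the rescaling range 0.5 to 1, the number of rounds) are assumptions made for this example only.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t (proximal operator of the L1 penalty)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    resid = list(y)  # residual y - Xw, kept up to date as w changes
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the residual that ignores feature j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                w[j] = new_w
    return w


def selection_frequencies(X, y, lam, n_rounds=30, seed=0):
    """Fraction of randomized rounds in which each feature gets a nonzero weight."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_rounds):
        rows = rng.sample(range(n), n // 2)                # random half of the samples
        scale = [rng.uniform(0.5, 1.0) for _ in range(p)]  # random feature rescaling
        Xs = [[X[i][j] * scale[j] for j in range(p)] for i in rows]
        ys = [y[i] for i in rows]
        w = lasso_cd(Xs, ys, lam)
        for j in range(p):
            if w[j] != 0.0:
                counts[j] += 1
    return [c / n_rounds for c in counts]


def make_toy_data(n=200, p=5, seed=1):
    """Hypothetical data: the target gene depends on candidates 0 and 2 only."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [2.0 * row[0] - 1.5 * row[2] + rng.gauss(0.0, 0.1) for row in X]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    for j, f in enumerate(selection_frequencies(X, y, lam=0.3)):
        print(f"candidate regulator {j}: selected in {f:.0%} of rounds")
```

In a network inference setting, a procedure of this kind would be repeated with each gene in turn as the target, and the selection frequencies used to rank candidate regulatory edges. Averaging over randomized rounds makes the ranking less sensitive to the choice of regularization level than a single sparse fit would be.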
On the application side, several of the aforementioned methodological developments were carried out in close collaboration with biologists and medical doctors working on specific applications. In particular, we contributed to elucidating the regulatory network of early neural crest development. We are engaged in understanding epigenetic modifications in cancer, and have shown that genome-wide methylation patterns make it possible to distinguish true recurrences from new cancers in case of relapse. We have shown that the 3D organization of DNA in the nucleus of Plasmodium falciparum, the parasite responsible for malaria, plays an important role in gene regulation, in particular in how the parasite selectively controls the expression of its virulence genes. We have also developed a new approach to predict the effect of a molecule on a cell, based on the chemical properties of the molecule and on the genetic background of the cell, which ranked 2nd in the DREAM8 Toxicogenetics challenge.