Periodic Report Summary 2 - CAUSALPATH (Next Generation Causal Analysis: Inspired by the Induction of Biological Pathways from Cytometry Data)
The CAUSALPATH project has completed its first 2.5 years. Its goals are to advance causal discovery methods by improving their robustness, learning performance, characterization of confidence in discoveries, applicability to mixed types of data, and applicability to collections of heterogeneous datasets in an integrative fashion, as well as their ability to handle feedback cycles, latent confounding variables, and time-course data. A further major goal is to co-evolve these methods with their application to mass and flow cytometry data, and to single-cell molecular data in general, using the induction of T cell differentiation and the corresponding signaling pathways as the system under study. In a series of papers, CAUSALPATH has made progress on all of these objectives. Highlights include the first application of causal discovery methods to mass cytometry data; an algorithm for integrative causal analysis of heterogeneous datasets based on conversion to mathematical logic; an R package, MXM, for causal discovery and feature selection with mixed types of data; and SCENERY, an open-architecture tool for (causal) network reconstruction and multivariate analysis of flow and mass cytometry data. CAUSALPATH has also organized two workshops on computational methods for single-cell data, MASSCAUSAL 1 and 2, which raised awareness of the challenges in this area and connected computational researchers with biologists.
New directions for CAUSALPATH, in line with its themes of causality, integration of heterogeneous datasets, and single-cell data, include: improving feature selection (the first step in a causal analysis), both in its scalability to Big Data and in finding multiple equivalent solutions; learning causal dynamical systems in the form of Ordinary Differential Equations or Stochastic Differential Equations; and learning latent representations, using Deep Learning and other techniques, from tens of thousands of publicly available biological datasets.