## Final Report Summary - TREEMODELS (Algebraic statistics of general Markov models)

The project had the following four main objectives.

(1) To obtain the full geometric description of statistical models for rooted trees used in phylogenetic analysis (general Markov models) and to identify groups acting on the varieties defining these models such as the permutations arising from label switching.

(2) To apply the results from (1) to statistical inference, understand identifiability of general Markov models, give exact methods for estimation and model selection, and improve existing numerical algorithms.

(3) To propose an alternative and simpler phylogenetic tree model for modelling evolution of species in biology - this model would retain the biological interpretation but would avoid statistical deficiencies of general Markov models.

(4) To apply the techniques developed in solving (1,2,3) to algebraic geometry for the study of certain projective varieties.

Items (1) and (4) were completed during the first 24 months of the project and progress was made on items (2) and (3). During the last year further progress was made on items (2) and (3). The main achievements are detailed below, together with additional work related to the project but not explicitly mentioned in the original proposal.

The solutions to (1) provided in Zwiernik, Smith (2011) for the binary case and by Allman, Rhodes, Taylor (2013) for discrete random variables were complicated, and their application to inference was therefore very limited. A considerably simpler description is provided in [ARSZ] for models of star trees. The defining inequalities, up to a simple group action, come from log-supermodularity, which in statistics is often referred to as multivariate total positivity of order 2. This links to the classical statistics literature on positive dependence (Karlin, Rinott 1980) and was developed in [FLSUWZ].
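As an illustration of the log-supermodularity condition mentioned above, the following minimal sketch checks whether a discrete probability table p satisfies p(x) p(y) <= p(min(x, y)) p(max(x, y)) for all pairs of cells, with minimum and maximum taken coordinate-wise. The function name `is_log_supermodular` and the example tables are hypothetical and chosen only for illustration; this is not the method of [ARSZ] or [FLSUWZ], just a brute-force check of the definition.

```python
from itertools import product

import numpy as np


def is_log_supermodular(p, tol=1e-12):
    """Brute-force check of p(x) p(y) <= p(x ^ y) p(x v y) over all cell pairs.

    `p` is an array of non-negative numbers indexed by the states of the
    variables; meet (^) and join (v) are coordinate-wise min and max.
    """
    p = np.asarray(p, dtype=float)
    cells = list(product(*(range(k) for k in p.shape)))
    for x in cells:
        for y in cells:
            meet = tuple(min(a, b) for a, b in zip(x, y))
            join = tuple(max(a, b) for a, b in zip(x, y))
            if p[x] * p[y] > p[meet] * p[join] + tol:
                return False
    return True


# Hypothetical 2x2 tables: one positively and one negatively associated.
positive = np.array([[0.4, 0.1], [0.1, 0.4]])
negative = np.array([[0.1, 0.4], [0.4, 0.1]])

print(is_log_supermodular(positive))
print(is_log_supermodular(negative))
# Swapping the two labels of the first variable turns the second table
# into the first, a toy instance of a label-switching action.
print(is_log_supermodular(negative[::-1, :]))
```

The last line hints at why the inequalities are stated only up to a group action: a table that fails the condition may satisfy it after relabelling the states of some variables.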

For Item (2), results in [ARSZ] were used to gain a better understanding of existing inference methods and to propose new ones. For a simple and popular model, a closed-form solution to the maximum likelihood estimation (MLE) problem was found. Indeed, this model is a semialgebraic set whose boundary can be stratified into pieces of various dimensions. Over each stratum the MLE problem is easy to solve. This is now implemented in R and used to evaluate the performance of the EM algorithm (work in progress with S. Hosten, E. Allman and J. Rhodes). In [WMZ] the focus is on star-tree models with additional symmetries. The same authors are now extending this study to general tree models for discrete variables with linear dependencies between the variables. In particular, it turns out that for these models (which include binary latent tree models) the one-parameter likelihood function is unimodal.

For Item (3), a basic theory of so-called marginal supermodels was developed. Identifiability issues were fully addressed and a natural parametrization for doing inference was determined. Basic simulations showed that these models are significantly more robust in representing evolution than the standard phylogenetic tree models used in practice. Inference is closely related to the marginal likelihood method when applied to latent tree models. Together with E. Riccomagno, the fellow is studying this and other methods based on M-estimators and moment estimators. A journal paper is expected to be published in 2016.

The work on Item (4) produced intriguing links between statistics and algebraic geometry. In [MOZ] tree cumulants are used to show that the secant variety of the Segre variety is locally toric and to study the local properties of these secant varieties using toric geometry. Similar techniques were applied to other varieties in (Manivel, Michałek 2014) and other papers. A more general overview of statistics-based coordinate changes and how they can be used in algebraic geometry was given in [CCMRZ]. In [MSUZ] the authors developed a theory of exponential varieties, which are projective varieties with a distinguished set of positive points and which map bijectively to convex cones. Results about their geometry were directly inferred from the statistical theory of exponential families.

Attainment of milestones:

The two milestones listed for Item (1) were already superseded in 2013 by the work of Allman, Rhodes and Taylor. During the project the fellow extended and greatly simplified those results for some special cases in [ARSZ].

One paper was expected for Item (2), showing how a semialgebraic description enhances statistical inference. The papers [ARSZ] and [WMZ] partly show this. Two further papers are based on two main ideas: positive dependence [FLSUWZ] and linear dependence (joint with G. Marchetti and N. Wermuth).

For Item (3) there was one milestone to achieve. Many results have been obtained, and the main paper is still in preparation (joint work with E. Riccomagno).

The two milestones listed for Item (4) were achieved in [CCMRZ] and [ZUR]. Extensions can be found in papers of other authors and in [MSUZ]. Work in progress with M. Michałek will provide further results.

Additional work:

Latent tree models studied in this project are not constrained to discrete data. The Gaussian case, relevant to many applications in machine learning, can be usefully studied with the techniques developed during this project. The paper [DLWZ] deals with the large sample behaviour of the likelihood function for Gaussian latent tree models. Analytical formulas are provided and used for improving model selection procedures.

The connection between Gaussian latent tree models, Brownian motion tree models and linear covariance models is developed in [ZUR], where the authors study the likelihood function for linear covariance models and show that, although the likelihood function has multiple modes, with high probability its maximisation behaves like a convex programming problem.
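To make the statement about linear covariance models concrete, the standard Gaussian log-likelihood can be written out as follows. The notation is an assumption introduced here for illustration and is not taken from [ZUR]: $S$ denotes the sample covariance matrix of $n$ mean-zero observations, $G_1, \dots, G_r$ are fixed known symmetric matrices, and $\theta$ is the parameter vector.

```latex
\Sigma(\theta) \;=\; \sum_{i=1}^{r} \theta_i \, G_i ,
\qquad
\ell(\theta) \;=\; -\frac{n}{2}\Bigl[\, \log\det \Sigma(\theta)
      \;+\; \operatorname{tr}\bigl( S \, \Sigma(\theta)^{-1} \bigr) \Bigr]
      \;+\; \text{const}.
```

As a function of the concentration matrix $K = \Sigma^{-1}$, the expression $\log\det K - \operatorname{tr}(SK)$ is concave, but a constraint that is linear in $\Sigma$ is in general not linear in $K$. This is one way to see why the maximisation problem is not convex in general and why the likelihood can have several modes.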

For general Gaussian graphical models, the stabiliser of a general chain graph model was computed in [DZ]. This realises these popular models as transformation families and opens a path to constructing equivariant estimators.

Objectives (1) and (2) of this project were linked by a recent paper [ASSZ] where the authors provided the full semialgebraic description of Gaussian latent tree models and proposed a simple method for using this description for model inference. Positive dependence in the context of Markov structures is studied in [FLSUWZ]. The relationship to latent tree models is investigated with S. Lauritzen and C. Uhler in a paper that will be ready for publication at the beginning of 2016.
