# MED-CORE Report Summary

Project ID:
ICA3-CT-2002-10003

Funded under:
FP5-INCO 2

Country:
Italy

## Statistical methods to analyse angular data

The problem was to estimate the effects of multiple environmental variables on orientation in tests conducted under natural conditions. The effects of the factors on the circular response were analysed by assuming a Projected Normal distribution of the directions, instead of the usual von Mises distribution. We used a regression model proposed by Presnell et al. (Presnell B, Morrison S P, Littell R C, J. Amer. Stat. Ass. 93, 443: 1068-1077, 1998), which they call the Projected Multivariate Linear Model (PMLM). We chose this model because previously proposed methods suffered from difficulties in computing the parameter estimates (this is probably one of the reasons why regression methods are so little used in the analysis of directional data).

This is a parametric model which assumes that the directions in every combination of the factors follow a Projected Normal distribution, i.e. they are distributed like the projections onto the unit circle of points drawn from a bivariate normal distribution with identity covariance matrix. Any von Mises distribution is closely approximated by a Projected Normal with the same circular mean and mean resultant length. The Projected Normal distribution can be parameterized by the mean vector of the bivariate normal distribution. This mean vector is called the Projected Normal Parameter (PNP) for convenience.
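The projection construction can be illustrated with a minimal simulation sketch. It assumes a bivariate normal with identity covariance, and the names `sample_projected_normal`, `circular_mean` and the example PNP are hypothetical choices for illustration, not part of the original analysis.

```python
import math
import random

def sample_projected_normal(mu, n, rng):
    """Draw n angles (radians) by projecting N(mu, I) points onto the unit circle."""
    angles = []
    for _ in range(n):
        x = rng.gauss(mu[0], 1.0)  # first coordinate of the latent point
        y = rng.gauss(mu[1], 1.0)  # second coordinate of the latent point
        angles.append(math.atan2(y, x))  # keep only the direction
    return angles

def circular_mean(angles):
    """Direction of the resultant of the unit vectors (cos a, sin a)."""
    return math.atan2(sum(math.sin(a) for a in angles),
                      sum(math.cos(a) for a in angles))

rng = random.Random(0)
pnp = (2.0, 2.0)  # hypothetical PNP pointing at 45 degrees
angles = sample_projected_normal(pnp, 5000, rng)
sample_direction = circular_mean(angles)
print(round(math.degrees(sample_direction), 1))
```

With a reasonably large sample the circular mean of the simulated angles should lie close to the direction of the PNP.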

The PNP should be understood as a latent point in the plane whose polar coordinates indicate (a) the mean direction (the angle made by the vector) and (b) the concentration around this direction (the length of the vector): the further the point lies from the origin, the more concentrated the response. The model then assumes that the position of the mean of the bivariate normal is a function of the explanatory variables. Both the mean direction and the mean resultant length therefore depend on the explanatory variables (predictors).
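As a rough numerical illustration of how the length of the PNP controls concentration, the sketch below evaluates the Projected Normal density for an identity covariance matrix, f(θ) = exp(-|μ|²/2) / (2π) · [1 + a Φ(a) / φ(a)] with a = μ₁ cos θ + μ₂ sin θ, and integrates it numerically. The function names and the chosen lengths are assumptions made for this example only.

```python
import math

def pn_density(theta, mu):
    """Projected Normal density at angle theta, assuming identity covariance."""
    a = mu[0] * math.cos(theta) + mu[1] * math.sin(theta)
    phi = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)  # normal pdf
    Phi = 0.5 * math.erfc(-a / math.sqrt(2.0))               # normal cdf
    norm2 = mu[0] ** 2 + mu[1] ** 2
    return math.exp(-0.5 * norm2) / (2.0 * math.pi) * (1.0 + a * Phi / phi)

def pnp_to_polar(mu):
    """Return (mean direction in radians, length of the PNP)."""
    return math.atan2(mu[1], mu[0]), math.hypot(mu[0], mu[1])

def mean_resultant_length(mu, n_grid=2000):
    """Midpoint-rule approximation of E[cos(theta - mean direction)]."""
    direction, _ = pnp_to_polar(mu)
    step = 2.0 * math.pi / n_grid
    total = 0.0
    for k in range(n_grid):
        t = -math.pi + (k + 0.5) * step
        total += math.cos(t - direction) * pn_density(t, mu) * step
    return total

lengths = (0.5, 1.0, 2.0, 4.0)  # hypothetical PNP lengths along the x axis
concentrations = [mean_resultant_length((length, 0.0)) for length in lengths]
for length, rho in zip(lengths, concentrations):
    print(length, round(rho, 3))
```

The printed mean resultant lengths increase with the length of the PNP, which is the sense in which a point further from the origin corresponds to a more concentrated response.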

When all the explanatory variables are qualitative (i.e. factors), the model is analogous to a multiway analysis of variance model with or without interactions. A model with the full set of interactions specifies a different Projected Normal for each combination of the levels of the factors. Such a model can have a very large number of parameters (two for each cell), so simpler models are preferable if they do not fit significantly worse. The additive analysis of variance model predicts that the PNPs (the mean vectors of the bivariate normal) follow a simple "parallelogram rule", i.e. the points corresponding to each combination of levels of two factors must form a parallelogram.
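The parallelogram rule and the parameter counts can be made concrete with a small sketch for two factors with two levels each. The baseline and effect vectors below are invented for illustration, and the helper names are hypothetical.

```python
def additive_pnps(baseline, effects_a, effects_b):
    """Cell PNPs under an additive two-factor model: baseline + a_i + b_j."""
    return {
        (i, j): (baseline[0] + a[0] + b[0], baseline[1] + a[1] + b[1])
        for i, a in enumerate(effects_a)
        for j, b in enumerate(effects_b)
    }

def is_parallelogram(p00, p01, p10, p11, tol=1e-9):
    """True when the side p00->p01 equals the side p10->p11 as vectors."""
    dx = (p01[0] - p00[0]) - (p11[0] - p10[0])
    dy = (p01[1] - p00[1]) - (p11[1] - p10[1])
    return abs(dx) < tol and abs(dy) < tol

def n_parameters(levels_a, levels_b, interactions):
    """Two coordinates per free PNP term."""
    if interactions:
        return 2 * levels_a * levels_b
    return 2 * (1 + (levels_a - 1) + (levels_b - 1))

baseline = (1.0, 0.5)                   # hypothetical values
effects_a = [(0.0, 0.0), (0.8, -0.3)]   # first level taken as reference
effects_b = [(0.0, 0.0), (-0.4, 1.2)]
cells = additive_pnps(baseline, effects_a, effects_b)
additive_ok = is_parallelogram(cells[0, 0], cells[0, 1], cells[1, 0], cells[1, 1])

# An interaction displaces one cell and breaks the parallelogram.
shifted = (cells[1, 1][0] + 0.5, cells[1, 1][1])
interaction_ok = is_parallelogram(cells[0, 0], cells[0, 1], cells[1, 0], shifted)
print(additive_ok, interaction_ok, n_parameters(2, 2, True), n_parameters(2, 2, False))
```

Note that additivity holds for the PNPs in the plane, not for the mean directions themselves, so the angles of the four cells need not differ by constant amounts.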

The models are fitted by maximum likelihood using the EM algorithm. With this approach several multifactor analysis of variance models can be fitted to the data. One of the advantages of the proposed method is the possibility of exploring many models, including a large number of main effects and interactions. This involves a model selection stage in the analysis, in which several models are compared. All tests are based on the likelihood ratio statistic, with the relevant asymptotic chi-square null distribution. However, the choice of a final parsimonious model is based on the Akaike information criterion (AIC). Under this criterion, a kind of penalized likelihood that can also be used to compare non-nested models, the best models are those with the lowest AIC.
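The fitting and comparison steps can be sketched for the simplest case: one factor with two groups, comparing a shared PNP against one PNP per group. This is an illustrative sketch under the identity-covariance assumption, not the code used in the project. The E-step replaces each unobserved vector length by its conditional expectation a + Φ(a) / (φ(a) + a Φ(a)), with a the projection of the current PNP on the observed direction, and the M-step averages the completed vectors. All function names and the simulated data are hypothetical.

```python
import math
import random

SQRT2PI = math.sqrt(2.0 * math.pi)

def _phi(a):
    return math.exp(-0.5 * a * a) / SQRT2PI

def _Phi(a):
    return 0.5 * math.erfc(-a / math.sqrt(2.0))

def log_likelihood(thetas, mu):
    """Projected Normal log-likelihood (identity covariance) for one PNP."""
    norm2 = mu[0] ** 2 + mu[1] ** 2
    total = 0.0
    for t in thetas:
        a = mu[0] * math.cos(t) + mu[1] * math.sin(t)
        total += (-math.log(2.0 * math.pi) - 0.5 * norm2
                  + math.log1p(a * _Phi(a) / _phi(a)))
    return total

def em_fit(thetas, n_iter=200, tol=1e-10):
    """EM for a single PNP; returns the estimate and the log-likelihood path."""
    mu = (0.0, 0.0)
    history = [log_likelihood(thetas, mu)]
    for _ in range(n_iter):
        sx = sy = 0.0
        for t in thetas:
            c, s = math.cos(t), math.sin(t)
            a = mu[0] * c + mu[1] * s
            r = a + _Phi(a) / (_phi(a) + a * _Phi(a))  # E-step: expected length
            sx += r * c
            sy += r * s
        mu = (sx / len(thetas), sy / len(thetas))      # M-step: average vectors
        history.append(log_likelihood(thetas, mu))
        if history[-1] - history[-2] < tol:
            break
    return mu, history

def simulate(mu, n, rng):
    """Angles of points drawn from N(mu, I)."""
    return [math.atan2(rng.gauss(mu[1], 1.0), rng.gauss(mu[0], 1.0))
            for _ in range(n)]

def aic(loglik, n_params):
    return 2.0 * n_params - 2.0 * loglik

rng = random.Random(1)
group_a = simulate((2.0, 0.0), 150, rng)  # hypothetical groups, 90 degrees apart
group_b = simulate((0.0, 2.0), 150, rng)

mu_common, hist_common = em_fit(group_a + group_b)  # reduced model, 2 parameters
mu_a, hist_a = em_fit(group_a)                      # full model, 4 parameters
mu_b, hist_b = em_fit(group_b)
ll_reduced = hist_common[-1]
ll_full = hist_a[-1] + hist_b[-1]

lr_stat = 2.0 * (ll_full - ll_reduced)
p_value = math.exp(-0.5 * lr_stat)  # chi-square upper tail for 2 degrees of freedom
print(aic(ll_reduced, 2), aic(ll_full, 4), lr_stat, p_value)
```

The two models differ by two parameters, so the likelihood ratio statistic is referred to a chi-square distribution with 2 degrees of freedom, whose upper tail probability has the closed form exp(-x/2). A model with several factors would replace the M-step average by a least-squares regression of the completed vectors on the design matrix.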

There is no generally accepted measure of the overall fit of regression models for circular data. In this case the problem is complicated by the fact that the model allows varying concentrations. Note that the amount of unexplained variability can also be appraised from the estimated mean resultant lengths within the groups. The analysis of residuals should help in checking the model and in detecting systematic as well as isolated departures from it. Two important checks concern (a) the assumption of a Projected Normal distribution for the directions and (b) the additivity of the analysis of variance model. This could be the object of a future methodological study.
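One simple way to look at these quantities, offered here only as a sketch and not as an established goodness-of-fit measure, is to compute for each group the mean resultant length and the angular residuals, i.e. the observed directions minus the fitted mean direction, wrapped back onto the circle. The data and the names `wrap` and `group_diagnostics` are hypothetical.

```python
import math

def wrap(angle):
    """Wrap an angle in radians to the interval [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def group_diagnostics(thetas, fitted_direction):
    """Mean resultant length and angular residuals for one group."""
    n = len(thetas)
    c = sum(math.cos(t) for t in thetas) / n
    s = sum(math.sin(t) for t in thetas) / n
    residuals = [wrap(t - fitted_direction) for t in thetas]
    return {"mean_resultant_length": math.hypot(c, s), "residuals": residuals}

# Hypothetical directions in radians; 6.2 is just below a full turn.
observed = [0.1, -0.2, 0.3, 0.0, 6.2]
diagnostics = group_diagnostics(observed, fitted_direction=0.0)
print(diagnostics["mean_resultant_length"])
print([round(r, 3) for r in diagnostics["residuals"]])
```

A mean resultant length near 1 indicates little residual scatter within the group, while residuals that cluster away from zero or show isolated large values point to systematic or isolated departures from the fitted model.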
