## Final Activity Report Summary - StatInfPopGen (Statistical inference in population genetics)

Genes carry information not only in the usual sense, by coding for the proteins and regulatory molecules that are basic to the development and maintenance of living organisms, but also at the population and species level: genetic variation among populations and species provides insights into their origins and, ultimately, their past history. Population genetics theory, which has been developed over the past century, has thoroughly analysed the effects of major events in a species’ history on the genetic variation of populations. These events include changes in population size and the divergence or admixture of populations. Since the data are allele frequencies of marker genes, the conclusions of such studies clearly depend on the way these genes evolve through mutation.

Although this theory was not new, experimental population geneticists, who are now able to gather large amounts of molecular marker data, needed methods to use these data efficiently when tackling questions about the history of populations. By ‘efficient use’, we mean that quantitative answers must be given, such as the relative probabilities of two or more possible histories or the time at which two populations diverged. To obtain answers of this type, a statistical approach had to be developed.

In this project, we focussed on two specific approaches, each combining three models: a model of population history, a model of gene evolution (the mutation model) and a model of gene genealogy in populations (the coalescent model). As in many fields of science, the complexity of the problems prevented us from using exact solutions; hence both approaches were based on computer-intensive stochastic simulations. In the first approach, gene genealogies compatible with the data and the population history were simulated in order to estimate the likelihood of the data, i.e. the probability of the data given the models. In the second approach, the data themselves were simulated and the inference was drawn from their degree of similarity to the observed data. These two approaches had been developed for simplistic population histories involving only one or two populations. Our main contribution in this project was to develop solutions and software for dealing with more complex, and hence more realistic, histories. As an example, our software was used to choose among seven possible histories of African Pygmies based on genetic data from 21 populations.
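
To make the second approach more concrete, the following minimal sketch simulates data under a coalescent model and keeps only the parameter values whose simulated data resemble the observed data. This style of inference is commonly known as approximate Bayesian computation, a name the summary itself does not use. Everything specific here is an assumption made for illustration and is not taken from the project's software: a single population of constant size, the infinite-sites mutation model, the number of segregating sites as the only summary statistic, a uniform prior on the scaled mutation rate `theta`, and all function and parameter names.

```python
import random


def sample_poisson(rng, mean):
    """Draw a Poisson variate by counting unit-rate arrivals before `mean`."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_segregating_sites(rng, n_genes, theta):
    """Simulate the number of segregating sites in a sample of n_genes.

    Assumes a constant-size population under the standard coalescent, with
    time scaled so that k lineages coalesce at rate k(k-1)/2 and mutations
    fall on each branch at rate theta/2 (infinite-sites model).
    """
    total_length = 0.0
    for k in range(n_genes, 1, -1):
        # Waiting time while k lineages remain, contributed by k branches.
        waiting_time = rng.expovariate(k * (k - 1) / 2.0)
        total_length += k * waiting_time
    return sample_poisson(rng, theta / 2.0 * total_length)


def abc_rejection(observed_s, n_genes, prior_max, n_sims, tolerance, seed=0):
    """Keep prior draws of theta whose simulated data resemble the observation."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_sims):
        theta = rng.uniform(0.0, prior_max)  # hypothetical uniform prior
        simulated_s = simulate_segregating_sites(rng, n_genes, theta)
        if abs(simulated_s - observed_s) <= tolerance:
            accepted.append(theta)
    return accepted


if __name__ == "__main__":
    # Illustrative values only; they are not data from the project.
    sample = abc_rejection(observed_s=10, n_genes=20, prior_max=20.0,
                           n_sims=5000, tolerance=1)
    print(len(sample), sum(sample) / len(sample))
```

The accepted values of `theta` form an approximate sample from its posterior distribution. Choosing among competing histories follows the same logic: each history is simulated in turn, and their relative probabilities are estimated from how often each one produces data close to the observation.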
