New Zealand Estonian French Research Exchange Scheme

Final Report Summary - NEFREX (New Zealand Estonian French Research Exchange Scheme)

Darwin wrote that a tree of languages might be a good proxy for the tree of biological descent, which, at that time, could only be inferred through the interpretation of phenotypes. Over the past twenty years, developments in molecular sequencing technologies, linguistic databases, computational methods and data processing capacity have greatly improved the potential to map the diversification of genes and languages through space and time. In particular, coalescent-based approaches to both linguistic and genetic data offer the possibility of comparing phylogenetic trees for the congruences suggested by Darwin. The wealth of cultural and genetic data now available, however, in conjunction with improved analytical approaches, provides the means to go beyond the simple comparison of historical relationships represented by dendritic patterns of branching.

This IRSES-funded project (NEFREX 2013-2016) was conceived to develop new data and methods for harnessing the legacy of the human past carried in the genome, shaped by demography and selection, and marrying it with the wealth of information contained in language about the social and technological processes that have contributed to the formation of contemporary patterns of cultural and genetic diversity. The current time depth for the joint analysis of genes and language centres on the Neolithic transition, but NEFREX also explored the possibility of extending the joint analysis of genes and culture to encompass the post-glacial expansions of people throughout Eurasia.

The approaches to these challenges taken by researchers working under the NEFREX umbrella utilise (i) full-likelihood Bayesian computational phylogenetic methods using the BEAST suite of software developed at the University of Auckland, (ii) haplotype-based methods of detecting past relationships between contemporary populations and (iii) the development of novel Approximate Bayesian Computation (ABC) methods suited to single and joint analysis via the use of summary statistics. Data sets used in the analyses include existing public databases of language and genetic diversity, pre-existing unpublished data, and new data produced during the lifetime of the project.
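As a rough illustration of the general idea behind approach (iii), the following is a minimal sketch of ABC rejection sampling on a toy problem. It is not the NEFREX software: the model (normally distributed data with unknown mean), the uniform prior, the choice of the sample mean as the summary statistic, the tolerance, and all function names are assumptions made purely for this example.

```python
import random
import statistics


def simulate(theta, n, rng):
    """Toy simulator (assumed): n draws from a normal with mean theta, sd 1."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def summary(data):
    """Summary statistic (assumed): the sample mean."""
    return statistics.fmean(data)


def abc_rejection(observed, prior_draw, n_draws, tolerance, rng):
    """Keep prior draws whose simulated summary lies close to the observed one."""
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_draw(rng)  # propose a parameter value from the prior
        s_sim = summary(simulate(theta, len(observed), rng))
        if abs(s_sim - s_obs) <= tolerance:  # compare summaries, not likelihoods
            accepted.append(theta)
    return accepted


rng = random.Random(42)
observed = simulate(2.0, 50, rng)  # stand-in for real data

# Hypothetical prior: uniform between -5 and 5.
posterior = abc_rejection(
    observed, lambda r: r.uniform(-5.0, 5.0), 20000, 0.1, rng
)
print(len(posterior), statistics.fmean(posterior))
```

The accepted values approximate the posterior distribution of the parameter without ever evaluating a likelihood, which is what makes this family of methods attractive when the simulator is complex. In a joint analysis, the single summary statistic would be replaced by a vector of statistics computed from both data types.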

Training and the acquisition of new skill sets in data processing were central to the exchanges of researchers between the ERA and the third partner country, and to the implementation of the full-likelihood aspects of the analysis. These were achieved through workshops, seminars, one-to-one training, and interaction with collaborators linked to both the ERA and the hosting institutions. Exchange visits provided interfaces with other specialists in human cultural evolution (archaeologists and linguists) to broaden the knowledge of ERA researchers and improve their ability to collaborate in the multi-disciplinary environment necessary to evaluate the independent lines of evidence required for testing hypotheses concerning human prehistory.

The development of likelihood-free Bayesian inference (ABC) methods resulted in two novel software programs: the first for use with the step change in the density of genetic data made available by the advent of whole genome sequencing; the second for the joint analysis of linguistic and genetic data sets from populations, to evaluate potential congruences and differences. Through participation in major projects on worldwide genetic diversity, ERA researchers also gained crucial experience in methods for identifying tracts of chromosomal DNA shared between genomes and using their length distribution to infer shared ancestry between groups, an expertise used in the case studies reported here.

There are five peer-reviewed publications directly associated with NEFREX through the activities of the ERA researchers. Two of these resulted from the sequencing of over 400 whole genomes from worldwide populations, led by the Estonian Biocentre. These papers examined signals related to selection, both biological and cultural, and past demographic movements of people, in the Y chromosome and the autosomes. Two of the ERA researchers concerned, Mait Metapalu and Monika Karmin, received the 2017 Estonian Science award for excellence in Molecular Biology. The data produced are destined for use with the ABC software developed for genomic sequence data.

Research projects that looked directly at the relationship between language and genetics included a comparison of communities speaking Indo-Aryan and Turkic languages in Central Asia, in which the resulting phylogenetic trees were compared for correlations with culture (pastoral and agricultural subsistence strategies) and used to detect cases of language shift in the past. The analysis of Uralic speakers using fine-scale analysis tools detected evidence for the sharing of genetic tracts between groups, and for Y chromosome similarities between groups, both of which remain after controlling for geography. A third study generated new data from the Society Isles in the central Pacific to better understand the settlement of this vast tract of ocean and to gain a different perspective on Polynesian origins within the Austronesian diaspora.

These studies of Uralic and Austronesian speakers are of particular interest because these linguistic families are potential exceptions to the language-farming hypothesis, which argues that the dispersal of all major language families was primarily driven by the spread of agriculture. This is a controversial area, but in the search for a possible third influencing factor such as technology, culture, or environment, the legacy of the past carried by language may be crucial. Consequently, this research relates to some of the major questions concerning human migration and how to distinguish between language shift and demic diffusion. This research is ongoing, based on the advances in data and computational methods achieved by the NEFREX project.

The improvements in reconstructing human demographic history witnessed under NEFREX have an important role in understanding the expression of non-infectious disease regimes in contemporary populations, such as Polynesians, who suffer from high levels of obesity-related diseases and cancer. Estimates of ancient demography, together with fine-scale analysis of source populations, can help to distinguish the role of stochastic processes in the emergence of disease-related genotypes. Within a recent time-frame, these are often polygenic in nature and likely to be linked to changes in environmental pressures, such as the diversity of pathogens experienced in ancestral groups, making detection by classical approaches problematic. Understanding the role of cultural choices, mediated through language, on genetic variation, together with detailed analysis of the effects of drift on the standing genetic variation between different groups, may be crucial to providing solutions to these issues in the future.

Website address:

Contact: Phillip Endicott (