CORDIS - EU research results

Language Evolution: The Empirical Turn

Final Report Summary - EVOLAEMP (Language Evolution: The Empirical Turn)

The project is located within the emerging field of computational historical linguistics. This budding sub-discipline of linguistics aims at reconstructing and modeling the linguistic past, as well as uncovering general laws and patterns governing language change, by computational means. Current work in this regard builds on the insight that language change shares crucial characteristics with biological evolution and mostly extrapolates models and tools from bioinformatics to the domain of language.

The Evolaemp project specifically aims at a deeper understanding of the commonalities and differences between biological evolution on the one hand and the cultural evolution of languages on the other. While both processes involve replication of discrete structures within populations (biomolecular sequences; sound sequences and grammatical rules) and diversification under reproductive/communicative isolation, the analogy breaks down in several crucial respects. The project focuses on two such aspects:

1. Sound change is fundamentally different from DNA mutations. Regular sound change generally affects all instances of a given sound, not just individual instances, and it is often conditioned by the phonetic context. Also, both regular and irregular sound change are conditioned by syntagmatic patterns and factors involving communicative functionality.
2. Horizontal transmission of linguistic features under contact, such as word borrowings, is a substantial driving force of language change. Standard methods of phylogenetic inference do not factor in these effects, which leads to systematic biases.
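
The contrast drawn in point 1 can be illustrated with a minimal Python sketch. The toy lexicon, the rule (k becomes tʃ before i), and both function names are hypothetical illustrations and are not taken from the project's data or software. A regular sound change is modeled as a context-conditioned rewrite applied to every matching sound in every word, whereas a point mutation alters one symbol at one position.

```python
import re

def regular_sound_change(words, pattern, replacement):
    """Apply one context-conditioned rule to every matching sound in every word."""
    rule = re.compile(pattern)
    return [rule.sub(replacement, word) for word in words]

def point_mutation(word, position, new_symbol):
    """Replace a single symbol at one position, as in a DNA point mutation."""
    return word[:position] + new_symbol + word[position + 1:]

# Hypothetical toy lexicon (not project data).
lexicon = ["kita", "kano", "aki", "kiki"]

# Hypothetical rule: k becomes tʃ, but only when the next sound is i.
changed = regular_sound_change(lexicon, r"k(?=i)", "tʃ")

# A point mutation touches exactly one position in one word.
mutated = point_mutation("kano", 0, "g")

print(changed)
print(mutated)
```

In this sketch every k before i changes, including both instances in "kiki", while the k in "kano" is left alone because the conditioning context is absent.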

Last but not least, standard bioinformatics tools rely on large databases of multiply aligned sequences. In computational historical linguistics, comparable resources still have to be established.

The project's activities centered on four sub-goals:

1. Compiling a high-quality database of IPA-transcribed word lists, dubbed NorthEuraLex, comprising translations of more than 1,000 concepts into more than 100 languages, mostly from Northern Eurasia. The selection of languages is designed to ensure complete coverage of one well-studied language family, Uralic, and a representative selection of languages that either have been in contact with it or are candidates for deep shared ancestry. The database has been made freely available to the community (http://northeuralex.org). It represents the most detailed and comprehensive data collection of this type currently in existence.
2. Adapting (pairwise and multiple) sequence alignment techniques from bioinformatics to the alignment of words, i.e. of phonetic strings. This is challenging since both universal patterns of sound change and language-specific peculiarities have to be taken into account. Also, the amount of data required to sufficiently train such a model is only partially available. The project developed two heuristic methods for phonetic sequence alignment. Both utilize the information content of individual segments or pairs thereof. The first method (PMI alignment, for Pointwise Mutual Information) captures universal tendencies of sound change and is applicable if the amount of data for each individual language is small. The other method (IWED alignment, for Information Weighted Edit Distance) is more suitable for long word lists (at least 500 words per language).
3. Automatically detecting borrowing events, i.e. loanwords and their source languages. We pursued two strategies in this regard. Utilizing phylogenetic inference, we performed ancestral state reconstruction of cognate classes and identified loci where homoplasies, i.e. parallel innovations, are inferred. These loci are candidates for borrowing events. In parallel, we applied the statistical technique of causal inference to detect language contact. This collection of methods discovers directed causal relationships between variables (in our case: languages), which are to be interpreted either as common ancestry or as contact. If a language is influenced by more than one other language, this constitutes proof of borrowing. This approach yielded surprisingly good results for various well-studied linguistic areas, such as the languages around the Baltic Sea (Finnic, Baltic, Slavic and Germanic).
4. Applying cutting-edge machine learning tools such as deep learning to automatically extract discrete features (also known as characters) from unannotated word lists. These discrete features, which share formal characteristics with DNA letters in bioinformatics, were used to automatically infer a World Tree of Languages, i.e. a family tree covering almost 7,000 languages and dialects. The tree is in very good accordance with established expert classifications but has a much wider scope than what can realistically be achieved manually.
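
To make the idea behind PMI-weighted alignment in sub-goal 2 more concrete, the following Python sketch estimates pointwise mutual information, PMI(a, b) = log(p(a, b) / (p(a) p(b))), from a list of already aligned sound pairs and uses the resulting scores in a standard Needleman-Wunsch alignment. It is a simplified illustration under stated assumptions and not the project's implementation: the training pairs are invented toy data, each character is treated as one segment, and the function names, the smoothing constant, the gap penalty, and the fallback score for unseen pairs are all hypothetical choices.

```python
import math
from collections import Counter

def pmi_scores(aligned_pairs, smoothing=0.5):
    """Estimate PMI(a, b) = log(p(a, b) / (p(a) * p(b))) from aligned segment pairs."""
    pair_counts = Counter()
    for a, b in aligned_pairs:
        # Count both orders so that the scores are symmetric.
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
    seg_counts = Counter()
    for (a, _), count in pair_counts.items():
        seg_counts[a] += count
    segments = sorted(seg_counts)
    total = sum(pair_counts.values())
    smoothed_total = total + smoothing * len(segments) ** 2
    scores = {}
    for a in segments:
        for b in segments:
            p_ab = (pair_counts[(a, b)] + smoothing) / smoothed_total
            p_a = seg_counts[a] / total
            p_b = seg_counts[b] / total
            scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores

def align(x, y, scores, gap=-1.0, unseen=-3.0):
    """Needleman-Wunsch alignment of two segment strings using PMI scores."""
    def score(a, b):
        return scores.get((a, b), unseen)

    n, m = len(x), len(y)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                dp[i - 1][j] + gap,
                dp[i][j - 1] + gap,
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    i, j = n, m
    pairs = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            pairs.append((x[i - 1], y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            pairs.append((x[i - 1], "-"))
            i -= 1
        else:
            pairs.append(("-", y[j - 1]))
            j -= 1
    return dp[n][m], pairs[::-1]

# Invented toy training data: sound pairs from hypothetical earlier alignments.
training = (
    [("p", "p")] * 6 + [("p", "b")] * 3 + [("a", "a")] * 8
    + [("t", "t")] * 5 + [("t", "d")] * 3 + [("a", "o")] * 2
    + [("b", "b")] * 2 + [("d", "d")] * 2 + [("o", "o")] * 2
)
scores = pmi_scores(training)
total_score, aligned = align("pata", "bada", scores)
print(aligned)
```

In this toy setting, sound pairs that co-occur more often than chance (such as p and b) receive positive scores and are preferred over gaps, while pairs never observed together (such as p and a) receive negative scores. A practical system would typically re-estimate the scores from the new alignments and repeat, which this sketch omits.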