
Cross-Linguistic statistical inference using hierarchical Bayesian models

Periodic Reporting for period 2 - CrossLingference (Cross-Linguistic statistical inference using hierarchical Bayesian models)

Reporting period: 2021-04-01 to 2022-09-30

Historical linguistics and typology belong to the core disciplines of the humanities. Diligent manual work over a period of two centuries has amassed detailed knowledge about the origin and evolution of several language families, especially the Indo-European languages.

Thanks to the adoption of techniques from computational biology and the massive growth in electronic language data available for a wide variety of languages, great strides have been made in recent years in complementing the traditional approach with computational modeling of language change. These quantitative approaches make it possible to detect signals about the prehistory of language families and general patterns of language change on a larger scale than classical methods permit. So far, computer-aided studies have largely been confined to individual language families. The CrossLingference project strives to construct statistical models of linguistic evolution that capture both universal patterns and the specifics of individual language families. Bayesian hierarchical modeling is the tool of choice because it makes it possible to derive the right balance between idiosyncratic and general phenomena in a data-driven fashion.
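
The balancing act performed by a hierarchical model can be illustrated with a minimal sketch. It is not the project's model: it assumes a simple normal-normal setup with known within-family variance (`sigma2`) and between-family variance (`tau2`), whereas a full Bayesian analysis would infer these from the data. The function name, the family labels and all numbers are hypothetical toy values.

```python
from statistics import mean


def partial_pooling(groups, sigma2, tau2):
    """Posterior means of family-level effects in a normal-normal model.

    groups: dict mapping a family name to a list of observations
    sigma2: within-family variance (assumed known for this sketch)
    tau2:   between-family variance (assumed known for this sketch)
    """
    # Global ("universal") mean: precision-weighted average of family means.
    weights = {g: 1.0 / (sigma2 / len(ys) + tau2) for g, ys in groups.items()}
    mu = sum(w * mean(groups[g]) for g, w in weights.items()) / sum(weights.values())

    estimates = {}
    for g, ys in groups.items():
        data_precision = len(ys) / sigma2   # grows with the amount of data
        prior_precision = 1.0 / tau2        # pull towards the global mean
        estimates[g] = (
            data_precision * mean(ys) + prior_precision * mu
        ) / (data_precision + prior_precision)
    return mu, estimates


# Toy data (hypothetical): one well-documented and one sparsely documented family.
toy = {
    "large_family": [1.0, 0.8, 1.2, 0.9, 1.1, 1.0, 0.9, 1.1],
    "small_family": [3.0],
}
mu, est = partial_pooling(toy, sigma2=1.0, tau2=0.5)
print(mu, est)
```

In this toy setting the estimate for the sparsely documented family is pulled strongly towards the global mean, while the well-documented family stays close to its own data. That is the data-driven trade-off between family-specific and general patterns described above.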

Society benefits in several ways. By collecting, standardizing and publishing large-scale cross-linguistic data sets and using them for analysis, the project acquires knowledge about under-documented indigenous languages from all continents and makes it available to the public. Furthermore, the study of language evolution is a model for the study of cultural evolution more broadly. Computational historical linguistics, as practiced by the CrossLingference project, therefore plays a pioneering role in the scientific and quantitative study of prehistory in a non-Eurocentric fashion. Last but not least, the project's work contributes to the advancement of multilingual natural language processing tools.

Furthermore, the project produces open-source software tools that may be useful for unrelated tasks in data science.

There are two overarching objectives of the project:

- gaining knowledge about contingent events in the history of specific languages, e.g. the sound changes that occurred during the evolution of the Uralic language family, or which words were borrowed from Turkic into Uralic languages in prehistoric times, and

- gaining a deeper understanding of the universal trends in language change and how they shape the diversity of the world's languages as they are spoken today.

The project's work addresses three scientific issues in statistical historical linguistics and typology:

1. What statistical models are suitable to explain the diachronic and synchronic diversity of human languages at the lexical and the grammatical level? How are these models to be implemented, and what data are needed to train them?
2. How reliable is inference with these models, and how can this be assessed?
3. What contingent properties and events can be inferred?

With regard to the first question, the project developed several advanced modeling frameworks combining (a) phylogenetic information to model the shared ancestry of languages, (b) spatial information (Gaussian processes; autologistic models) to accommodate the effects of language contact, and (c) a hierarchical structure to enable extrapolation between different language families and linguistic areas. The second question was addressed in two ways: (a) sensitivity analysis of typological inference using different priors and (b) missing data imputation in combination with cross-validation. Briefly put, our results indicate that our models perform well for prediction but are unreliable regarding the inference of latent variables, especially with regard to typology. The consequences of these findings for the future work of the project are currently being investigated. With regard to the third question, the main insight provided by the project is a vindication of the typological notion of word order universals across the languages of the world, thus superseding earlier findings in the literature which cast doubt on this notion.
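
The idea of assessing predictive performance by imputing held-out values under cross-validation can be sketched as follows. This is only an illustration of the evaluation scheme, not the project's models: the predictor is a deliberately naive baseline (majority value within the same family, falling back to the global majority), and the function names, language labels and feature values are hypothetical.

```python
import random
from collections import Counter


def majority(values, default=None):
    """Most frequent value in a list, or `default` if the list is empty."""
    return Counter(values).most_common(1)[0][0] if values else default


def cross_validated_imputation_accuracy(records, k=5, seed=0):
    """Share of held-out feature values recovered by a naive baseline.

    records: list of (language, family, value) tuples, observed values only
    k:       number of cross-validation folds
    """
    rng = random.Random(seed)
    indices = list(range(len(records)))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]

    correct = 0
    for fold in folds:
        held_out = set(fold)
        # Treat the fold as missing and keep the rest as training data.
        train = [records[i] for i in indices if i not in held_out]
        global_guess = majority([v for _, _, v in train])
        for i in fold:
            _, family, truth = records[i]
            family_values = [v for _, fam, v in train if fam == family]
            guess = majority(family_values, default=global_guess)
            correct += guess == truth
    return correct / len(records)


# Toy data (hypothetical): a binary word-order feature in two families.
toy = [(f"A{i}", "FamilyA", "VO") for i in range(6)]
toy += [(f"B{i}", "FamilyB", "OV") for i in range(6)]
print(cross_validated_imputation_accuracy(toy, k=3))
```

A score of this kind measures predictive performance only. As noted above, good prediction does not by itself guarantee reliable inference of latent variables, which has to be probed separately, for example by sensitivity analysis over priors.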

Close to the end of the reporting period, we achieved a major breakthrough by leveraging artificial neural networks (more specifically, a sequence-to-sequence autoencoder) for data preprocessing. This enables us to faithfully convert discrete primary data into high-dimensional continuous data, which in turn allows the deployment of multivariate statistical models that make full use of the information in the data. The models currently in use, both within the project and within the field, are based on discrete variables and use only a fraction of the information in the data. Utilizing this new approach will require us to re-assess the results listed in the preceding section. It seems likely that the partially sobering theoretical results mentioned there will not hold, potentially leading to a paradigm shift in the statistical modeling of language diversity.