CORDIS - EU research results

Cross-Linguistic statistical inference using hierarchical Bayesian models

Periodic Reporting for period 4 - CrossLingference (Cross-Linguistic statistical inference using hierarchical Bayesian models)

Reporting period: 2024-04-01 to 2025-09-30

Historical linguistics and linguistic typology are core disciplines of the humanities. Over the past two centuries, meticulous comparative work has produced detailed knowledge about the origin and evolution of several language families, especially Indo-European. In recent years, the combination of computational methods—many of them adapted from evolutionary biology—with the massive growth of digitally available linguistic data has opened up fundamentally new possibilities. These quantitative approaches make it possible to detect signals of prehistory and to identify general patterns of language change on a scale that goes far beyond what is possible with traditional qualitative methods.

So far, most computational studies have focused on individual language families in isolation. The CrossLingference project seeks to go a decisive step further by developing statistical models of linguistic evolution that capture both universal patterns and the family-specific trajectories of language change. Bayesian hierarchical modeling plays a central role, as it allows us to infer the appropriate balance between idiosyncratic developments and cross-linguistic regularities in a principled, data-driven way.


The societal relevance is threefold. (i) By collecting, standardizing, and publishing large-scale cross-linguistic datasets, the project contributes to documenting and preserving many under-described indigenous languages. (ii) The study of language evolution provides a model for cultural evolution more broadly, offering quantitative tools for studying aspects of human prehistory in a non-Eurocentric and globally comparative fashion. (iii) The project develops open-source software tools—ranging from cognate detection algorithms to phylogenetic inference pipelines—that are useful not only within linguistics but also for applications in multilingual natural language processing and other data-science domains.

Conclusions of the action.
During the final reporting period, CrossLingference successfully developed and evaluated a suite of Bayesian and machine-learning models for lexical and typological inference. It produced several open cross-linguistic datasets, enhanced tools for cognate detection and phylogenetic reconstruction, and contributed methodological advances such as cross-family inference and Gaussian Process modeling. Together, these outcomes significantly advance the computational study of language evolution and provide a solid empirical and methodological foundation for future large-scale comparative research.

The project addressed three central scientific challenges in statistical historical linguistics and linguistic typology:

(1) the development of statistical models capable of explaining diachronic and synchronic linguistic diversity at both the lexical and grammatical levels;
(2) the assessment of the reliability and limits of inference with such models; and
(3) the identification of family-specific historical events, such as sound changes, language contact, or borrowing processes, in a principled computational framework.

(1) Model development.
Over the course of the project, several advanced modeling frameworks were developed and implemented. These include Bayesian phylogenetic models that integrate information about shared ancestry, spatially informed models using Gaussian Processes and autoregressive formulations to capture contact effects, and hierarchical models that enable information to be shared across families and macroareas. In addition, the project introduced computational pipelines for large-scale cognate detection, probabilistic models for sound change and lexical evolution, and hierarchical Ornstein-Uhlenbeck (OU) and continuous-time Markov chain (CTMC) models for typological features. These methods were evaluated on a broad set of datasets, including Lexibank, Grambank, ASJP, and multiple family-specific lexical resources.
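
To make the CTMC idea concrete, the following minimal sketch computes transition probabilities for a single binary typological feature evolving along a branch. It is an illustration under simplifying assumptions, not the project's implementation: the function name `two_state_ctmc` and the example rates are hypothetical, and a hierarchical version would additionally draw family-specific rates from a shared distribution.

```python
import math


def two_state_ctmc(gain, loss, t):
    """Transition matrix P(t) of a two-state CTMC (0 = absent, 1 = present).

    gain: rate of 0 -> 1 changes, loss: rate of 1 -> 0 changes,
    t: branch length in the same time unit as the rates.
    Entry [i][j] is the probability of ending in state j given start state i.
    """
    total = gain + loss
    decay = math.exp(-total * t)   # how much of the starting state is "remembered"
    pi1 = gain / total             # stationary probability of state 1
    pi0 = loss / total             # stationary probability of state 0
    return [
        [pi0 + pi1 * decay, pi1 * (1.0 - decay)],
        [pi0 * (1.0 - decay), pi1 + pi0 * decay],
    ]


# Hypothetical rates (changes per millennium) and a branch of 2 millennia.
P = two_state_ctmc(gain=0.3, loss=0.1, t=2.0)
```

Each row of `P` sums to one, and for long branches both rows approach the stationary distribution `(loss, gain) / (gain + loss)`, which is why deep ancestral states are hard to recover from present-day data alone.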

(2) Model assessment and validation.
Inference quality was systematically investigated using sensitivity analyses, prior predictive and posterior predictive checks, missing-data imputation experiments, and cross-validation. The results demonstrate that the models perform strongly in prediction, particularly in reconstructing missing lexical or typological data, but are more limited when it comes to inferring latent historical variables such as ancestral states or unobserved contact scenarios. These findings qualify earlier optimistic assumptions about typological inference and have led to a more rigorous understanding of the conditions under which reliable inference is possible.
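
The general shape of a missing-data imputation experiment can be sketched as follows: hide a random subset of observed feature values, predict them from the rest, and score the predictions. This is a simplified illustration, not the project's evaluation code. The function name, the majority-value baseline predictor, and the example values are all assumptions; in practice the predictor would be the fitted model.

```python
import random
from collections import Counter


def masked_imputation_accuracy(values, holdout_fraction=0.2, seed=0):
    """Hold out some observed values and score a majority-value baseline.

    values: one feature value per language, None where undocumented.
    Assumes at least two observed values so the training set is non-empty.
    """
    rng = random.Random(seed)
    observed = [i for i, v in enumerate(values) if v is not None]
    n_holdout = max(1, int(len(observed) * holdout_fraction))
    held_out = set(rng.sample(observed, n_holdout))

    # Baseline predictor: the most frequent value among the remaining languages.
    training = [values[i] for i in observed if i not in held_out]
    prediction = Counter(training).most_common(1)[0][0]

    hits = sum(values[i] == prediction for i in held_out)
    return hits / n_holdout


# Hypothetical word-order codings with some undocumented languages.
word_order = ["SOV", "SOV", "SVO", "SOV", None, "VSO", "SOV", "SVO", None, "SOV"]
accuracy = masked_imputation_accuracy(word_order)
```

A model is informative for prediction to the extent that it beats such a baseline on the held-out cells. Note that this kind of check says nothing about latent quantities such as ancestral states, for which no held-out ground truth exists.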


Results, exploitation, and dissemination.
The project produced a substantial set of openly available research outputs, including peer-reviewed articles, family-specific datasets, cognate detection resources, and software tools for phylogenetic inference. All datasets and code repositories have been released as open access. They are already being used by the wider community, and they form a foundation for follow-up projects. Results were disseminated through journal publications, conference presentations, workshops, invited talks, and contributions to international research infrastructures such as Lexibank and CLDF.

The first major innovation is the development of hierarchical Bayesian phylogenetic comparative models for typology. Unlike standard phylogenetic models that treat each language family in isolation, the new framework allows information to be shared across families and macro-areas. This hierarchical structure makes it possible to simultaneously estimate universal tendencies in language change and family-specific idiosyncrasies, while properly accounting for genealogical relatedness and uncertainty in phylogenetic structure. The new framework has yielded more stable estimates of cross-linguistic universals.
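
The way a hierarchical structure shares information across families can be illustrated with the textbook normal-normal model, in which each family-level estimate is pulled toward a global mean in proportion to how little data the family provides. The sketch below is a deliberately simplified stand-in, not the project's phylogenetic model: it ignores tree structure, treats the variances as known, and the function and parameter names are hypothetical.

```python
def partial_pool(family_mean, n_languages, global_mean, within_var, between_var):
    """Posterior mean of a family-level parameter under a normal-normal model.

    family_mean: raw average of the parameter within one family
    n_languages: number of languages contributing to that average
    global_mean: cross-family (universal) mean
    within_var:  variance of languages around their family mean (assumed known)
    between_var: variance of family means around the global mean (assumed known)
    """
    data_precision = n_languages / within_var
    prior_precision = 1.0 / between_var
    weight = data_precision / (data_precision + prior_precision)
    # Weighted compromise between family-specific evidence and the universal trend.
    return weight * family_mean + (1.0 - weight) * global_mean


# A small family is shrunk strongly toward the global mean, a large one barely.
small_family = partial_pool(1.0, n_languages=2, global_mean=0.0,
                            within_var=1.0, between_var=1.0)
large_family = partial_pool(1.0, n_languages=200, global_mean=0.0,
                            within_var=1.0, between_var=1.0)
```

In a full hierarchical Bayesian model the between-family variance is itself estimated, so the data decide how much weight universal tendencies receive relative to family-specific behavior.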

A second major innovation is the application of Gaussian Process models to represent spatial and areal effects in language evolution. These models provide a principled way to quantify the contribution of geographical proximity and contact to observed linguistic patterns, and to disentangle contact-induced similarity from genealogical signals. The use of Gaussian Processes allows spatial influence to be modeled flexibly and non-parametrically, avoiding the oversimplifications of earlier distance-based or areal-dummy approaches. This has led to more fine-grained and empirically grounded analyses of contact zones, especially in regions where genealogical and areal patterns interact strongly.
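
As a rough illustration of how a Gaussian Process encodes spatial proximity, the sketch below builds a covariance matrix over language locations from great-circle distances. The exponential kernel, the parameter values, and the coordinates are assumptions chosen for illustration; the source does not state which kernel the project used.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def spatial_covariance(coords, variance=1.0, length_scale_km=500.0):
    """Exponential-kernel covariance matrix over (latitude, longitude) pairs.

    Covariance decays smoothly with distance: languages closer than the length
    scale are expected to resemble each other more than distant ones.
    """
    return [
        [variance * math.exp(-haversine_km(*a, *b) / length_scale_km) for b in coords]
        for a in coords
    ]


# Hypothetical language locations as (latitude, longitude).
coords = [(0.0, 0.0), (0.0, 1.0), (0.0, 30.0)]
K = spatial_covariance(coords)
```

Here `K[0][1]` is much larger than `K[0][2]`, reflecting that the first two locations are about 111 km apart while the third is over 3,000 km away. Estimating `length_scale_km` from data is what lets such a model quantify how far contact effects reach, instead of fixing areas in advance.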

Together, these two methodological advances have enabled a more integrated view of language evolution, one in which genealogical, spatial, and universal factors can be jointly estimated rather than treated as competing explanations. Several empirical case studies conducted during the project (e.g. on word order universals, phoneme inventory structure, and areal effects in Africa and the Americas) demonstrate that these models outperform traditional non-hierarchical or non-spatial methods in both predictive accuracy and interpretability.

In addition, the project developed complementary computational tools, including improved pipelines for cognate detection, probabilistic models of sound change, and continuous-space representations of lexical data. While these components were not the primary methodological focus, they further contributed to improving the empirical basis for large-scale historical inference.
Map of South America showing the geographic distribution of Tupí-Guaraní languages together with inf