Productive Signs. A Computer-Assisted Analysis of Evolutionary, Typological, and Cognitive Dimensions of Word Families

Project Information

ProduSemy

Grant agreement ID: 101044282

DOI

10.3030/101044282

EC signature date 19 April 2022

Start date 1 January 2023

End date 31 December 2027

Funded under

European Research Council (ERC)

Total cost

€ 2 000 000,00

EU contribution

€ 2 000 000,00


Coordinated by

UNIVERSITAT PASSAU
Germany

Periodic Reporting for period 1 - ProduSemy (Productive Signs. A Computer-Assisted Analysis of Evolutionary, Typological, and Cognitive Dimensions of Word Families)

Reporting period: 2023-01-01 to 2025-06-30

Words which share a common origin within and across different languages are called word families. Through dynamics of language use, these families interact and evolve, a fact that remains largely ignored in the language sciences. The EU-funded ProduSemy project will create computer models to standardise word family data across languages. The models will be applied to data from historical, typological, and cognitive linguistics and help to learn more about the numerous ways in which word families are composed and structured in these disciplines. In this way, the project will contribute to the integration of methods and data in linguistics, cognitive science and psychology.

The major work packages of the project include Modeling, Inference, Analysis, Output, and Outreach.

With respect to modeling, we have provided first models for the handling of morphemes and language-internal cognates in multilingual wordlists (List forthcoming, submitted in 2024) and in specific languages like Chinese (Pulini and List 2024). With the publication of the NoRaRe database, we laid the foundation for a successful handling of semantic data in cross-linguistic applications (Tjuka et al. 2023). In order to handle uncertainty in cross-linguistic annotations, we proposed new models for the handling of cognate sets (List et al. 2023). In order to handle sound sequences more realistically, we proposed new annotation standards (List et al. 2024) and presented methods to model sounds as vectors (Rubehn et al. 2024).

With respect to inference, List (2023) presented a new method for the inference of partial colexifications from multilingual wordlists, which was further tested on new data by Bocklage et al. (forthcoming) and Tjuka and List (2024). In a study by Miller and List (2023), we presented a new method for the detection of lexical borrowings from dominant languages. In order to improve the inference of word families in wordlists, we proposed a new method that trims phonetic alignments in order to identify correspondence patterns in multilingual wordlists (Blum and List 2023). We have also put our inference methods to concrete use in developing cross-linguistic datasets for various language families, including Rgyalrongic languages (Lai and List 2023), languages from Lowland South America (Blum et al. 2024a), and historical languages from India and Africa (Forkel et al. 2024). We also introduced first methods that help to extend wordlist collections from existing resources (Blum et al. 2024b), and presented a registered report testing deep relations between Panoan and Tacanan languages (Blum et al. 2024c).

With respect to analysis, we published a new software suite, EDICTOR 3, that facilitates the computer-assisted analysis of cross-linguistic data (List and van Dam 2024) and applied the tool in a phylogenetic study on Tibetic languages (Dhakal et al. 2024). In order to assess the robustness of cross-linguistic datasets, we conducted a larger study on phoneme inventories and how they compare when compiled by different scholars for the same languages (Anderson et al. 2023). We investigated lexical semantics and lexical motivation patterns in body part semantics (Tjuka et al. 2024, Tjuka and List 2024) and laid the foundation for an improved analysis of mutual intelligibility among closely related languages and dialects (Nieder and List 2024). In a larger essay, the PI laid out open problems for the field of computational historical linguistics in general and the research project in particular (List 2024).
The project organized its startup workshop as part of a dedicated Focus Stream on Productive Signs at the International Conference of Linguists in Poznań (August 8 to 14). With respect to the output, the project published 24 papers (all peer reviewed) in journals and conference proceedings, and members of the project gave 16 talks at workshops and conferences. The project hosts a scientific blog, which is also published as a non-peer-reviewed journal for short open tutorials, using the OJS system for journal management at the University of Passau (Computer-Assisted Language Comparison in Practice, https://calc.hypotheses.org and https://osj3.uni-passau.de/index.php/calcip). The blog receives monthly contributions, most of them written by members of the project, resulting in 24 tutorials and short data notes published in addition to the peer-reviewed contributions.
With respect to outreach to a larger public, the PI writes monthly blog posts in German on popular science topics (https://wub.hypotheses.org), resulting in another 24 contributions that present the work of the project to laypeople, with a readership of between 300 and 800 people per month. The PI was also interviewed on the origin of human languages (Planet Wissen, “Sprachwunder Mensch”) in May 2023. Towards the end of 2024, the PI gave an interview to the German Press Agency (DPA), which was published by several hundred German newspapers in print and online.

The project developed a new methodology, which our description of action names as an important achievement, by establishing an algorithm for the inference of partial colexifications from multilingual wordlists. This algorithm was described in a publication by List (2023) and is accompanied by an open software package.
Novel methodologies that were developed during the reporting period but not yet published within it include new methods for the creation of concept embeddings (published as a preprint by Rubehn and List 2025) and a novel method for the affiliation of languages to language families (published as a preprint by Blum et al. 2025).
These novel methodologies reflect the great innovative potential of the project to provide new solutions for outstanding problems in computational historical linguistics.
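
To make the notion of partial colexification more concrete, the following toy sketch in Python looks for cases in which the word form for one concept is contained, as a prefix or suffix, in the word form for another concept in the same language. This is an illustration only and not the algorithm published by List (2023) or its software package: the function name, the `min_len` parameter, the plain orthographic strings, and the tiny wordlist are all assumptions made for this example, whereas real multilingual wordlists typically store segmented phonetic transcriptions.

```python
from collections import defaultdict

# Toy wordlist (assumed for illustration): language -> concept -> word form.
WORDLIST = {
    "German": {
        "HAND": "hand",
        "GLOVE": "handschuh",
        "FINGER": "finger",
        "FINGERNAIL": "fingernagel",
    },
    "English": {
        "HAND": "hand",
        "GLOVE": "glove",
        "FINGER": "finger",
        "FINGERNAIL": "fingernail",
    },
}


def affix_colexifications(wordlist, min_len=3):
    """Collect concept pairs (a, b) where the form for a is a proper
    prefix or suffix of the form for b within the same language.

    Returns a dict mapping each pair to the sorted list of languages
    in which the pattern was found.
    """
    found = defaultdict(set)
    for language, entries in wordlist.items():
        for concept_a, form_a in entries.items():
            # Skip very short forms, which match by chance too easily.
            if len(form_a) < min_len:
                continue
            for concept_b, form_b in entries.items():
                if concept_a == concept_b or len(form_a) >= len(form_b):
                    continue
                if form_b.startswith(form_a) or form_b.endswith(form_a):
                    found[(concept_a, concept_b)].add(language)
    return {pair: sorted(langs) for pair, langs in found.items()}


result = affix_colexifications(WORDLIST)
for (concept_a, concept_b), languages in sorted(result.items()):
    print(f"{concept_a} is part of {concept_b} in: {', '.join(languages)}")
```

In this toy data, the pair FINGER and FINGERNAIL recurs in both languages, while HAND and GLOVE is found in German only. Counting how many languages share such a pattern is what would allow it to be compared across a larger sample.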

One of our most significant achievements, which we worked to bring to publication during the first two years of the project, consists of new datasets, as reflected in the Lexibank repository, which was published in Version 2.0 in early 2025 (the work was carried out throughout the first reporting period of the project). This repository now covers more than 2000 different languages. A similar repository is CLICS⁴, a collection of data on cross-linguistic colexifications that now covers more than 2000 languages and was published in early 2025 (Tjuka et al. 2025).
Another great achievement is the method for the inference of partial colexifications from multilingual wordlists, as outlined in List (2023), particularly because this method has considerable potential to be applied in other contexts, as illustrated by our novel concept embedding methodology, mentioned in § 1.2 (Rubehn and List 2025, preprint).
A third great achievement of the project was to finish the EDICTOR interactive tool for the curation of cross-linguistic data, which has now been published in a new and stable version 3.0 with many new features (https://edictor.org; see also List and van Dam 2024).

Our work on partial colexifications (List 2023, Tjuka and List 2024) can be seen as a breakthrough, since it contains a completely novel method, with clear new insights and a lot of potential to inspire additional methods and analyses in the future.

Word families and the ways in which they are created.
