Periodic Reporting for period 1 - ProduSemy (Productive Signs. A Computer-Assisted Analysis of Evolutionary, Typological, and Cognitive Dimensions of Word Families)
Reporting period: 2023-01-01 to 2025-06-30
linguistics, cognitive science and psychology.
With respect to modeling, we have provided first models for the handling of morphemes and language-internal cognates in multilingual wordlists (List forthcoming, submitted in 2024) and in specific languages like Chinese (Pulini and List 2024). With the publication of the NoRaRe database, we laid the foundation for a successful handling of semantic data in cross-linguistic applications (Tjuka et al. 2023). In order to handle uncertainty in cross-linguistic annotations, we proposed new models for the handling of cognate sets (List et al. 2023). In order to handle sound sequences more realistically, we proposed new annotation standards (List et al. 2024) and presented methods to model sounds as vectors (Rubehn et al. 2024).
With respect to inference, List (2023) presented a new method for the inference of partial colexifications from multilingual wordlists, which was further tested on new data by Bocklage et al. (forthcoming) and Tjuka and List (2024). In a study by Miller and List (2023), we presented a new method for the detection of lexical borrowings from dominant languages. In order to improve the inference of word families in wordlists, we proposed a new method that trims phonetic alignments in order to identify correspondence patterns in multilingual wordlists (Blum and List 2023). We have also put our inference methods to concrete use in developing cross-linguistic datasets for various language families, including Rgyalrongic languages (Lai and List 2023), languages from Lowland South America (Blum et al. 2024a), and historical languages from India and Africa (Forkel et al. 2024). We also introduced first methods that help to extend wordlist collections from existing resources (Blum et al. 2024b), and presented a registered report testing deep relations between Panoan and Tacanan languages (Blum et al. 2024c).
With respect to analysis, we published a new software suite, EDICTOR 3, that facilitates the computer-assisted analysis of cross-linguistic data (List and van Dam 2024) and applied the tool in a phylogenetic study on Tibetic languages (Dhakal et al. 2024). In order to assess the robustness of cross-linguistic datasets, we conducted a larger study on phoneme inventories and how they compare when compiled by different scholars for the same languages (Anderson et al. 2023). We investigated lexical semantics and lexical motivation patterns in body part semantics (Tjuka et al. 2024, Tjuka and List 2024) and laid the foundation for an improved analysis of mutual intelligibility among closely related languages and dialects (Nieder and List 2024). In a larger essay, the PI laid out open problems for the field of computational historical linguistics in general and the research project in particular (List 2024).
The project organized its startup workshop as part of a dedicated Focus Stream on Productive Signs at the International Conference of Linguists in Poznań (August 8 to 14). With respect to the output, the project published 24 papers (all peer reviewed) in journals and conference proceedings, and members of the project gave 16 talks at workshops and conferences. The project hosts a scientific blog, which is also published as a non-peer-reviewed journal for short open tutorials, using the OJS system for journal management at the University of Passau (Computer-Assisted Language Comparison in Practice, https://calc.hypotheses.org, https://osj3.uni-passau.de/index.php/calcip). This blog has monthly contributions, most written by members of the project, resulting in 24 tutorials and short data notes that have been published in addition to the peer-reviewed contributions.
With respect to outreach to a larger scientific public, the PI writes monthly blog posts in German that target popular science topics (https://wub.hypotheses.org), resulting in another 24 contributions presenting the work of the project to laypeople, with a readership of between 300 and 800 people per month. The PI was also interviewed on the origin of human languages (Planet Wissen, “Sprachwunder Mensch”) in May 2023. Towards the end of 2024, the PI gave an interview to the German Press Agency (DPA), which was published by several hundred German newspapers in print and online.
Novel methodologies that were developed but not yet published in the reporting period include new methods for the creation of concept embeddings (published as a preprint by Rubehn and List 2025), and a novel method for the affiliation of languages to language families (published as a preprint by Blum et al. 2025).
These novel methodologies reflect the great innovative potential of the project to provide new solutions for outstanding problems in computational historical linguistics.
One of our most significant achievements, which we worked to bring to publication during the first two years of the project, consists in new datasets, as reflected in the Lexibank repository, which was published in Version 2.0 in early 2025 (work was carried out throughout the whole first reporting period of the project). This repository now contains more than 2000 different languages. A similar repository is CLICS⁴, a collection of data on cross-linguistic colexifications that now contains more than 2000 languages and was published in early 2025 (Tjuka et al. 2025).
Another great achievement is the method for the inference of partial colexifications from multilingual wordlists, as outlined in List (2023), in particular because this method has considerable potential to be applied in other contexts, as illustrated by our novel concept embedding methodology, mentioned in § 1.2 (Rubehn and List 2025, preprint).
A third great achievement of the project was to finish the EDICTOR interactive tool for the curation of cross-linguistic data, which has now been published in a new and stable version 3.0 with many new features (https://edictor.org, see also List and van Dam 2024).
Our work on partial colexifications (List 2023, Tjuka and List 2024) can be seen as a breakthrough, since it contains a completely novel method, with clear new insights and a lot of potential to inspire additional methods and analyses in the future.