Computer-Assisted Language Comparison: Reconciling Computational and Classical Approaches in Historical Linguistics

Periodic Reporting for period 4 - CALC (Computer-Assisted Language Comparison: Reconciling Computational and Classical Approaches in Historical Linguistics)

Reporting period: 2021-10-01 to 2022-03-31

Our research project aims to bridge the gap between traditional (classical, non-technical) approaches to historical language comparison and the new, quantitative, computational, and highly technical approaches that have been proposed during the past 20 years. The situation has by now reached a dead end, in which scholars representing classical historical linguistics and scholars representing computational historical linguistics often mistrust each other heavily. Our project’s major goal is therefore to bridge the differences between the disciplines by embracing both classical and computational scholarship and emphasizing that the two can work together excellently, provided that the importance of both manual and automatic approaches is acknowledged and that the two are mediated in a computer-assisted rather than a purely computer-based framework.

The importance of our project is threefold. First, by enhancing the ways in which languages can be compared, we contribute actively to shedding light on the history of the world’s languages, which indirectly also sheds light on human history in general. Second, since there are tight connections between languages and cultures, our project contributes actively to attempts to understand and explain linguistic and cultural diversity on earth. Third, since languages also reflect human cognition, the methods we produce offer a new perspective on human perception and cognition through the lens of the numerous, highly diverse human languages spoken today.

Our specific approach to bringing a computer-assisted framework of historical language comparison to life consists in producing data that is human- and machine-readable at the same time. We use specific interfaces that allow humans to access and correct the data produced by our software, while at the same time making sure that any human correction adheres to our standards. This basic principle is addressed in our project in multiple ways. We produce new algorithms to automatically compare the languages of the world, as well as interfaces and interactive applications that present linguistic data in a visually appealing way, helping humans to detect patterns. Last but not least, we actively standardize lexical data of the languages of the world and contribute to the big task of preserving the heritage of the more than 7000 languages currently spoken on earth.
The project has been very successful so far, as reflected in 21 publications that have already appeared and four more that have been accepted. In addition, we created and expanded four databases that are publicly shared and allow scholars from all over the world to investigate our findings qualitatively and quantitatively. We proposed new standards for the handling of lexical data on a cross-linguistic scale, and published regular updates to six software libraries that scholars can use to perform automated tasks in linguistics and neighboring disciplines. To allow scientists to make the best use of our software, we gave tutorials and introductions to our systems on many occasions, especially in countries where these methods are not yet frequently used, and we regularly published blog posts illustrating how certain problems can be solved.

Apart from the scientific output of the project, which is also reflected in talks given at scientific events, we managed to share our findings with a larger public, as reflected in the numerous interviews given by the principal investigator, the press releases, and the regular blog posts written by the members of the groups, in which scientific results were discussed for a general audience and software was explained for interested scientists.

As one of its major contributions, the project led to a first proposal regarding the age and the dispersal of the Sino-Tibetan languages, dating the family to around 7200 years before present and proposing North China as its original location. These findings may be revised in the future, but as of now, they seem to be best supported by the current state of the art, as reflected in a newly established dataset of 50 Sino-Tibetan language varieties and the use of state-of-the-art software for annotation and analysis.

In addition, we have achieved all major goals outlined in our project up to this time: we proposed new methods for the unification of data in our field, developed new algorithms to handle difficult and novel tasks that could not so far be computed consistently, and carried out the initial planning of future tasks that will be handled during the second part of the project. The work on tools and interfaces has resulted in the publication of a first prototype that has already been used to handle the data underlying our major study on the age and dispersal of the Sino-Tibetan languages. With respect to our work on the creation and dissemination of linguistic data, we have presented a large collection of high-quality datasets for a large number of the world’s languages.
We expect to follow our plan with only minor deviations and to be able to a) illustrate the feasibility of the framework we envision, b) present new solutions to outstanding problems in the field of computational linguistics (specifically the handling of lexical borrowing), and c) illustrate the benefits of cross-linguistic empirical research with respect to the three major lines of evidence it provides, namely human history, human culture, and human cognition.
Dispersal of Sino-Tibetan languages as inferred in the project's publication by Sagart et al. (2019)