Periodic Reporting for period 4 - CALC (Computer-Assisted Language Comparison: Reconciling Computational and Classical Approaches in Historical Linguistics)
Reporting period: 2021-10-01 to 2022-03-31
The importance of our project is threefold. First, by enhancing the ways in which languages can be compared, we contribute actively to shedding light on the history of the world’s languages, which indirectly also sheds light on human history in general. Second, since there are tight connections between languages and cultures, our project contributes to attempts to understand and explain linguistic and cultural diversity on earth. Third, since languages also reflect human cognition, the methods we produce offer a new perspective on human perception and cognition through the lens of the numerous highly diverse human languages currently spoken.
Our specific approach to bringing a computer-assisted framework of historical language comparison to life consists in producing data that is human- and machine-readable at the same time. We use specific interfaces that allow humans to access and correct the data produced by our software, while at the same time making sure that any human correction adheres to our standards. This basic principle is addressed in our project in multiple ways. We produce new algorithms to automatically compare the languages of the world, as well as interfaces and interactive applications that present linguistic data in a visually appealing way, helping humans to detect patterns. Last, but not least, we actively standardize lexical data of the languages of the world and contribute to the big task of preserving the heritage of the more than 7000 languages currently spoken on earth.
Apart from the scientific output of the project, which is also reflected in talks given at scientific events, we managed to share our findings with a wider public. This is reflected in the numerous interviews given by the principal investigator, in the press releases, and in the regular blog posts written by the members of the group, in which scientific results were discussed for a broader audience and software was explained for interested scientists.
As one of its major contributions, the project led to a first proposal regarding the age and dispersal of the Sino-Tibetan languages, dating the family to around 7200 years before present and proposing North China as its original location. These findings may be revised in the future, but for now they seem to be best supported by the current state of the art, as reflected in a newly established dataset of 50 Sino-Tibetan language varieties and the use of state-of-the-art software for annotation and analysis.
In addition, we have achieved all major goals outlined in our project up to this time: we have proposed new methods for the unification of data in our field, developed new algorithms to handle difficult and novel tasks that could not previously be computed consistently, and begun planning the future tasks that will be handled during the second part of the project. The work on tools and interfaces has resulted in the publication of a first prototype that has already been used to handle the data underlying our major study on the age and dispersal of the Sino-Tibetan languages. With respect to our work on the creation and dissemination of linguistic data, we have presented a large collection of high-quality datasets for a large number of the world’s languages.