CORDIS - EU research results

Measuring Convergence and Divergence in Varieties of Chinese: A Lectometric Approach

Periodic Reporting for period 1 - CHILECTO (Measuring Convergence and Divergence in Varieties of Chinese: A Lectometric Approach)

Reporting period: 2018-09-01 to 2020-08-31

The overall objective of the project is to explore processes of convergence and divergence among lexicons in the varieties of Chinese by using a quantitative lectometric approach. More specifically, the following three scientific objectives were addressed.

Firstly, we aimed to evaluate lectometric measurements and establish a methodological guideline with which lexical variation among lects can be structured more comprehensively and cost-efficiently.

The second objective was to identify the multifactorial variational structure that underlies the Chinese lexicon in written texts from a synchronic point of view.

The third objective was to examine how the stratificational distances among lects change progressively over time, and to compare what we find for Chinese with what we know about the corresponding processes in Europe.

CHILECTO is the first project to explore lexical variation in the varieties of Chinese through large-scale quantitative aggregation. It goes beyond traditional approaches to language variation by using computational and statistical methods to identify contextual clues and to classify language usage. The project explores different processes of convergence and divergence among varieties of the same language and compares the standard language situation in the Pan-Chinese region with that in Europe. The project findings may raise awareness among language policymakers and decision-makers of empirically tested linguistic evidence, which plays an important role in the strategic deployment of language planning and policy.
To achieve the project objectives, we conducted six scientific and training work packages.

In WP1, the research fellow established a Personal Career Development Plan and a Data Management Plan. In WP2, the research fellow received structured training to improve her scientific knowledge and transferable skills.

In WP3, we first carried out a comprehensive literature review of the methodological state of the art in lectometry research and provided a more in-depth survey of the technical aspects of distributional semantic techniques. We then established a methodological guideline for corpus-based lexical lectometry research.

WP4 involved two case studies designed to measure the linguistic distances between the three language varieties of Chinese within a given time period. In the “lexical variation” case study, we combined vector-driven and resource-driven selection of profiles for a lexical variable. Token-level word space models were employed for word sense disambiguation. We then calculated word choice uniformity across lects by aggregating over multiple concepts, both with and without concept relative frequency as a weight. In the “grammatical variation” case study, we explored the alternation in Chinese analytic causative constructions with the markers shi, ling and rang through data analytics. Twenty-seven variables drawn from the literature were tested to examine the syntactic and semantic factors that constrain the causative alternation and to determine whether lectal variation is at play.

In WP5, we conducted a diachronic case study to explore convergence or divergence between the lexicons of Mainland Chinese and Taiwan Chinese. We again used token-based vector space models as a semantic control to handle polysemy. We applied the t-SNE technique to visualize the semantic structure of the lexical items (Figure 1) and density-based cluster analysis to filter out "out-of-concept" clusters (Figure 2).
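The word-choice uniformity calculation described above can be sketched as follows. This is a minimal illustration in the spirit of lectometric uniformity measures: for each concept, the overlap of the relative frequencies of its variant words in two lects, aggregated over concepts with optional frequency weights. The variant names and counts below are invented toy data, not project results.

```python
# Sketch of a concept-based uniformity measure between two lects.

def uniformity(freqs_a, freqs_b):
    """Uniformity for one concept: overlap of the relative
    frequencies of its variant words in two lects (0..1)."""
    variants = set(freqs_a) | set(freqs_b)
    total_a = sum(freqs_a.values()) or 1
    total_b = sum(freqs_b.values()) or 1
    return sum(min(freqs_a.get(v, 0) / total_a,
                   freqs_b.get(v, 0) / total_b)
               for v in variants)

def aggregate_uniformity(concepts, weights=None):
    """Aggregate uniformity over concepts, optionally weighting
    each concept by its relative frequency."""
    if weights is None:
        weights = {c: 1.0 for c in concepts}
    total_w = sum(weights.values())
    return sum(weights[c] * uniformity(a, b)
               for c, (a, b) in concepts.items()) / total_w

# Invented toy counts: variant frequencies per concept in two lects.
concepts = {
    "TAXI": ({"chuzuche": 80, "dishi": 20}, {"jichengche": 70, "dishi": 30}),
    "SOFTWARE": ({"ruanjian": 95, "ruanti": 5}, {"ruanti": 90, "ruanjian": 10}),
}
print(round(aggregate_uniformity(concepts), 3))  # 0.175
```

Two lects that use the same variants in the same proportions score 1; lects with disjoint variant inventories score 0, so the aggregate gives a gradient measure of lectal distance.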
WP6 includes the dissemination and communication activities of the project.

The main findings achieved so far in this project include the following:
• We proposed that proper corpus-based lexical lectometry research should normally involve the following distinct tasks: (1) compilation of a lectally stratified corpus; (2) sampling concepts as measuring points for lectometry; (3) identification of lexical expressions per concept; (4) disambiguation of lexical expressions in corpus data; (5) calculation of aggregated lexico-lectometric distances; (6) evaluation of measurement reliability and validity.
• Token-based vector space modelling proved to be an effective and efficient approach for semantic control in a large-scale corpus-based study on lexical variation. We also provided a visual representation of the token-based vector space models with the help of the dimensionality-reduction technique.
• For both the non-weighted aggregation and frequency-weighted aggregation, we found a lectally structured onomasiological variation in the Chinese lexicon. We also found significant lectal differences in the choice of the causative markers (Figure 3).
• The diachronic variation in Chinese lexicons is influenced by features of the concepts, such as concept salience, lexical field, vagueness and affect. A diachronic lectometry also helps us track how a “new” lexicalization gradually spreads from its initial contexts to new ones, giving us a better understanding of the processes of language (de)standardisation in Chinese.
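The density-based filtering of token clouds mentioned in the findings can be illustrated with a minimal sketch: tokens that lie in dense regions of the (already dimensionality-reduced) token cloud are kept as in-concept, while isolated tokens are treated as out-of-concept noise. The project used full density-based cluster analysis; this sketch shows only the core-point idea behind it, with invented 2-D points standing in for t-SNE coordinates.

```python
# Minimal density-based filter over token coordinates:
# keep tokens with enough close neighbours (DBSCAN-style core points),
# drop isolated "out-of-concept" tokens.

import math

def density_filter(points, eps=1.0, min_pts=3):
    """Return indices of points with at least min_pts neighbours
    within distance eps."""
    keep = []
    for i, p in enumerate(points):
        neighbours = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= eps
        )
        if neighbours >= min_pts:
            keep.append(i)
    return keep

# A tight cluster of four tokens plus two stray tokens.
tokens = [(0, 0), (0.3, 0.2), (0.1, 0.4), (0.4, 0.1), (5, 5), (-4, 3)]
print(density_filter(tokens))  # [0, 1, 2, 3]
```

In practice the eps and min_pts thresholds would be tuned per concept, and full DBSCAN would additionally assign cluster IDs, as in Figure 2.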

We disseminate the project results both to the scientific community and to the public. The methodological innovations and theoretical implications have been presented at five peer-reviewed specialized conferences. Some findings from the case studies will be published in peer-reviewed collective volumes in late 2021 and 2022. Further dissemination was carried out through talks, most notably as invited speaker at workshops. We have also actively disseminated our results to the general public by recording a project pitch. The data gathered within the project, as well as all articles, are open and accessible to the scientific community and society as a whole.
In this project we have proposed a descriptive and methodological guideline for applying vector space models to lexical lectometry research and shown that vector space models can effectively detect regional lexical variation in large-scale corpora of written texts. We have turned token-based models into useful lexicological tools for both theoretically and practically grounded linguistic questions.
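A token-based model of the kind described can be sketched minimally: each corpus occurrence (token) of a word is represented by averaging the type vectors of its context words within a window, so that tokens of the same word in different senses end up in different regions of the space. The type vectors and example sentence below are invented toy values; real models derive type vectors distributionally from the corpus.

```python
# Sketch of a token-level vector built from context-word type vectors.

def token_vector(sentence, position, type_vectors, window=2):
    """Average the type vectors of the words within `window`
    positions around the token at `position`."""
    lo = max(0, position - window)
    context = sentence[lo:position] + sentence[position + 1:position + 1 + window]
    vecs = [type_vectors[w] for w in context if w in type_vectors]
    if not vecs:
        return None  # no known context words: token cannot be placed
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

# Toy 2-D type vectors; "bank" gets a token vector from its context.
type_vectors = {"bank": [1.0, 0.0], "river": [0.0, 1.0], "money": [1.0, 1.0]}
sent = ["the", "river", "bank", "money", "flows"]
print(token_vector(sent, 2, type_vectors))  # [0.5, 1.0]
```

Because the token vector depends only on the surrounding context, tokens of a polysemous word cluster by sense, which is what makes the approach usable as a semantic control in lectometric counting.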

Our project has a number of theoretical and empirical implications for Sociolinguistics, Cognitive Linguistics, and Chinese linguistics: (i) This is the first broad-scale empirical study measuring the linguistic distances among the varieties of Chinese, and a thorough investigation of token-based vector space models specifically intended for use in a variationist framework. In particular, we have addressed an important issue in Sociolinguistics, i.e. the demarcation of the “envelope of variation”. (ii) It also draws our attention to a core issue in Cognitive Linguistics: what is the meaning of “meaning”? Our model allows us to model the contextual meaning of a lexical variant; we therefore take a usage-based view of meaning, which responds to the call of Cognitive Linguistics. (iii) We have applied the methods and models developed for studying changes in the position of standard languages in Europe to the situation of Chinese. This offers insights for research on the Chinese languages, especially in the language policy context, given that the language policies of the various regions might play a role in the linguistic variation.

Our workflow could also be applied to customize a search engine to the needs of various user communities.
Figure 1 Token clouds for COUNTERATTACK
Figure 2 Token clouds for COUNTERATTACK with the cluster IDs
Figure 3 Interaction between semantic class of causee and lects for the response