Subliminal learning in the Mandarin lexicon

Project information

SUBLIMINAL

Grant agreement ID: 101054902

DOI

10.3030/101054902

EC signature date 13 June 2022

Start date 1 September 2022

End date 31 August 2027

Funded under

European Research Council (ERC)

Total cost

€ 2 483 750,00

EU contribution

€ 2 483 750,00

Coordinated by

EBERHARD KARLS UNIVERSITAET TUEBINGEN
Germany

Periodic Reporting for period 1 - SUBLIMINAL (Subliminal learning in the Mandarin lexicon)

Reporting period: 2022-09-01 to 2025-02-28

The SUBLIMINAL project starts from the observation that there are systematic regularities in the spoken language that escape our awareness, that are shielded from us by linguistic traditions and cultural conventions embodied in writing systems, but that nevertheless are detected by our brains, albeit subliminally, and used to optimize lexical processing. The project focuses on Mandarin Chinese.

Mandarin is a tone language. It has four lexical tones (high, rising, fall-rising, falling) and an additional, more variable floating tone. Current theories hold that when two words form a compound, e.g. 学校 (xuéxiào), "school," the tones of the constituents inherit the canonical tones of the corresponding independent words. For xuéxiào, these tones are a rising tone followed by a falling tone. A great many studies have investigated how exactly the tones of Mandarin words are realized. Tonal realization has been found to depend on the tones of neighboring words, the speech rate, and the words' sound segments, among other factors.

A novel and surprising discovery is that the exact way in which canonical tones are realized varies from word to word. Words appear to have their own pitch contour signatures. We have now documented the existence of word-specific signatures in the pitch contours of Mandarin words for three different datasets, all collected from a corpus of spontaneous conversations recorded in Taipei. Word-specific pitch signatures are very prominently present in the pitch contours of two-syllable words, but are also clearly visible in the pitch contours of single-syllable words. The accompanying figure visualizes the word-specific pitch signatures for four words that all share the same canonical tone sequence (a falling tone followed by a rising tone), and that have syllables to their left and right sharing the same canonical tone. The word for "chemistry" (化學, lower left panel) has a rising pitch contour where the other three words have a falling pitch contour. The word for "now" (目前, upper right panel) shows a slight fall in the first syllable that is absent for 大學 ("university") and 問題 ("question").

These findings raise the question of why such word-specific signatures in tonal realization are present. One possibility that we investigated is that these signatures are due to the physics of the sounds in Mandarin words. However, the effects of consonants and vowels on tones reported in the literature have little explanatory value for our datasets. A very different possibility is that words' meanings are at issue. Words' individual pitch signatures make spoken words more distinct, which might attenuate for the listener the uncertainties that come with the widespread homophony that characterizes the Mandarin lexicon. For instance, the syllable jiù, realized with a falling tone, has meanings as diverse as 就 (then), 旧 (old), 救 (rescue), 舅 (maternal uncle), 鹫 (vulture), and 臼 (mortar). The writing system distinguishes between these meanings, and we argue that tonal signatures might have a similar function.

In computational semantics, word meanings are represented by high-dimensional numeric vectors known as embeddings. Current large language models make it possible to calculate word embeddings that are fine-tuned to the contexts in which words occur. We made use of this technology to calculate, for the word tokens in our corpus, the corresponding contextualized embeddings. We paired these high-dimensional, token-specific semantic vectors with the corresponding vectors of pitch contours. It turns out that pitch vectors can be predicted from their corresponding contextualized embeddings with an accuracy that far exceeds a randomization baseline. Furthermore, a given canonical pitch contour, as predicted by a generalized additive model, can be predicted with remarkable precision from the centroid of the embeddings of the words sharing that canonical pitch contour.
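The logic of comparing embedding-based prediction against a randomization baseline can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic, the mapping is a plain ridge-regularized linear map, the baseline shuffles the pairing between embeddings and pitch vectors in the training half, and the names fit_linear_map, heldout_correlation, and run_demo are hypothetical. The report does not specify the model or the randomization scheme the project used.

```python
import numpy as np

def fit_linear_map(X, Y, ridge=1e-3):
    """Ridge-regularized least squares: find W such that X @ W approximates Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ Y)

def heldout_correlation(X_train, Y_train, X_test, Y_test):
    """Mean Pearson correlation between predicted and observed test contours."""
    W = fit_linear_map(X_train, Y_train)
    Y_hat = X_test @ W
    # Center each contour (row) before correlating predicted with observed.
    a = Y_hat - Y_hat.mean(axis=1, keepdims=True)
    b = Y_test - Y_test.mean(axis=1, keepdims=True)
    r = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(r.mean())

def run_demo(seed=0, n_tokens=400, emb_dim=20, n_pitch=10, n_perm=20):
    rng = np.random.default_rng(seed)
    # Synthetic stand-ins (assumption): rows of X play the role of contextualized
    # embeddings, rows of Y the role of pitch contours sampled at n_pitch points.
    X = rng.normal(size=(n_tokens, emb_dim))
    true_map = rng.normal(size=(emb_dim, n_pitch))
    Y = X @ true_map + 0.5 * rng.normal(size=(n_tokens, n_pitch))

    split = n_tokens // 2
    observed = heldout_correlation(X[:split], Y[:split], X[split:], Y[split:])

    # Randomization baseline: break the embedding-to-pitch pairing in training.
    baseline = []
    for _ in range(n_perm):
        perm = rng.permutation(split)
        baseline.append(
            heldout_correlation(X[:split], Y[:split][perm], X[split:], Y[split:])
        )
    return observed, float(np.mean(baseline))

if __name__ == "__main__":
    observed, baseline = run_demo()
    print(f"held-out correlation: {observed:.3f}, shuffled baseline: {baseline:.3f}")
```

With real data, X would hold one contextualized embedding per word token and Y the matching pitch vector. The comparison of interest is whether the held-out score clearly exceeds the scores obtained after shuffling.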

The SUBLIMINAL project is also investigating the structure of the semantic space of Mandarin, using word embeddings. To bring out the specifics of the Mandarin space as compared to the semantic space of English, we created a dataset with large numbers of words across high-level semantic categories, such as words for animals, words for body parts, and words for vehicles. We used the centroids of the embeddings of the words in these categories to set up a Procrustes rotation, and then applied this rotation to all words in our dataset. Visualization of the resulting joint semantic space using t-distributed stochastic neighbor embedding (t-SNE) revealed not only the expected global similarities in the cognitive structures of the semantic spaces of Mandarin and English, but also many subtle but well-interpretable differences in how the two languages conceptualize the world. For instance, Mandarin kinship terms are distributed in two distinct clusters, both of which are well-differentiated from the English kinship terms. One cluster of Mandarin kinship terms denotes the closest family members (e.g. 爸爸 (bàba) "dad" and 妈妈 (māma) "mom"). By contrast, words for more distant family members are predominant in the other cluster. Some of these words are often used for polite referencing of non-relatives, such as 阿姨 (āyí), "aunt, mother’s sister," a polite form of address used for caretakers in the home and friends of one's parents.
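The alignment step described above can be sketched as an orthogonal Procrustes problem solved on category centroids. This is an illustrative sketch under stated assumptions, not the project's implementation: the embeddings are synthetic, the category list beyond animals, body parts, and vehicles is made up for the demo, the function names are hypothetical, and the t-SNE visualization step is omitted.

```python
import numpy as np

def category_centroids(embeddings, labels, categories):
    """Mean embedding per category, stacked in the order given by `categories`."""
    labels = np.asarray(labels)
    return np.vstack([embeddings[labels == c].mean(axis=0) for c in categories])

def procrustes_rotation(source, target):
    """Orthogonal matrix R minimizing the Frobenius norm of source @ R - target."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

def run_demo(seed=0, dim=4, per_category=30):
    rng = np.random.default_rng(seed)
    # Category names after the first three are hypothetical additions.
    categories = ["animals", "body_parts", "vehicles", "kinship", "foods", "tools"]
    labels = np.repeat(categories, per_category)
    idx = np.repeat(np.arange(len(categories)), per_category)

    # Synthetic spaces (assumption): both languages share category centers,
    # but the "Mandarin" space is rotated by an unknown orthogonal matrix.
    centers = 3.0 * rng.normal(size=(len(categories), dim))
    true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    english = centers[idx] + 0.3 * rng.normal(size=(len(idx), dim))
    mandarin = (centers[idx] + 0.3 * rng.normal(size=(len(idx), dim))) @ true_rotation

    cen_en = category_centroids(english, labels, categories)
    cen_zh = category_centroids(mandarin, labels, categories)

    # Estimate the rotation from centroids only, then apply it to every word.
    rotation = procrustes_rotation(cen_zh, cen_en)
    before = float(np.linalg.norm(cen_zh - cen_en))
    after = float(np.linalg.norm(cen_zh @ rotation - cen_en))
    aligned_mandarin = mandarin @ rotation
    return before, after, rotation, aligned_mandarin

if __name__ == "__main__":
    before, after, rotation, aligned = run_demo()
    print(f"centroid mismatch before: {before:.2f}, after: {after:.2f}")
```

Estimating the rotation from a small number of centroids rather than from individual word pairs means that no word-by-word translation dictionary is needed. The number of categories should be at least the embedding dimension for the rotation to be fully determined, so with real high-dimensional embeddings the dimensions not constrained by the centroids remain arbitrary.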

An important goal for the remainder of the SUBLIMINAL project is to use these novel insights about form and meaning to optimize second language acquisition of Mandarin Chinese. Because teachers of Mandarin Chinese as a second language are not aware of the highly specific pitch contours that many words have, the feedback given to learners is the canonical tone pattern, rather than the pitch contours that learners are actually hearing. Our hypothesis is that this error-ridden feedback to L2 learners of Mandarin slows down learning. In collaboration with the research team of Prof. Van Rijn at the University of Groningen, who are specialists in optimizing fact learning, this hypothesis is currently being tested experimentally. The findings of this project also have far-reaching implications for standard linguistic theories. These theories work with discrete abstract units such as phones, stems, and affixes, and with abstract features such as tones and pitch accents. However, the fine-grained isomorphies between meaning in context and tonal realization in context challenge the assumption that abstract units and features are essential for precise prediction. Abstract units and features render invisible the low-level systematicities between form and meaning that the SUBLIMINAL project is documenting.

Figure: Word-specific tonal contours
