Periodic Reporting for period 1 - SUBLIMINAL (Subliminal learning in the Mandarin lexicon)
Reporting period: 2022-09-01 to 2025-02-28
A novel and surprising discovery is that the exact way in which canonical tones are realized varies from word to word. Words appear to have their own pitch contour signatures. We have now documented the existence of word-specific signatures in the pitch contours of Mandarin words for three different datasets, all collected from a corpus of spontaneous conversations recorded in Taipei. Word-specific pitch signatures are very prominently present in the pitch contours of two-syllable words, but also clearly visible in the pitch contours of single-syllable words. The accompanying figure visualizes the word-specific pitch signatures for four words that all share the same canonical fall-rise tones, and that have syllables to their left and right sharing the same canonical tone. The word for "chemistry" (化學, lower left panel) has a rising pitch contour where the other three words have a falling pitch contour. The word for "now" (目前, upper right panel) shows a slight fall in the first syllable that is absent for 大學 ("university") and 問題 ("question").
These findings raise the question of why such word-specific signatures in tonal realization are present. One possibility that we investigated is that these signatures are due to the physics of the sounds in Mandarin words. However, the effects of consonants and vowels on tones reported in the literature have little explanatory value for our datasets. A very different possibility is that words' meanings are at issue. Words' individual pitch signatures make spoken words more distinct, which might attenuate for the listener the uncertainties that come with the widespread homophony that characterizes the Mandarin lexicon. For instance, the syllable jiù, realized with a falling tone, has meanings as diverse as 就 (then), 旧 (old), 救 (rescue), 舅 (maternal uncle), 鹫 (vulture), and 臼 (mortar). The writing system distinguishes between these meanings, and we argue that tonal signatures might have a similar function.
In computational semantics, word meanings are represented by high-dimensional numeric vectors known as embeddings. Current large language models make it possible to calculate word embeddings that are fine-tuned to the contexts in which words occur. We made use of this technology to calculate, for the word tokens in our corpus, the corresponding contextualized embeddings. We paired these high-dimensional, token-specific semantic vectors with the corresponding vectors of pitch contours. It turns out that pitch vectors can be predicted from their corresponding contextualized embeddings with an accuracy that far exceeds a randomization baseline. Furthermore, a given canonical pitch contour, as estimated by a generalized additive model, can be predicted remarkably precisely from the centroid of the embeddings of the words sharing that canonical pitch contour.
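The comparison against a randomization baseline can be illustrated with a minimal sketch. The report does not state which predictive model was used, so everything here is an assumption made for illustration: a ridge-regularized linear mapping, synthetic data in place of the corpus, per-token Pearson correlation as the accuracy measure, and the helper names fit_linear_map, mean_correlation and evaluate.

```python
import numpy as np

def fit_linear_map(E, P, ridge=1e-2):
    """Ridge-regularized least squares from embeddings E (n x d) to pitch vectors P (n x k)."""
    d = E.shape[1]
    return np.linalg.solve(E.T @ E + ridge * np.eye(d), E.T @ P)

def mean_correlation(P_true, P_pred):
    """Mean per-token Pearson correlation between observed and predicted contours."""
    a = P_true - P_true.mean(axis=1, keepdims=True)
    b = P_pred - P_pred.mean(axis=1, keepdims=True)
    num = (a * b).sum(axis=1)
    den = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    return float(np.mean(num / den))

def evaluate(E, P, n_perm=20, seed=0):
    """Held-out accuracy of the mapping, and the mean accuracy after shuffling the pairing."""
    rng = np.random.default_rng(seed)
    n = E.shape[0]
    idx = rng.permutation(n)
    train, test = idx[: n * 4 // 5], idx[n * 4 // 5:]
    W = fit_linear_map(E[train], P[train])
    observed = mean_correlation(P[test], E[test] @ W)
    baseline = []
    for _ in range(n_perm):
        # Break the embedding-to-contour pairing in the training data only.
        shuffled = rng.permutation(len(train))
        W_perm = fit_linear_map(E[train], P[train][shuffled])
        baseline.append(mean_correlation(P[test], E[test] @ W_perm))
    return observed, float(np.mean(baseline))

# Synthetic stand-in data (hypothetical): 500 tokens, 32-dim embeddings, 10-point contours.
rng = np.random.default_rng(1)
E = rng.normal(size=(500, 32))
W_true = rng.normal(size=(32, 10))
P = E @ W_true + 0.5 * rng.normal(size=(500, 10))

observed, baseline = evaluate(E, P)
print(f"observed: {observed:.3f}  randomization baseline: {baseline:.3f}")
```

Shuffling only the training pairs keeps the distribution of contours intact while destroying their link to meaning, so any accuracy that remains reflects chance rather than a semantic signal.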
The SUBLIMINAL project is also investigating the structure of the semantic space of Mandarin, using word embeddings. To bring out the specifics of the Mandarin space as compared to the semantic space of English, we created a dataset with large numbers of words across high-level semantic categories, such as words for animals, words for body parts, and words for vehicles. We used the centroids of the embeddings of the words in these categories to set up a Procrustes rotation, and then applied this rotation to all words in our dataset. Visualization of the resulting joint semantic space using t-distributed stochastic neighbor embedding (t-SNE) revealed not only the expected global similarities in the cognitive structures of the semantic spaces of Mandarin and English, but also many subtle but well-interpretable differences in how the two languages conceptualize the world. For instance, Mandarin kinship terms are distributed in two distinct clusters, both of which are well-differentiated from the English kinship terms. One cluster of Mandarin kinship terms denotes the closest family members (e.g. 爸爸 (bàba) "dad" and 妈妈 (māma) "mom"). By contrast, words for more distant family members are predominant in the other cluster. Some of these words are often used for polite reference to non-relatives, such as 阿姨 (āyí), "aunt, mother’s sister," a polite form of address used for caretakers in the home and friends of one's parents.
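The alignment step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the data are synthetic, the rotation is an orthogonal Procrustes solution obtained by singular value decomposition, no centering or scaling is applied, and the helper names category_centroids and procrustes_rotation are hypothetical.

```python
import numpy as np

def category_centroids(X, labels, categories):
    """One centroid (mean embedding) per semantic category, stacked row-wise."""
    return np.vstack([X[labels == c].mean(axis=0) for c in categories])

def procrustes_rotation(A, B):
    """Orthogonal matrix R minimizing the Frobenius norm of A @ R - B."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# Synthetic stand-in data (hypothetical): 8 categories, 30 words each, 5 dimensions.
rng = np.random.default_rng(0)
d, n_cat, per_cat = 5, 8, 30
categories = np.arange(n_cat)
labels = np.repeat(categories, per_cat)
cat_means = rng.normal(scale=3.0, size=(n_cat, d))

# "English" space, and a "Mandarin" space that is a rotated, noisy counterpart of it.
X_en = cat_means[labels] + rng.normal(size=(n_cat * per_cat, d))
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
X_zh = (cat_means[labels] + rng.normal(size=(n_cat * per_cat, d))) @ Q.T

# Estimate the rotation from the category centroids only ...
C_en = category_centroids(X_en, labels, categories)
C_zh = category_centroids(X_zh, labels, categories)
R = procrustes_rotation(C_zh, C_en)

# ... then apply it to every word to obtain a joint space.
X_zh_aligned = X_zh @ R
joint_space = np.vstack([X_en, X_zh_aligned])

before = np.linalg.norm(C_zh - C_en)
after = np.linalg.norm(C_zh @ R - C_en)
print(f"centroid mismatch before: {before:.2f}  after: {after:.2f}")
```

Estimating the rotation from centroids rather than from individual word pairs means that no word-by-word translation dictionary is required. The rotation is only fully determined when there are at least as many categories as embedding dimensions; real embeddings have far more dimensions than this toy example. The resulting joint_space array could then be passed to a t-SNE implementation for visualization.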