(1) MASSIVELY MULTILINGUAL REPRESENTATIONS. We developed
a new method for learning embeddings (i.e. distributed
representations) of subwords and words: learning by concept
induction. Applying this method to a highly parallel text
corpus, we learned semantic representations for 1259
different languages in a single common space. An extensive
experimental evaluation on crosslingual word similarity and
sentiment analysis indicated the excellent performance of
concept-based multilingual embedding learning.
Publication:
Philipp Dufter, Mengjie Zhao, Martin Schmitt, Alexander
M. Fraser, Hinrich Schütze:
Embedding Learning Through Multilingual Concept
Induction. ACL (1) 2018: 1520-1530.
(2) LOW-RESOURCE FOUNDATION MODEL. We presented the first
open-source foundation model that covers hundreds of
low-resource languages across the major language
families of the world. We created several data resources for
this project, which are publicly available to the extent
that copyright law permits. We also developed a new
evaluation methodology for this setting (i.e. low-resource
languages that lack human-annotated data). As part of this,
we published the TAXI1500 evaluation dataset for
multilingual classification. Finally, we published GlotLID,
a state-of-the-art language identification resource with
unprecedented coverage of more than 1500 languages.
Publication:
Ayyoob Imani, Peiqin Lin, Amir Hossein Kargaran, Silvia
Severini, Masoud Jalili Sabet, Nora Kassner, Chunlan Ma,
Helmut Schmid, André F. T. Martins, François Yvon, Hinrich
Schütze:
Glot500: Scaling Multilingual Corpora and Language Models to
500 Languages. ACL (1) 2023: 1082-1117.
(3) RARE WORD COVERAGE OF FOUNDATION MODELS. We addressed
the problem that natural language processing systems
struggle to understand rare words. This is partly because
good representations are difficult to learn for words that
occur only a few times. As a remedy, we introduced Bertram, a
powerful architecture that is capable of inferring
high-quality embeddings for rare words that are suitable as
input representations for deep language models. This is
achieved by enabling the surface form (i.e. character
string) and contexts of a word to interact with each other
in a deep architecture. This yields large performance gains
on various NLP tasks.
Publication:
Timo Schick, Hinrich Schütze:
Rare Words: A Major Problem for Contextualized Embeddings
and How to Fix It by Attentive Mimicking. Proceedings of the
AAAI Conference on Artificial Intelligence 34 (2020).
(4) MORPHOLOGY. There has been little work on modeling the
morphological well-formedness of derivatives, a problem
considered complex and difficult in linguistics. This is
partly because derivatives are often unattested in training
text, which is another instance of the difficulty of learning
good representations for rare words.
We presented a graph auto-encoder that learns embeddings
capturing information about the compatibility of affixes and
stems in derivation. The auto-encoder models morphological
well-formedness in English surprisingly well. We showed that
character-level information is crucial for solving this
task.
Publication:
Valentin Hofmann, Hinrich Schütze, Janet B. Pierrehumbert:
A Graph Auto-encoder Model of Derivational Morphology. ACL
2020: 1127-1138.
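
To make the idea of a graph auto-encoder over stems and affixes
more concrete, the following is a minimal sketch of the forward
pass of a generic graph auto-encoder (one linear graph-convolution
layer as encoder, inner-product decoder). It is not the model from
the publication: the toy graph, the node names, the one-hot input
features, the embedding size, and the untrained random weights are
all assumptions made for illustration, and the sketch uses no
character-level information.

```python
import numpy as np

# Hypothetical toy graph: nodes 0-2 are stems, nodes 3-4 are affixes.
nodes = ["read", "drink", "happy", "-able", "un-"]
edges = [(0, 3), (1, 3), (2, 4)]  # attested stem-affix combinations

n = len(nodes)
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0  # undirected edges

# Symmetrically normalized adjacency with self-loops:
# A_hat = D^{-1/2} (A + I) D^{-1/2}
A_tilde = A + np.eye(n)
d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

rng = np.random.default_rng(0)
X = np.eye(n)                           # assumption: one-hot node features
W = rng.normal(scale=0.1, size=(n, 2))  # assumption: untrained weights, 2-dim embeddings

Z = A_hat @ X @ W                         # encoder: node embeddings
A_rec = 1.0 / (1.0 + np.exp(-(Z @ Z.T)))  # decoder: sigmoid of inner products


def compatibility(stem: str, affix: str) -> float:
    """Reconstructed edge probability for a stem-affix pair."""
    return float(A_rec[nodes.index(stem), nodes.index(affix)])
```

In a trained model, the weights would be fitted so that the
reconstructed probabilities are high for attested stem-affix pairs
and low for unattested ones; compatibility("happy", "-able") would
then serve as a well-formedness score for an unseen derivative.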
(5) PROMPTING. We presented what is arguably the first work
on prompt engineering, i.e. the methodology for creating
text that a foundation model can interpret in the context
of solving a specific task. Prompt engineering is one of the
major new developments of the last few years in natural
language processing and it is an essential component of
using foundation models. Our work addresses a number of
different representation issues that need to be solved for
effective prompting.
Publication:
Timo Schick, Hinrich Schütze:
Exploiting Cloze-Questions for Few-Shot Text Classification
and Natural Language Inference. EACL 2021: 255-269. (First
published in January of 2020.)
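
As an illustration of what a cloze question for text
classification can look like, the sketch below wraps an input in
a pattern with one masked slot and maps each label to a word that
could fill that slot. The pattern text, the label words, and the
function names are hypothetical, and toy_score_fill is only a
keyword-counting stand-in for a masked language model so that the
example runs without any model.

```python
from typing import Callable, Dict


def pattern(text: str, mask: str = "[MASK]") -> str:
    """Hypothetical pattern: turn the input into a cloze question."""
    return f"{text} All in all, it was {mask}."


# Hypothetical mapping from each label to one word that can fill the mask.
VERBALIZER: Dict[str, str] = {"positive": "great", "negative": "terrible"}


def classify(text: str, score_fill: Callable[[str, str], float]) -> str:
    """Pick the label whose word gets the highest score in the masked slot."""
    cloze = pattern(text)
    scores = {label: score_fill(cloze, word) for label, word in VERBALIZER.items()}
    return max(scores, key=scores.get)


def toy_score_fill(cloze: str, word: str) -> float:
    """Toy stand-in for a language model: counts cue words in the cloze text."""
    cues = {"great": ("loved", "excellent"), "terrible": ("hated", "awful")}
    lowered = cloze.lower()
    return float(sum(lowered.count(cue) for cue in cues[word]))
```

With a real masked language model, score_fill would instead
return the model's score for the candidate word at the masked
position, so the classification decision is read off the model's
own predictions for the cloze question.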