
Non-sequence models for tokenization replacement

Periodic Reporting for period 4 - NonSequeToR (Non-sequence models for tokenization replacement)

Reporting period: 2022-04-01 to 2023-09-30

Natural language processing (NLP) is concerned with
computer-based processing of natural language, with
applications such as human-machine interfaces and
information access. The capabilities of NLP are currently
severely limited compared to humans. NLP has high error
rates for languages that differ from English (e.g.
languages with higher morphological complexity like Czech)
and for text genres that are noisy or poorly edited yet of
high economic importance, e.g. social media text.

NLP is based on machine learning, which requires a
representation that reflects the underlying structure of the
domain, in this case the structure of language. But the
representations currently used
are symbol-based: text is broken into surface forms
by sequence models that implement tokenization
heuristics and treat each surface form as a symbol or
represent it as an embedding (a vector
representation) of that symbol. These heuristics are
arbitrary and error-prone, especially for non-English and
noisy text, resulting in poor performance.
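To make this concrete, here is a small sketch (toolkit
choice and examples are ours, purely illustrative) using a
standard Hugging Face tokenizer: an English-centric subword
vocabulary keeps common English words intact but shatters a
morphologically complex Czech word and noisy social media
spellings into many short, arbitrary pieces.

```python
# Illustrative sketch (not project code): how symbol-based
# subword tokenization fragments non-English and noisy text.
# Assumes the transformers library and the public
# bert-base-cased checkpoint are available.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

examples = [
    "tokenization",                    # well-edited English
    "nejneobhospodařovávatelnějšími",  # complex Czech word
    "soooo gooood!!!",                 # noisy social media
]
for text in examples:
    print(text, "->", tokenizer.tokenize(text))
# The English word maps to few subwords; the Czech word and
# the noisy tokens are split into many arbitrary pieces.
```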

Advances in deep learning now make it possible to take the
embedding idea and liberate it from the limitations of
symbolic tokenization. In an interdisciplinary approach
based on computational linguistics, computer science and deep
learning, our goal is to design a new robust and powerful
non-symbolic text representation that captures all aspects
of form and meaning that NLP needs for successful
processing.

By creating a text representation for NLP that is not
impeded by the limitations of symbol-based tokenization, we
lay the foundations for taking NLP applications such as
human-machine interaction, human-human communication
supported by machine translation, and information access to
the next level.

(1) MASSIVELY MULTILINGUAL REPRESENTATIONS. We conceived of
a new method for learning embeddings (i.e. distributed
representations) of subwords and words: learning by concept
induction. Applying this method to a highly parallel text
corpus, we learned semantic representations for 1259
different languages in a single common space. An extensive
experimental evaluation on crosslingual word similarity and
sentiment analysis indicated the excellent performance of concept-based multilingual
embedding learning. Publication:
Philipp Dufter, Mengjie Zhao, Martin Schmitt, Alexander
M. Fraser, Hinrich Schütze:
Embedding Learning Through Multilingual Concept
Induction. ACL (1) 2018: 1520-1530.
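As a drastically simplified sketch of the intuition (not
the paper's algorithm): in a verse-aligned parallel corpus,
words from different languages expressing the same concept
tend to occur in the same verses, so clustering words by
their verse signatures induces crosslingual concepts. The
toy corpus below is invented.

```python
# Simplified concept-induction sketch with invented data.
import numpy as np
from sklearn.cluster import KMeans

# Toy parallel corpus: verse id -> language-tagged words.
verses = {
    0: {"en:house", "de:haus", "en:big", "de:gross"},
    1: {"en:house", "de:haus", "en:small", "de:klein"},
    2: {"en:water", "de:wasser", "en:big", "de:gross"},
}
vocab = sorted({w for words in verses.values() for w in words})

# Binary word-by-verse occurrence matrix: words expressing
# the same concept have similar rows, whatever the language.
X = np.array([[w in verses[v] for v in sorted(verses)]
              for w in vocab], dtype=float)

# Each cluster is a candidate multilingual concept.
labels = KMeans(n_clusters=3, n_init=10,
                random_state=0).fit_predict(X)
for c in range(3):
    print(c, [w for w, l in zip(vocab, labels) if l == c])
```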

(2) LOW-RESOURCE FOUNDATION MODEL. We presented the first
open source foundation model that covers
hundreds of low-resource languages across the major language
families of the world. We created several data resources for
this project, which are publicly available to the extent
that copyright law permits. We also developed a new
evaluation methodology for this setting (i.e. low-resource
languages that lack human-annotated data). As part of this,
we published the TAXI1500 evaluation dataset for
multilingual classification. Finally, we published GlotLID,
a state-of-the-art language identification resource with
unprecedented coverage of more than 1500 languages.
Publication:
Ayyoob Imani, Peiqin Lin, Amir Hossein Kargaran, Silvia
Severini, Masoud Jalili Sabet, Nora Kassner, Chunlan Ma,
Helmut Schmid, André F. T. Martins, François Yvon, Hinrich
Schütze:
Glot500: Scaling Multilingual Corpora and Language Models to
500 Languages. ACL (1) 2023: 1082-1117.
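As an illustration, here is a minimal usage sketch for
GlotLID; it assumes the model is distributed as a fastText
binary on the Hugging Face Hub under cis-lmu/glotlid
(consult the project page for authoritative instructions).

```python
# Minimal GlotLID usage sketch; repository id and filename
# are assumptions based on the current distribution.
import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="cis-lmu/glotlid",
                             filename="model.bin")
model = fasttext.load_model(model_path)

# predict() returns a label such as "__label__ces_Latn"
# (ISO 639-3 code plus script) and a confidence score.
labels, scores = model.predict("Toto je česká věta.")
print(labels[0], scores[0])
```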

(3) RARE WORD COVERAGE OF FOUNDATION MODELS. We addressed the problem that natural language
processing systems struggle to understand rare words. This
is partly due to the fact that symbolic representations are
difficult to learn for words that have only a few
occurrences. To address this, we introduced Bertram, a
powerful architecture that is capable of inferring
high-quality embeddings for rare words that are suitable as
input representations for deep language models. This is
achieved by enabling the surface form (i.e. character
string) and contexts of a word to interact with each other
in a deep architecture. Large performance increases are
achieved on various NLP tasks.
Publication:
Timo Schick, Hinrich Schütze:
Rare Words: A Major Problem for Contextualized Embeddings
and How to Fix It by Attentive Mimicking. AAAI 2020.
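The following PyTorch sketch illustrates the core idea of
letting surface form and contexts interact; it is a strong
simplification, not the published Bertram architecture, and
all dimensions and names are illustrative.

```python
# Sketch: infer an embedding for a rare word by combining a
# surface-form encoding with encodings of its contexts via a
# learned gate (simplified stand-in for Bertram's attention).
import torch
import torch.nn as nn

class FormContextCombiner(nn.Module):
    def __init__(self, dim: int = 64, n_chars: int = 128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, char_ids, context_vecs):
        # Surface form: mean over character embeddings.
        form = self.char_emb(char_ids).mean(dim=0)
        # Contexts: mean over precomputed context vectors,
        # e.g. sentence embeddings of the word's occurrences.
        context = context_vecs.mean(dim=0)
        # The gate decides how much to trust form vs. context.
        alpha = torch.sigmoid(
            self.gate(torch.cat([form, context])))
        return alpha * form + (1 - alpha) * context

combiner = FormContextCombiner()
char_ids = torch.tensor([ord(c) for c in "kumquat"])
context_vecs = torch.randn(3, 64)  # 3 dummy contexts
embedding = combiner(char_ids, context_vecs)
print(embedding.shape)  # torch.Size([64])
```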

(4) MORPHOLOGY. There has been little work on modeling the morphological
well-formedness of derivatives, a problem judged to be
complex and difficult in linguistics. This is partly due to
the fact that derivatives
are word units that are often unattested in training text.
Again, this is an instance of the general problem that it is
hard to learn good representations for rare words.
We presented a graph auto-encoder that learns embeddings
capturing information about the compatibility of affixes and
stems in derivation. The auto-encoder models morphological
well-formedness in English surprisingly well. We showed that
character-level information is crucial for solving this
task.
Publication:
Valentin Hofmann, Hinrich Schütze, Janet B. Pierrehumbert:
A Graph Auto-encoder Model of Derivational Morphology. ACL
2020: 1127-1138.
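A minimal sketch of a graph auto-encoder in this spirit
(toy graph and dimensions invented, not the paper's model or
data): stems and affixes are nodes, attested derivatives are
edges, and node embeddings are trained so that their inner
products reconstruct the edges.

```python
# Graph auto-encoder sketch for stem-affix compatibility.
import torch
import torch.nn as nn

# Toy bipartite graph: nodes 0-2 are stems (read, teach,
# sing), nodes 3-4 are affixes (-able, -er). Edges are the
# attested derivatives readable, reader, teacher, singer.
n = 5
A = torch.zeros(n, n)
for stem, affix in [(0, 3), (0, 4), (1, 4), (2, 4)]:
    A[stem, affix] = A[affix, stem] = 1.0

A_hat = A + torch.eye(n)  # self-loops
deg = A_hat.sum(dim=1)
A_norm = A_hat / torch.outer(deg.sqrt(), deg.sqrt())

X = torch.eye(n)  # one-hot node features
W0 = nn.Parameter(torch.randn(n, 16) * 0.1)
W1 = nn.Parameter(torch.randn(16, 8) * 0.1)
optimizer = torch.optim.Adam([W0, W1], lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):
    H = torch.relu(A_norm @ X @ W0)  # first GCN layer
    Z = A_norm @ H @ W1              # second GCN layer
    logits = Z @ Z.T                 # inner-product decoder
    loss = loss_fn(logits, A)        # reconstruct the edges
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# A high score suggests a well-formed unattested derivative.
print(torch.sigmoid(Z @ Z.T)[1, 3].item())  # teach + -able?
```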

(5) PROMPTING. We arguably presented the first work on prompt
engineering, i.e. the methodology for creating text that a
foundation model can interpret and understand in the context
of solving a specific task. Prompt engineering is one of the
major new developments of the last few years in natural
language processing and it is an essential component of
using foundation models. Our work addresses a number of
different representation issues that need to be solved for
effective prompting.
Publication:
Timo Schick, Hinrich Schütze:
Exploiting Cloze-Questions for Few-Shot Text Classification
and Natural Language Inference. EACL 2021: 255-269. (First
published in January of 2020.)
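The following sketch shows cloze-style prompting in this
spirit, using the Hugging Face fill-mask pipeline; the model
choice, prompt pattern and verbalizers are illustrative, not
the paper's exact setup.

```python
# Cloze-style prompting sketch: reformulate sentiment
# classification as a cloze question and score label words
# ("verbalizers") with a masked language model.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

review = "The plot was predictable and the acting was flat."
prompt = f"{review} All in all, it was a [MASK] movie."

verbalizers = {"great": "positive", "terrible": "negative"}
scores = {}
for cand in fill(prompt, targets=list(verbalizers)):
    scores[cand["token_str"]] = cand["score"]

label = verbalizers[max(scores, key=scores.get)]
print(scores, "->", label)
```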

For massively multilingual representations, we introduced a
new graph-based methodology. For low-resource foundation
models, we proposed several new evaluation methodologies and
instantiated them by creating evaluation resources for
low-resource natural language processing. For rare word
coverage, we presented the first work that addresses this
issue for foundation models. For morphology, we introduced
an innovative graph-theoretic formalization. For prompting,
we arguably presented the first work on prompt engineering,
inaugurating a methodology in natural language processing
that has proved central to foundation model research and
application.