Non-sequence models for tokenization replacement

Periodic Reporting for period 3 - NonSequeToR (Non-sequence models for tokenization replacement)

Reporting period: 2020-10-01 to 2022-03-31

Natural language processing (NLP) is concerned with computer-based processing of natural language, with applications such as human-machine interfaces and information access. The capabilities of NLP are currently severely limited compared to those of humans. NLP has high error rates for languages that differ from English (e.g. morphologically complex languages such as Czech) and for text genres that are not well edited, i.e. noisy, yet of high economic importance, such as social media text.

NLP is based on machine learning, which requires as its basis a representation that reflects the underlying structure of the domain, in this case the structure of language. But the representations currently used are symbol-based: text is broken into surface forms by sequence models that implement tokenization heuristics, and each surface form is then treated as a symbol or represented as an embedding (a vector representation) of that symbol. These heuristics are arbitrary and error-prone, especially for non-English and noisy text, and result in poor performance.
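To make this limitation concrete, here is a minimal illustrative sketch in Python (toy code, not part of the project): a simple tokenization heuristic followed by a lookup of one embedding per symbol. The vocabulary, vectors and regular expression are invented for the example; any surface form outside the vocabulary, as is common in noisy or morphologically rich text, collapses to the same unknown symbol and loses the information carried by its character string.

# Toy sketch of a symbol-based pipeline (illustrative only, not project code).
import re
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}    # invented toy vocabulary
embeddings = rng.normal(size=(len(vocab), 4))         # one vector per symbol

def tokenize(text):
    # Tokenization heuristic: lowercase, split on anything that is not a letter.
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def embed(text):
    # Each surface form is mapped to a symbol id, then to that symbol's vector.
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]
    return embeddings[ids]

print(tokenize("The cat sat."))         # ['the', 'cat', 'sat']
print(embed("The cat sat.").shape)      # (3, 4)
# Noisy or unseen surface forms all collapse to the same <unk> symbol,
# so the information carried by the character string is lost.
print(tokenize("thee catttt saaat!!"))  # ['thee', 'catttt', 'saaat'] -> all <unk>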

Advances in deep learning now make it possible to take the
embedding idea and liberate it from the limitations of
symbolic tokenization. In an interdisciplinary approach
based on computational linguistics, computer science and deep
learning, our goal is to design a new robust and powerful
non-symbolic text representation that captures all aspects
of form and meaning that NLP needs for successful
processing.

By creating a text representation for NLP that is not impeded by the limitations of symbol-based tokenization, we lay the foundations for taking NLP applications such as human-machine interaction, human-human communication supported by machine translation, and information access to the next level.

We conceived of a new method for learning embeddings (i.e.
distributed representations) of subwords and words: learning
by concept induction. Applying this method to a highly
parallel text corpus, we learned semantic representations
for 1259 different languages in a single common space. An
extensive experimental evaluation on crosslingual word
similarity and sentiment analysis indicated that
concept-based multilingual embedding learning performs very well.
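As a small illustration of how such a common space can be used (a sketch over invented data, not the project's evaluation code), crosslingual word similarity reduces to cosine similarity between the vectors of words from different languages:

import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical shared space: keys are (language, word), values are toy vectors.
shared_space = {
    ("eng", "water"):  np.array([0.90, 0.10, 0.00]),
    ("deu", "Wasser"): np.array([0.85, 0.15, 0.05]),
    ("ces", "pes"):    np.array([0.10, 0.90, 0.20]),  # "dog"
}

# Translations end up close to each other, unrelated words further apart.
print(cosine(shared_space[("eng", "water")], shared_space[("deu", "Wasser")]))  # ~1.0
print(cosine(shared_space[("eng", "water")], shared_space[("ces", "pes")]))     # much lower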

We also addressed the problem that natural language
processing systems struggle to understand rare words. This
is partly due to the fact that symbolic representations are
difficult to learn for words that have only a few
occurrences. To address this, we introduced Bertram, a powerful architecture capable of inferring high-quality embeddings for rare words that are suitable as input representations for deep language models. This is achieved by enabling the surface form (i.e. character string) and the contexts of a word to interact with each other in a deep architecture. This yields large performance increases on various NLP tasks.
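The following is a highly simplified sketch of the underlying form-context idea, with toy PyTorch components standing in for the actual deep architecture: a character-level encoding of the surface form and a pooled encoding of the observed contexts are combined through a learned gate. All names and dimensions are invented for illustration and do not reflect Bertram itself.

# Toy form-context combiner (illustrative only, not the Bertram architecture).
import torch
import torch.nn as nn

class FormContextCombiner(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)  # decides how much to trust each source

    def forward(self, form_vec, context_vecs):
        context_vec = context_vecs.mean(dim=0)            # pool the contexts
        alpha = torch.sigmoid(self.gate(torch.cat([form_vec, context_vec])))
        return alpha * form_vec + (1 - alpha) * context_vec

dim = 8
combiner = FormContextCombiner(dim)
form_vec = torch.randn(dim)          # e.g. output of a character-level encoder
context_vecs = torch.randn(5, dim)   # e.g. encodings of 5 sentences containing the word
rare_word_embedding = combiner(form_vec, context_vecs)
print(rare_word_embedding.shape)     # torch.Size([8])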

There has been little work on modeling the morphological
well-formedness of derivatives, a problem judged to be
complex and difficult in linguistics. This is partly due to
the fact that derivatives
are word units that are often unattested in training text.
It is thus another instance of the problem that good representations are hard to learn for rare words.
We presented a graph auto-encoder that learns embeddings
capturing information about the compatibility of affixes and
stems in derivation. The auto-encoder models morphological
well-formedness in English surprisingly well. We showed that
character-level information is crucial for solving this
task.
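The core intuition can be sketched as follows (illustrative only, not the actual graph auto-encoder): node embeddings learned for stems and affixes are combined by a link-prediction style dot-product decoder, and the resulting score is read as the well-formedness of the derivative. The embeddings below are random stand-ins for learned ones, so the printed scores are arbitrary.

# Toy stem-affix compatibility scorer (illustrative only, not project code).
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for node embeddings a graph auto-encoder would learn.
stems = {"lock": rng.normal(size=6), "happy": rng.normal(size=6)}
affixes = {"un-": rng.normal(size=6), "-able": rng.normal(size=6)}

def well_formedness(stem, affix):
    # Dot-product decoder: a higher score would indicate that the affix
    # attaches well to the stem (with trained, not random, embeddings).
    return float(stems[stem] @ affixes[affix])

print(well_formedness("lock", "-able"))   # e.g. "lockable"
print(well_formedness("happy", "-able"))  # e.g. "*happiable"
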
One important focus for the second half of the project will be multilingual representations. While we established a new state of the art for multilingual representations covering more than 1000 languages, performance overall is still at a level much lower than human performance. We aim to make progress on a number of fronts here, including character-level representations and multilevel representations.