Periodic Reporting for period 4 - SEMANTAX (Form-Independent Semantics for Natural Language Understanding)
Reporting period: 2022-02-01 to 2023-07-31
Natural Language Processing (NLP) is the branch of artificial
intelligence (AI) dedicated to making computers capable of
analysing and generating text and speech, allowing users to access
information from texts that, unaided, they would not have time to
read. Unhindered access to information for citizens is of vital
importance to a democratic and dynamic society.
Its provision requires components such as parsers and language models,
named-entity recognisers, and knowledge graphs. More recently, Large
Language Models such as GPT have revolutionized NLP.
Large Language Models (LLMs) work by representing units of text such
as words, sentences, or characters in terms of the contexts of other
such units in which they occur in vast amounts of text, as
"embeddings" of a few hundred dimensions. Such models can be trained
as sequential models, an ability that underlies their fluency when
used to generate text, and that also underlies their tendency to
"hallucinate", producing plausible but factually incorrect text. They
can be further trained by fine-tuning for specific NLP tasks.
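As a concrete illustration of such embeddings, the following minimal
Python sketch retrieves context-dependent vectors for the tokens of a
sentence. It assumes the Hugging Face transformers library, and the
checkpoint named is an arbitrary small pretrained encoder chosen
purely for illustration, not a component developed by the project.

    import torch
    from transformers import AutoTokenizer, AutoModel

    # Illustrative checkpoint only; any small pretrained encoder would do.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("Google buys YouTube.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Each token is mapped to a context-dependent vector of a few
    # hundred dimensions (768 for this model).
    token_vectors = outputs.last_hidden_state      # (1, num_tokens, 768)
    sentence_vector = token_vectors.mean(dim=1)    # crude sentence vector
    print(sentence_vector.shape)                   # torch.Size([1, 768])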
However, LLMs cost millions of dollars to train, requiring vast
computational resources and text data. They are hard to update
without retraining from scratch, and slow to respond. They work in
quite a different way from people. There is a need to investigate
less data-
and compute-intensive methods. Since children learn language on the
basis of far less data, there is continuing interest in modeling
something more like human language understanding.
Unfortunately, we have very little idea of how to represent the
meanings associated with human understanding in a form that makes it
as obvious to the machine as it is to the human that text about one
company buying another answers a question about one owning the other.
Such "common-sense" implications are too obvious to human readers to
ever need to be explicitly stated in text. This makes it unlikely
that LLMs, merely by representing the contexts of use of relations
like "buy" and "own", actually represent the meaning of those words in
this sense.
The main emphasis of the project was to investigate an alternative
approach, using "machine-reading" with wide-coverage parsers to
extract relations between typed named entities from large amounts of
unlabeled text describing the same events from multiple sources, and
looking for evidence of entailment between those relations, based on
overlap between the sets of named-entity tuples that ground the
relations in the text. (For example, the companies in the "buy"
relation are included among those in the "own" relation, but not vice
versa.) On this basis, we construct a large entailment graph,
providing a semantics for question-answering (QA).
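To make the argument-overlap idea concrete, the Python sketch below
scores directional entailment between two relations by the fraction
of the premise relation's named-entity tuples that also occur with
the hypothesis relation. The relation names, tuples, and threshold
are invented for illustration, and the simple inclusion ratio is a
stand-in for the project's actual local scoring.

    from collections import defaultdict

    # (relation, (subject, object)) triples extracted by machine-reading.
    extractions = [
        ("buy", ("Google", "YouTube")),
        ("buy", ("Disney", "Pixar")),
        ("own", ("Google", "YouTube")),
        ("own", ("Disney", "Pixar")),
        ("own", ("Volkswagen", "Audi")),  # owned, but no "buy" report here
    ]

    args = defaultdict(set)
    for rel, pair in extractions:
        args[rel].add(pair)

    def entailment_score(premise, hypothesis):
        """Fraction of the premise's tuples also seen with the hypothesis."""
        p, h = args[premise], args[hypothesis]
        return len(p & h) / len(p) if p else 0.0

    print(entailment_score("buy", "own"))  # 1.0: "buy" pairs included in "own"
    print(entailment_score("own", "buy"))  # ~0.67: inclusion fails in reverse

    # Directed edges above a threshold form the entailment graph.
    relations = list(args)
    edges = [(p, h) for p in relations for h in relations
             if p != h and entailment_score(p, h) >= 0.8]
    print(edges)  # [('buy', 'own')]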
A further objective was to investigate the potential for LLMs to do
the same job.
Further objectives were to develop the Combinatory Categorial Grammar
(CCG) parsers used to build the graphs and in QA, and to develop CCG
as an explanatory theory of grammar, of use to linguists and in the
study of cognition more broadly.
We have applied the machine-reading approach to building entailment
graphs in multiple languages, including English and Chinese,
developing novel methods for completion of the local entailment
graph. These methods scale to larger datasets, and are evaluated on
Natural Language Inference (NLI) and Question Answering (QA), using
knowledge graphs built from Wikipedia.
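As an illustration of what completing a local entailment graph can
mean, the sketch below adds edges implied by transitivity to a small
graph of invented relations. It is not the project's own completion
method, only a simple example of filling in edges that the extracted
evidence leaves implicit.

    def transitive_completion(edges):
        """Add (a, c) whenever (a, b) and (b, c) are already present."""
        closed = set(edges)
        changed = True
        while changed:
            changed = False
            for a, b in list(closed):
                for b2, c in list(closed):
                    if b == b2 and a != c and (a, c) not in closed:
                        closed.add((a, c))
                        changed = True
        return closed

    local_edges = {("buy", "acquire"), ("acquire", "own")}
    print(transitive_completion(local_edges))
    # {('buy', 'acquire'), ('acquire', 'own'), ('buy', 'own')}, order varies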
We have further developed the method using embeddings in hybrid
symbolic-neural models, and explored extensions to temporal relations.
We show for a range of LLMs that much of the claimed performance in
NLI is artefactual, arising from inherent biases in building and
fine-tuning the models to the task.
We show that the strongest entailment models are hybrids combining the
relatively high precision of entailment graphs with the high recall of
the language models, either by "backing-off" to them when the
entailment graph cannot answer, or by "smoothing" predicates missing
from the graph using LLMs.
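A minimal sketch of the back-off strategy is given below: the
entailment graph is consulted first, and a language-model score is
used only when the graph has no answer. The graph, the LLM scorer,
and the threshold are hypothetical stand-ins for the project's actual
components.

    def hybrid_entails(premise, hypothesis, graph, llm_score, threshold=0.5):
        """Answer from the entailment graph if possible, else back off."""
        if premise in graph and hypothesis in graph[premise]:
            # High-precision symbolic answer: the edge is in the graph.
            return graph[premise][hypothesis] >= threshold
        # High-recall fallback: score the pair with a language model.
        return llm_score(premise, hypothesis) >= threshold

    # Toy components for illustration only.
    graph = {"buy": {"own": 0.9}}
    llm_score = lambda p, h: 0.7 if (p, h) == ("acquire", "own") else 0.1
    print(hybrid_entails("buy", "own", graph, llm_score))      # True, from graph
    print(hybrid_entails("acquire", "own", graph, llm_score))  # True, via back-off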
We made the workhorse CCG parser fully incremental. In work with
John Hale, we have investigated the psycholinguistic predictions of
this parser.
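The combinators that make left-to-right, word-by-word derivation
possible can be illustrated with a toy treatment of CCG forward
application and forward composition over string-valued categories;
this is purely illustrative and is not the project's parser.

    def forward_apply(fn, arg):
        """Forward application (>): X/Y applied to Y yields X."""
        if fn.endswith("/" + arg):
            return fn[: -(len(arg) + 1)]
        return None

    def forward_compose(f, g):
        """Forward composition (>B): X/Y composed with Y/Z yields X/Z."""
        if "/" in f and "/" in g:
            x, y1 = f.rsplit("/", 1)
            y2, z = g.split("/", 1)
            if y1 == y2:
                return x + "/" + z
        return None

    # "might" := (S\NP)/VP and "buy" := VP/NP compose to (S\NP)/NP, so
    # the verb group can be built before its object has been seen.
    print(forward_compose("(S\\NP)/VP", "VP/NP"))  # (S\NP)/NP
    print(forward_apply("(S\\NP)/NP", "NP"))       # (S\NP)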
We have collected and curated large News corpora for English, German,
and Chinese.
The linguistic and computational theory of CCG was developed
extensively. We also investigated the implications of CCG and its
tools for under-resourced languages, the processing of disfluent
speech and of music, and language acquisition.
The models of entailment and inference developed over the course of
the project improve on the prior state-of-the-art.
Our novel hybrid models of entailment and natural language inference,
combining symbolic representations with neural Large Language Models,
achieve these results, and point the way for future research.
The strength of these results and the interdisciplinary impact of the
CCG theory of grammar and the associated parsers are attested by over
fifty refereed publications arising from the project.