Next-Generation Natural Language Generation

Project information

NG-NLG

Grant agreement ID: 101039303

DOI

10.3030/101039303

EC signature date: 31 March 2022

Start date: 1 April 2022

End date: 31 March 2027

Funded by

European Research Council (ERC)

Total cost

€ 1 420 375,00

EU contribution

€ 1 420 375,00

Coordinated by

UNIVERZITA KARLOVA
Czechia

Periodic Reporting for period 1 - NG-NLG (Next-Generation Natural Language Generation)

Reporting period: 2022-04-01 to 2024-09-30

The main aim of the NG-NLG project is to make text generation systems (natural language generation, NLG) much more broadly useful and so allow wide adoption of the technology. This will enable anyone to obtain automatic descriptions of data in domains such as weather, business, sports, or product descriptions. The project was conceived at a time when NLG was extremely expensive and accessible only to large companies. The only options were handcrafted systems, which require a lot of manual work and are limited in scope, or neural systems, which require substantial amounts of training data and show poor reliability. With the introduction of large language models (LLMs) such as GPT-3.5/4, neural systems can learn from just a few examples and are very accessible, even to the public. However, they are still unreliable and hard to control: they suffer from hallucinations (producing text not grounded in facts), do not adhere to the required task, and are unable to reason over the input data. More research is needed to find the LLMs’ realistic performance limits, as many current benchmarks have leaked into their training data, giving them an unfair advantage.

In detail, the objectives of NG-NLG are to (1) make NLG more accurate and interpretable, (2) enable practical reasoning over the data, (3) allow fast model adaptation, (4) produce efficient models, and (5) allow reliable evaluation. This will turn LLMs from a hyped but ultimately unreliable and limited text composition tool into one that can be trusted with factual accuracy and raw data processing.
Furthermore, we aim to raise awareness of the inner workings (and ultimately very basic underlying principles) of current LLMs.

Although the rise of LLMs was not foreseen when the project was conceived, we now focus on using LLMs as a tool and extending them in various ways, aiming at greater explainability and accuracy. We are also examining LLMs’ capabilities and limits. We still consider smaller language models and include them in our experiments because of their practical size and efficiency.

NG-NLG project website: https://ufal.mff.cuni.cz/grants/ng-nlg

Our work has proceeded along several axes: investigating the factual accuracy of neural models and LLMs during generation, examining new evaluation methods, and experimenting with direct applications of our approaches.

The project pushed the state of the art with smaller pretrained neural language models. We presented a pipeline approach for neural generation without the need for in-domain training data. It builds on simple handcrafted templates, each expressing a single fact, which are then edited and aggregated into the final text. We then extended the approach by replacing the template step with a neural model, creating a fully neural approach for data-to-text generation with high accuracy and low training data requirements. We also introduced a new “critic” generation approach, where an existing language model can be steered by a specially trained classifier to produce more factually accurate outputs.
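The template-based pipeline described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the project's implementation: the templates, the predicate names and the functions `verbalize`, `aggregate` and `generate` are hypothetical, and the rule-based `aggregate` function is only a toy placeholder for the editing and aggregation steps.

```python
from typing import Dict, List, Tuple

# A single input fact: (subject, predicate, object).
Triple = Tuple[str, str, str]

# Hypothetical handcrafted templates, one per predicate.
# Each template expresses exactly one fact.
TEMPLATES: Dict[str, str] = {
    "country": "{s} is located in {o}.",
    "founded": "{s} was founded in {o}.",
}


def verbalize(triple: Triple) -> str:
    """Turn one fact into one simple sentence using its template."""
    s, p, o = triple
    return TEMPLATES[p].format(s=s, o=o)


def aggregate(sentences: List[str], subject: str) -> str:
    """Toy placeholder for the editing/aggregation step (an assumption).

    Keeps the first sentence unchanged and replaces a repeated
    sentence-initial subject with "It" in the following sentences.
    """
    result = []
    for i, sentence in enumerate(sentences):
        if i > 0 and sentence.startswith(subject + " "):
            sentence = "It" + sentence[len(subject):]
        result.append(sentence)
    return " ".join(result)


def generate(triples: List[Triple]) -> str:
    """Run the pipeline: single-fact sentences first, then aggregation."""
    sentences = [verbalize(t) for t in triples]
    return aggregate(sentences, subject=triples[0][0])


if __name__ == "__main__":
    facts = [
        ("Charles University", "country", "Czechia"),
        ("Charles University", "founded", "1348"),
    ]
    print(generate(facts))
```

Because every sentence starts from a template that states exactly one input fact, each piece of the output can be traced back to the data, which is the property that makes this kind of pipeline interpretable.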

With the introduction of LLMs, we launched several works examining their applications and limits. We created the first LLM-based system for task-oriented dialogue, using LLMs’ ability to learn from just a few examples while still maintaining grounding in an explicit dialogue state and database search. We evaluated LLMs on data-to-text generation and showed their limits: with complex input data, LLMs tend to hallucinate, i.e. produce plausible but factually incorrect text. We also exploit LLMs’ code-generation capabilities to generate text composition rules. Finally, we raise awareness of LLM evaluation issues in the research community: we were the first to describe the problem of indirect data leaks, where evaluation benchmark data may get into LLMs’ training data because vendors use user inputs to improve their models.

We also developed several ways of using LLMs to evaluate generated text, including simple scoring as well as extraction/annotation of error-specific spans, which currently appears to be the most reliable approach for evaluating factual accuracy. In connection with these evaluation efforts, the project is also involved in developing better practices in the NLP community for experiment reproducibility and more insightful analysis of results.

Some of our results have interdisciplinary applications. We experiment in the medical domain, examining the use of NLG for counselling. We also created systems that integrate NLG into marketing and online promotion scenarios, from sentiment alteration in reviews to automatic explanations for recommender systems and clickbait headline analysis.

Our paper on indirect data leakage in LLMs (Balloccu et al., EACL 2024) pointed out a problem completely overlooked by previous research. It received considerable social media publicity and an award at the top-tier EACL 2024 conference, where it was presented. We believe this will result in more careful handling of closed-source LLMs, at least within the natural language processing community.

Following on from this research and other work involving LLMs as evaluators, we show a novel approach to evaluating LLMs using ad-hoc datasets and combined assessment by a strong LLM and humans (Kasner & Dusek, ACL 2024), focused on identifying erroneous words or phrases (as opposed to rating whole texts on a scale). The use of ad-hoc data, newly collected every time, bypasses the data contamination/leakage issue in LLMs. Our approach has seen some interest in the community, and we are currently extending the human evaluation interface (Kasner et al., INLG 2024) and optimizing the use of LLMs in evaluation.

Our work on LLMs applied to task-oriented dialogue (Hudecek & Dusek, SIGDIAL 2023) presents an entirely new approach to the problem. It allows much wider application of task-oriented dialogue systems, as it only needs a few training examples. Unlike simple LLM prompting, it still maintains database access and provides correct search results. The approach is successful and has attracted a lot of attention in the community, but still has room for improvement. We are considering incorporating enhancements from our previous work on interpretable dialogue modelling.

We presented an approach for data-to-text generation with a fully neural, interpretable pipeline that works with no in-domain training data (Kasner & Dusek, ACL 2022; Kasner et al., EACL 2023). This was not possible with any previous approach, and it allows much wider access to data-to-text generation technology. The result was overshadowed by LLMs (ChatGPT was introduced a few months after our first paper on this topic), even though it still provides superior performance on this particular task. We now conduct further research on interpretable NLG with LLMs, leveraging LLMs’ code generation abilities (Warczynski et al., INLG 2024). This shows promise but needs further extensions to retain accuracy, interpretability and generality at the same time.

We also introduced a “critic” approach to decoding from any generative language model, which detects when the language model is making a mistake and steers it away from the error (Lango & Dusek, EMNLP 2023). This is a minor improvement, but it allows an in-place fix for any existing system, with minimal changes to the output.
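The general idea of critic-steered decoding can be sketched in a few lines of Python. This is a hedged illustration only: the scoring rule (the language model's log-probability plus a weighted log of the critic's score), the function names `critic_guided_step` and `toy_critic`, and the toy probabilities are all assumptions made for this example, not details taken from the published method.

```python
import math
from typing import Callable, Dict, List

# A critic maps (prefix tokens, candidate token) to a score in (0, 1]
# estimating how likely the continuation is to be faithful to the data.
Critic = Callable[[List[str], str], float]


def critic_guided_step(
    lm_probs: Dict[str, float],
    critic: Critic,
    prefix: List[str],
    weight: float = 1.0,
) -> str:
    """Pick the next token by combining LM and critic scores.

    Assumed scoring rule: log p_LM(token) + weight * log critic(token).
    With weight = 0 this reduces to ordinary greedy decoding.
    """
    best_token, best_score = "", -math.inf
    for token, prob in lm_probs.items():
        critic_score = max(critic(prefix, token), 1e-9)  # avoid log(0)
        score = math.log(max(prob, 1e-9)) + weight * math.log(critic_score)
        if score > best_score:
            best_token, best_score = token, score
    return best_token


def toy_critic(prefix: List[str], token: str,
               facts: frozenset = frozenset({"12"})) -> float:
    """Hypothetical critic: penalizes numbers that are not in the input data."""
    if token.isdigit() and token not in facts:
        return 0.05
    return 0.9


if __name__ == "__main__":
    prefix = ["The", "temperature", "is"]
    # Toy next-token distribution in which the LM prefers a wrong value.
    lm_probs = {"15": 0.6, "12": 0.4}
    print(critic_guided_step(lm_probs, toy_critic, prefix, weight=0.0))
    print(critic_guided_step(lm_probs, toy_critic, prefix, weight=1.0))
```

In this toy setting the base model alone would output the unsupported value, while the critic-weighted score selects the value present in the data. The underlying language model is not modified, which matches the in-place character of the fix described above.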
