Periodic Reporting for period 1 - NG-NLG (Next-Generation Natural Language Generation)
Reporting period: 2022-04-01 to 2024-09-30
In detail, the objectives of NG-NLG are to (1) make NLG more accurate and interpretable, (2) enable practical reasoning over the data, (3) allow fast model adaptation, (4) produce efficient models, and (5) allow reliable evaluation. This will turn LLMs from a hyped but ultimately unreliable and limited text composition tool into one that can be trusted with factual accuracy and raw data processing.
Furthermore, we aim to create better awareness of the inner workings (and the ultimately very basic underlying principles) of current LLMs.
While the rapid development of LLMs was not foreseen by the project, we now focus on using LLMs as a tool and extending them in various ways, aiming at greater explainability and accuracy. We are also examining LLMs’ capabilities and limits. We still consider smaller language models and include them in our experiments, due to their practical size and efficiency.
NG-NLG project website: https://ufal.mff.cuni.cz/grants/ng-nlg
The project pushed the state of the art with smaller pretrained neural language models. We presented a pipeline approach for neural generation that needs no in-domain training data, building on simple handcrafted templates expressing a single fact each, which are then edited and aggregated into the final text. We later extended the approach, replacing the template step with a neural model, thus creating a fully neural approach to data-to-text generation with high accuracy and low training data requirements. We also introduced a new “critic” generation approach, in which an existing language model is steered by a specially trained classifier to produce more factually accurate outputs.
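As a rough illustration of the template-based pipeline idea, the following Python sketch shows how single-fact templates can be filled and then aggregated into a text. The templates, field names and the trivial aggregation step are hypothetical placeholders; the actual systems replace these steps with trained neural models for ordering, fusion and editing.

    # Minimal sketch of the template-then-aggregate idea (illustrative only).
    TEMPLATES = {
        "type": "{subject} is a restaurant.",
        "food": "{subject} serves {value} food.",
        "area": "{subject} is located in the {value}.",
    }

    def facts_to_sentences(subject, facts):
        # One simple sentence per fact, filled from a handcrafted template.
        return [TEMPLATES[attr].format(subject=subject, value=value)
                for attr, value in facts]

    def aggregate(sentences):
        # Trivial placeholder: a trained model would order, fuse and smooth
        # these sentences into a single fluent paragraph.
        return " ".join(sentences)

    facts = [("type", None), ("food", "Italian"), ("area", "city centre")]
    print(aggregate(facts_to_sentences("The Golden Palace", facts)))
    # -> "The Golden Palace is a restaurant. The Golden Palace serves Italian food. ..."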
With the introduction of LLMs, we launched several works examining their applications and limits. We created the first LLM-based system for task-oriented dialogue, using LLMs’ ability to learn from just a few examples while still maintaining grounding in an explicit dialogue state and database search. We evaluated LLMs on data-to-text generation and showed their limits – with complex input data, LLMs tend to hallucinate, i.e. produce plausible but factually incorrect text. We also exploited LLMs’ code-generation capabilities to generate text composition rules. Finally, we raised awareness of LLM evaluation issues in the research community – we were the first to describe the problem of indirect data leaks, where evaluation benchmark data can end up in LLMs’ training data because vendors use user inputs to improve their models.
We also developed several ways of using LLMs to evaluate generated text, including simple scoring as well as extraction/annotation of error-specific spans, which currently appears to be the most reliable approach for evaluating factual accuracy. Connected to these evaluation efforts, the project is also involved in developing better practices in the NLP community for experiment reproducibility and more insightful results analysis.
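The following is a simplified sketch of the span-based LLM evaluation idea. The prompt wording, the JSON schema, the placeholder llm() call and the crude span-coverage metric are illustrative assumptions, not the exact setup used in our papers.

    import json

    PROMPT = """Given the source data and a generated text, list all factual errors.
    Return a JSON list of objects: [{{"span": "<erroneous words>", "type": "<error category>"}}].

    Data: {data}
    Text: {text}
    """

    def annotate_errors(llm, data, text):
        # Ask the LLM for error spans and parse its JSON answer;
        # `llm` stands in for any chat-completion API returning a text response.
        try:
            return json.loads(llm(PROMPT.format(data=data, text=text)))
        except json.JSONDecodeError:
            return []   # malformed output = no usable annotation

    def span_error_rate(annotations, text):
        # Crude aggregate metric: fraction of tokens covered by annotated error spans.
        tokens = text.split()
        bad = sum(len(a["span"].split()) for a in annotations if a.get("span"))
        return min(bad, len(tokens)) / max(len(tokens), 1)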
Some of our results show interdisciplinary applications: we experiment in the medical domain, examining NLG usage for counselling, and we have created systems that integrate NLG within marketing and online promotion scenarios, from sentiment alteration in reviews to automatic recommender system explanations and clickbait headline analysis.
Building on this research and other work involving LLMs as evaluators, we presented a novel approach to evaluating LLMs using ad-hoc datasets and combined assessment by a strong LLM and humans (Kasner & Dusek, ACL 2024), focused on identifying erroneous words or phrases (as opposed to rating whole texts on a scale). The use of ad-hoc data, newly collected for every evaluation, bypasses the data contamination/leakage issue in LLMs. The approach has attracted some interest in the community, and we are currently extending the human evaluation interface (Kasner et al., INLG 2024) and optimizing how the LLM is applied in evaluation.
Our work on LLMs applied to task-oriented dialogue (Hudecek & Dusek, SIGDIAL 2023) presents an entirely new approach to the problem. It allows much wider application of task-oriented dialogue systems, as it needs only a few training examples. Unlike simple LLM prompting, it still maintains database access and provides correct search results. The approach is successful and has attracted a lot of attention in the community, but it still has room for improvement; we are considering incorporating enhancements from our previous work on interpretable dialogue modelling.
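A schematic sketch of the underlying idea follows: the LLM first extracts an explicit dialogue state, the system queries a real database, and the LLM then verbalises only the retrieved results. The toy database, slot names, prompts and the placeholder llm() call are illustrative only and do not reproduce the exact SIGDIAL 2023 setup.

    import json

    DATABASE = [
        {"name": "Curry Garden", "food": "indian", "area": "centre"},
        {"name": "Pizza Hut", "food": "italian", "area": "south"},
    ]

    def track_state(llm, history):
        # Few-shot prompt the LLM to output the user's constraints as JSON slots.
        prompt = ("Extract the user's constraints as JSON, e.g. "
                  '{"food": "italian", "area": "south"}.\n\nDialogue:\n' + history)
        return json.loads(llm(prompt))

    def db_search(state):
        # Ground the answer in an actual database lookup rather than LLM memory.
        return [row for row in DATABASE
                if all(row.get(slot) == value for slot, value in state.items())]

    def respond(llm, history, results):
        # The LLM verbalises only the retrieved results.
        prompt = (f"Database results: {json.dumps(results)}\n"
                  f"Dialogue:\n{history}\nSystem reply:")
        return llm(prompt)

Keeping the state and database search explicit is what distinguishes this setup from plain LLM prompting: the model cannot invent entities that are not in the search results.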
We presented an approach to data-to-text generation with a fully neural, interpretable pipeline that works with no in-domain training data (Kasner & Dusek, ACL 2022; Kasner et al., EACL 2023). This was simply not possible with any previous approach, and it allows much wider access to data-to-text generation technology. The result was overshadowed by LLMs (ChatGPT was introduced a few months after our first paper on the topic), even though our approach still provides superior performance on this particular task. We now conduct further research on interpretable NLG with LLMs, leveraging their code-generation abilities (Warczynski et al., INLG 2024). This shows promise but needs further extensions to retain accuracy, interpretability and generality at the same time.
We also introduced a “critic” approach to decoding from any generative language model, which detects when the language model is making a mistake and steers it away from the error (Lango & Dusek, EMNLP 2023). This is a minor improvement, but it allows an in-place fix for any existing system, with minimal changes to the output.
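A simplified sketch of critic-guided decoding is shown below. Here lm_topk() and critic_prob() stand in for the base language model and the trained classifier, and the additive log-weighting is a simplification of the actual method rather than its exact formulation.

    import math

    def critic_guided_step(lm_topk, critic_prob, data, prefix, k=10, alpha=1.0):
        # Rescore the LM's top-k candidate tokens with the critic's estimate of
        # how faithful the extended text stays to the input data.
        best_token, best_score = None, -math.inf
        for token, logp in lm_topk(prefix, k):
            faithfulness = critic_prob(data, prefix + [token])   # value in (0, 1)
            score = logp + alpha * math.log(max(faithfulness, 1e-9))
            if score > best_score:
                best_token, best_score = token, score
        return best_token

Because the critic only reranks candidates the base model already proposes, the generated text stays close to what the unmodified system would produce, which is why the fix can be applied in place.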