Our paper on indirect data leakage in LLMs (Balloccu et al., EACL 2024) pointed out a problem completely overlooked by previous research. It received considerable publicity on social media and an award at the top-tier EACL 2024 conference, where it was presented. We believe this will result in more careful handling of closed-source LLMs, at least within the natural language processing community.
Following on from this research and other work involving LLMs as evaluators, we introduced a novel approach to evaluating LLMs using ad-hoc datasets and combined assessment by a strong LLM and humans (Kasner & Dusek, ACL 2024). The assessment focuses on identifying erroneous words or phrases, as opposed to rating whole texts on a scale. The use of ad-hoc data, newly collected every time, bypasses the data contamination/leakage issue in LLMs. Our approach has seen some interest in the community, and we are currently extending the human evaluation interface (Kasner et al., INLG 2024) and optimizing the use of LLMs in evaluation.
Our work on LLMs applied in task-oriented dialogue (Hudecek & Dusek, SIGDIAL 2023) presents an entirely new approach to the problem. It allows much wider application of task-oriented dialogue systems, as it only needs a few training examples. Unlike simple LLM prompting, it still maintains database access and provides correct search results. The approach is successful and has attracted a lot of attention in the community, but it still has room for improvement. We are considering incorporating enhancements from our previous work on interpretable dialogue modelling.
We presented an approach to data-to-text generation with a fully neural, interpretable pipeline that works with no in-domain training data (Kasner & Dusek, ACL 2022; Kasner et al., EACL 2023). This was not possible with any previous approach, and it allows much wider access to data-to-text generation technology. The result was overshadowed by LLMs (ChatGPT was introduced a few months after our first paper on this topic), even though it still provides superior performance on this particular task. We are now conducting further research on interpretable NLG with LLMs, leveraging LLMs’ code generation abilities (Warczynski et al., INLG 2024). This shows promise but needs further extensions to retain accuracy, interpretability and generality at the same time.
We also introduced a “critic” approach to decoding from any generative language model, which detects when the language model is making a mistake and steers it away from the error (Lango & Dusek, EMNLP 2023). This is a minor improvement, but it allows an in-place fix for any existing system, with minimal changes to the output.