Periodic Reporting for period 1 - M-FleNS (Multilingual Flexible Neuro-Symbolic Language Generation)
Reporting period: 2022-09-01 to 2024-08-31
At the time the proposal was drafted (mid-2021), state-of-the-art systems for data-to-text generation were neural machine learning methods (e.g., LSTMs) and fine-tuned, or even off-the-shelf, small-sized language models (e.g., T5). These systems required significant resources (data, energy, compute) and struggled with accuracy, biases, low-resource settings and out-of-domain data. In late 2022, three months after the M-FleNS project started, very large instruction-tuned language models became available and the landscape of NLG changed drastically: these models were able to produce human-like texts in a number of languages in a zero-shot setting, which led to their wide adoption despite their enormous resource demands (at both training and inference time). The sudden emergence of very large language models (VLLMs) had two main consequences for the project:
- The challenge of improving the quality of data-to-text systems became much smaller, so we dedicated more effort to the aspects on which VLLMs still fall short, namely energy efficiency and very low-resource settings (as is the case for Irish).
- VLLMs are now extremely popular, but they are black boxes, and knowing how to evaluate the quality of the texts they produce is more crucial than ever. Creating resources and methods for human evaluation of text quality therefore became a focal point of the project.
The main scientific objectives of the project are the following:
1- Improve and extend the existing FORGe rule-based NLG system, which is very energy-efficient although it generally lacks fluency; the system should be made as language-independent as possible, and produce outputs in English, Irish and French.
2- Combine rule-based and (deep-)learning techniques for improving the fluency of the rule-based system while keeping the resource requirements low.
3- Make available a range of automatic and human evaluation methods and resources for assessing the quality of the texts produced by any type of NLG system.
1- Rule-based generation:
- We developed the first NLG system for Irish. Our submission was the runner-up in an international shared task in which systems were evaluated on the WebNLG benchmark, whose inputs are built from subsets of 400 distinct DBpedia properties (a toy illustration of this triple-to-text task follows this list). We ranked above several LLM-based systems according to both automatic and human evaluations, outperformed only by a resource-hungry combination of GPT-3.5 and Google Translate.
- We restructured the generator into more independent modules: the proportion of language-independent rules in the FORGe system increased by over 11%, and the total number of rules grew by over 13% (2,852 rules at the end of the project, 81.3% of them language-independent).
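To make the task concrete, the toy Python sketch below verbalises a WebNLG-style input (a set of subject-property-object DBpedia triples) with one hand-written template per property. The triples, templates and function names are hypothetical illustrations; FORGe itself applies graph-transduction rules over several linguistic layers rather than flat templates.

```python
# Toy verbaliser for WebNLG-style DBpedia triples (illustrative only;
# FORGe uses multi-layer graph-transduction rules, not flat templates).

# A WebNLG-style input: a set of (subject, property, object) triples.
TRIPLES = [
    ("Dublin", "country", "Ireland"),
    ("Dublin", "populationTotal", "554554"),
]

# One hand-written template per DBpedia property (hypothetical examples).
TEMPLATES = {
    "country": "{s} is located in {o}.",
    "populationTotal": "{s} has a population of {o}.",
}

def verbalise(triples):
    """Turn each triple into a sentence using its property's template."""
    sentences = []
    for s, p, o in triples:
        template = TEMPLATES.get(p, "The {p} of {s} is {o}.")  # fallback
        sentences.append(template.format(s=s, p=p, o=o))
    return " ".join(sentences)

print(verbalise(TRIPLES))
# -> Dublin is located in Ireland. Dublin has a population of 554554.
```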
2- Neuro-symbolic generation:
- Several datasets have been released, most notably a large (15K data points) parallel English-Irish-French dataset with rich linguistic annotations (10 levels of representation per data point: semantics, syntax, morphology, etc.).
- 48 paraphrasing modules were fine-tuned on the data and combined with the rule-based generator. Our best combination with a small-sized model obtained scores very close to those of systems that use up to 4,000 times more parameters.
- 58 lightweight modules were developed for the Text Structuring task, i.e. grouping the input properties into sentences (see the sketch after this list). With far fewer resources, we outperformed the accuracy of the pre-LLM neural state of the art (though not of VLLMs).
- A fully functional online demonstration of multilingual in-parallel rule-based and LLM-based generation of Wikipedia page stubs was released.
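As an illustration of the Text Structuring task mentioned above, the minimal sketch below partitions input triples into sentence-sized groups with a simple subject-based heuristic. The heuristic and its parameter names are hypothetical: the project's 58 lightweight modules are trained models that learn such groupings from data.

```python
# Minimal sketch of Text Structuring: partition input triples into
# sentence-sized groups. The subject-grouping heuristic below is a
# hypothetical baseline, not the project's trained modules.
from itertools import groupby

def structure(triples, max_per_sentence=2):
    """Group triples by shared subject, capping each sentence's size."""
    plan = []
    by_subject = sorted(triples, key=lambda t: t[0])
    for _, group in groupby(by_subject, key=lambda t: t[0]):
        group = list(group)
        for i in range(0, len(group), max_per_sentence):
            plan.append(group[i:i + max_per_sentence])  # one sentence each
    return plan

triples = [
    ("Dublin", "country", "Ireland"),
    ("Dublin", "populationTotal", "554554"),
    ("Ireland", "currency", "Euro"),
]
for sentence_triples in structure(triples):
    print(sentence_triples)
# [('Dublin', 'country', 'Ireland'), ('Dublin', 'populationTotal', '554554')]
# [('Ireland', 'currency', 'Euro')]
```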
3- Evaluation of NLG systems:
- Human evaluation criteria were defined and tested, and a pipeline for recruiting human evaluators was developed and released.
- Code was released that makes it easy to run a set of standard data-to-text NLG metrics (see the metric sketch after this list).
- We contributed to important advances in human evaluation procedures, including a tutorial on human evaluation in NLP, a standard methodology for selecting quality criteria for human evaluations, and a study of reproducibility issues in NLP experiments.
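To illustrate how such metric code is typically invoked, the sketch below scores a system output against references with the sacrebleu library (BLEU and chrF). The example strings are hypothetical, and the project's released wrapper may cover a different metric set.

```python
# Scoring data-to-text outputs with sacrebleu (pip install sacrebleu).
# Example strings are hypothetical; the project's wrapper may bundle
# additional metrics.
import sacrebleu

hypotheses = ["Dublin is the capital of Ireland."]
# One reference stream: references[i] holds the i-th reference for every
# hypothesis.
references = [["Dublin is the capital city of Ireland."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}, chrF = {chrf.score:.1f}")
```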
- First symbolic generation tools for the Irish language. Impact: for the tool to have any impact, it must be available and usable, so we created an online tool connecting Semantic Web resources and the project’s NLG pipeline, available from GitHub.
- New standard for selecting quality criteria for human evaluations. Impact: To ensure that the quality criteria for evaluation are used as widely as possible, (i) we are in the process of making the online version of the QCET tool publicly available, and (ii) in collaboration with the ISO/IEC JTC 1/SC 42 technical committee, the taxonomy of quality criteria is being established as a new standard for human evaluation of NLP systems under ISO/IEC AWI 23282 (Artificial intelligence — Evaluation methods for accurate natural language processing systems).
- New reference resources for human evaluation in NLP. Impact: We released the resources corresponding to the 8 units of a tutorial on human evaluation, which was presented for the first time at INLG 2024 in Tokyo. The tutorial has already been submitted to other venues and is being considered for a class at the Institute for Statistical and Data Science.
- First parallel multilingual (English, Irish, French) and multi-layered dataset with 10 levels of linguistic annotation (~2 million nodes per language). Impact: These datasets can be accessed freely from the Mod-D2T GitHub. They were released in the standard CoNLL-U format (a minimal reading sketch follows), so they can be used, or converted, for machine translation, text generation, text understanding (parsing, semantic role labeling, etc.), LLM controllability, or for training smaller specialised modules.
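Since CoNLL-U is a plain 10-column tab-separated format, the dataset can be loaded with a few lines of code. The minimal reader below (the filename is hypothetical) yields one sentence at a time; the conllu package on PyPI is a full-featured alternative.

```python
# Minimal CoNLL-U reader (the filename below is hypothetical; the
# `conllu` PyPI package is a full-featured alternative).
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

def read_conllu(path):
    """Yield sentences as lists of token dicts keyed by CoNLL-U fields."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#"):       # skip sentence-level comments
                continue
            if not line:                   # blank line ends a sentence
                if sentence:
                    yield sentence
                sentence = []
                continue
            sentence.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if sentence:                           # handle missing final newline
        yield sentence

for sent in read_conllu("mod_d2t_en.conllu"):  # hypothetical filename
    print([tok["form"] for tok in sent])
```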