Periodic Reporting for period 1 - ERINIA (Evaluating the Robustness of Non-Credible Text Identification by Anticipating Adversarial Actions)
Reporting period: 2022-11-01 to 2024-10-31
The goal of the ERINIA project is to explore the robustness of text classifiers in the area of non-credible content detection by investigating methods for generating adversarial examples. Such methods perform small perturbations on a given piece of text, so that its meaning is preserved but the decision of the investigated classifier is reversed. To that end, previously unexplored directions will be pursued, including training reinforcement learning solutions and leveraging research on simplification and style transfer. Finally, the developed tools will be used to check the robustness of current state-of-the-art misinformation detection solutions.
The project includes a range of training activities for the researcher and a plan for disseminating the obtained results to various research communities. It also takes into account society at large, as the project outcomes can inform further discussion on whether automatic content filtering is a viable solution to the misinformation problem.
1) Assessing the vulnerability of common text classification approaches to generic adversarial attacks.
This task was based on the creation of BODEGA, an evaluation framework covering various text classification approaches, misinformation detection tasks and adversarial example generators. It was used to perform a systematic analysis of currently existing vulnerabilities, as well as to set up an environment for testing future solutions in this area.
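For illustration, the sketch below shows the kind of evaluation loop such a framework runs: an attack procedure queries a victim classifier, and the evaluation records whether the decision flips, how many queries were needed and how similar the perturbed text remains. All components here are simplified stand-ins written for this summary, not BODEGA's actual API, which wraps trained classifiers and uses learned similarity metrics.

```python
# Illustrative sketch of an adversarial-robustness evaluation loop in the
# spirit of BODEGA. All components are simplified stand-ins: the real
# framework wraps trained victim classifiers, attack methods and
# similarity measures such as BLEURT and Levenshtein distance.
from difflib import SequenceMatcher


class ToyVictim:
    """Hypothetical victim classifier: flags texts containing 'fake'."""

    def __init__(self):
        self.query_count = 0  # the evaluation also tracks how many queries an attack needs

    def predict(self, text):
        self.query_count += 1
        return 1 if "fake" in text.lower() else 0  # 1 = non-credible


def naive_attack(text, victim, original_label):
    """Toy character-level attack: insert a zero-width space into one word at a time."""
    words = text.split()
    for i, word in enumerate(words):
        perturbed = words[:i] + [word[0] + "\u200b" + word[1:]] + words[i + 1:]
        candidate = " ".join(perturbed)
        if victim.predict(candidate) != original_label:
            return candidate  # label flipped: adversarial example found
    return None


def similarity(a, b):
    """Cheap proxy for meaning preservation (the real evaluation uses learned metrics)."""
    return SequenceMatcher(None, a, b).ratio()


samples = [("This fake cure removes toxins overnight", 1),
           ("The ministry published the updated figures", 0)]

successes = 0
for text, label in samples:
    victim = ToyVictim()
    adversarial = naive_attack(text, victim, label)
    if adversarial is not None:
        successes += 1
        print(f"flipped in {victim.query_count} queries, "
              f"similarity={similarity(text, adversarial):.2f}")

print(f"attack success rate: {successes / len(samples):.0%}")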
2) Preparing adversarial attacks specifically tuned for misinformation detection tasks.
This direction was explored through the organisation of InCrediblAE, a shared task within the CheckThat! evaluation lab at the CLEF 2024 conference. Each participating team had to generate adversarial examples in an environment based on BODEGA, with two extensions: an additional dataset (COVID-19 misinformation) and an additional victim classifier (adversarially fine-tuned RoBERTa). The participants prepared solutions specifically designed for the challenge and easily outperformed the generic approaches tested in BODEGA.
3) Exploring the adaptive generation of adversarial examples using reinforcement learning.
This task is motivated by an observation about the misinformation landscape: much of the misleading information found on social media does not come from individual confused users, but from professional organisations specialised in spreading misinformation. Such actors are able to accumulate knowledge of the weaknesses of content filtering approaches and use it to perform increasingly targeted attacks. In order to test robustness against such attacks, we created XARELLO: a generator of adversarial examples that uses reinforcement learning to learn from successful attacks and tune its output to a particular victim classifier.
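The sketch below conveys the adaptive idea in a deliberately simplified form: an epsilon-greedy bandit over a handful of perturbation actions receives a reward whenever the victim's decision flips and gradually prefers the actions that worked before. XARELLO itself trains a generative language model with reinforcement learning, so the victim, the action set and the learner here are illustrative stand-ins only.

```python
# Simplified sketch of adaptive adversarial example generation: an
# epsilon-greedy bandit learns which perturbation actions tend to flip a
# victim classifier and reuses that knowledge on later inputs.
import random

ACTIONS = {
    "insert_zwsp": lambda w: w[0] + "\u200b" + w[1:] if len(w) > 1 else w,
    "swap_case": lambda w: w.swapcase(),
    "repeat_last_char": lambda w: w + w[-1],
}


def victim_predict(text):
    """Hypothetical victim: flags texts containing the token 'hoax'."""
    return 1 if "hoax" in text.lower().split() else 0


class BanditAttacker:
    def __init__(self, epsilon=0.2):
        self.values = {a: 0.0 for a in ACTIONS}  # running reward estimate per action
        self.counts = {a: 0 for a in ACTIONS}
        self.epsilon = epsilon

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(ACTIONS))  # explore
        return max(self.values, key=self.values.get)  # exploit the best action so far

    def update(self, action, reward):
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]

    def attack(self, text, budget=10):
        original = victim_predict(text)
        for query in range(1, budget + 1):
            action = self.choose()
            words = text.split()
            i = random.randrange(len(words))
            words[i] = ACTIONS[action](words[i])
            candidate = " ".join(words)
            flipped = victim_predict(candidate) != original
            self.update(action, 1.0 if flipped else 0.0)  # reward actions that flip the victim
            if flipped:
                return candidate, query
        return None, budget


random.seed(0)
attacker = BanditAttacker()
for text in ["this miracle hoax spreads fast", "another obvious hoax story here"]:
    adversarial, queries = attacker.attack(text)
    if adversarial is not None:
        print(f"flipped after {queries} queries: {adversarial}")
    else:
        print(f"no adversarial example within {queries} queries")
```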
4) Using large language models to improve meaning preservation of adversarial examples
The results of the initial work revealed that some models struggle to generate adversarial examples that preserve the meaning of the original text. In this stream of work we focused on improving this aspect with TREPAT, which uses large generative models, such as LLAMA or GEMMA, to provide initial paraphrases. These are then decomposed into small changes, which are subsequently applied to the original text until an adversarial example is found.
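A minimal sketch of this decomposition step is given below. The paraphrase is hard-coded for illustration; in the actual pipeline it would be produced by a large generative model, and the victim would be a trained misinformation classifier rather than a keyword rule.

```python
# Sketch of the paraphrase-decomposition idea: take a paraphrase of the
# input, break the difference into small word-level edits, and apply them
# one at a time until the victim classifier changes its decision.
from difflib import SequenceMatcher


def victim_predict(text):
    """Hypothetical victim: flags texts mentioning a 'miracle cure'."""
    return 1 if "miracle cure" in text.lower() else 0


def word_level_edits(original, paraphrase):
    """Decompose the paraphrase into separate replace/insert/delete operations."""
    matcher = SequenceMatcher(None, original.split(), paraphrase.split())
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


def apply_edits_until_flip(original, paraphrase):
    """Apply the smallest prefix of edits that is enough to flip the decision."""
    target = paraphrase.split()
    label = victim_predict(original)
    current = original.split()
    offset = 0  # earlier edits shift the indices of later ones
    for _tag, i1, i2, j1, j2 in word_level_edits(original, paraphrase):
        current[i1 + offset:i2 + offset] = target[j1:j2]
        offset += (j2 - j1) - (i2 - i1)
        candidate = " ".join(current)
        if victim_predict(candidate) != label:
            return candidate  # adversarial example using only a few of the changes
    return None  # even the full paraphrase does not flip the victim


original = "Doctors confirm this miracle cure heals the flu in one day"
paraphrase = "Physicians say this wonder remedy heals influenza within a day"
print(apply_edits_until_flip(original, paraphrase))
```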
1) Assessing the vulnerability of common text classification approaches to generic adversarial attacks.
The experiments with BODEGA revealed that popular text classification approaches are indeed very vulnerable to adversarial examples. Depending on the particular credibility assessment task, between 60% and 90% of text samples can be manipulated in a way that changes the victim's decision on their credibility. However, many of these changes require numerous queries or harm semantic similarity, indicating possible defence techniques. More worryingly, classifiers built by fine-tuning modern large language models are no more robust than their simpler predecessors. Our experiments were the first to systematically analyse this problem.
2) Preparing adversarial attacks specifically tuned for misinformation detection tasks.
The InCrediblAE shared task attracted six teams from universities around the world. Their solutions were more effective than generic approaches both in confusing the victim classifier and in finding small changes to make, establishing a new state of the art for this task. However, manual annotation revealed that solutions scoring low on automatic evaluation can still be highly effective in the eyes of human readers, who are unlikely to notice changes affecting individual characters. Overall, the shared task framework allowed us to expand the reach of the project by introducing these problems to new researchers.
3) Exploring the adaptive generation of adversarial examples using reinforcement learning.
The solution prepared in this work, XARELLO, allowed us to discover new vulnerabilities of text classifiers deployed in content filtering applications. We have shown that if an attacker is allowed to interact with the victim model (adaptation phase), it can use the information gathered to find adversarial examples (attack phase) that are of higher quality and require fewer attempts. For example, in the fact-checking task, where the previously best approach required querying the victim model 130-150 times to generate one successful modification, XARELLO finds a successful example in 5-7 queries.
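The following sketch illustrates the two-phase protocol with toy components (a keyword-based victim and three character-level perturbations, none of which come from the actual experiments): the attacker first learns on practice texts which perturbations flip the victim, and the attack phase then counts how many queries it needs on a new text compared with an attacker that has not adapted.

```python
# Sketch of the adaptation/attack protocol with toy components: in the
# adaptation phase the attacker probes the victim on practice texts and
# remembers which perturbations flip its decision; in the attack phase it
# uses that preference order on a new text and we count the queries needed.
def victim_predict(text):
    return 1 if "scam" in text.lower().split() else 0


PERTURBATIONS = {
    "uppercase": lambda w: w.upper(),                      # never fools this toy victim
    "zero_width_space": lambda w: w[0] + "\u200b" + w[1:],
    "repeat_last_char": lambda w: w + w[-1],
}


def try_attack(text, order):
    """Perturb one word at a time, trying perturbations in the given order."""
    label, queries, words = victim_predict(text), 0, text.split()
    for name in order:
        for i, word in enumerate(words):
            candidate = " ".join(words[:i] + [PERTURBATIONS[name](word)] + words[i + 1:])
            queries += 1
            if victim_predict(candidate) != label:
                return queries  # number of queries needed for a successful flip
    return None


# Adaptation phase: learn which perturbations flip the victim on practice texts.
scores = {name: 0 for name in PERTURBATIONS}
for text in ["a blatant scam offer", "this scam targets seniors"]:
    for name in PERTURBATIONS:
        if try_attack(text, [name]) is not None:
            scores[name] += 1
preferred = sorted(PERTURBATIONS, key=scores.get, reverse=True)

# Attack phase: queries to success with and without the learned preference.
test_text = "yet another crypto scam appears"
print("adapted attacker queries:", try_attack(test_text, preferred))
print("naive attacker queries:", try_attack(test_text, list(PERTURBATIONS)))
```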
4) Using large language models to improve meaning preservation of adversarial examples.
In order to improve meaning preservation, we tested a variety of large generative models and prompting strategies. Moreover, to improve the realism of the evaluation scenarios, attackers were allowed only a limited number of queries, in line with the policies of major social media platforms. The results indicated that our solution (TREPAT) excelled in this setup, especially when the input text is too long to allow an exhaustive search over word replacements.
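The query-limited setting can be pictured as a thin wrapper around the victim model that refuses to answer once a fixed budget is exhausted; the sketch below uses an arbitrary limit of three queries purely for illustration and does not reflect any platform's actual policy.

```python
# Minimal sketch of the query-limited evaluation setting: the victim model
# is wrapped so that an attack can only issue a fixed number of queries,
# mimicking the rate limits imposed by content platforms.
class QueryBudgetExceeded(Exception):
    pass


class BudgetedVictim:
    def __init__(self, predict_fn, budget):
        self.predict_fn = predict_fn
        self.budget = budget
        self.used = 0

    def predict(self, text):
        if self.used >= self.budget:
            raise QueryBudgetExceeded(f"budget of {self.budget} queries exhausted")
        self.used += 1
        return self.predict_fn(text)


victim = BudgetedVictim(lambda text: int("hoax" in text.lower()), budget=3)
try:
    for candidate in ["claim variant 1", "claim variant 2",
                      "claim variant 3", "claim variant 4"]:
        print(candidate, "->", victim.predict(candidate))
except QueryBudgetExceeded as err:
    print("attack stopped:", err)
```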
Taken together, our work has demonstrated various weaknesses of text classification approaches and highlighted possible attack scenarios. We hope this work will aid the development of more robust and reliable solutions for managing user-generated content.