The results of the work performed in the four stages described above are as follows:
1) Assessing the vulnerability of common text classification approaches to generic adversarial attacks.
The experiments with BODEGA revealed that popular text classification approaches are indeed very vulnerable to adversarial examples. Depending on the particular credibility assessment task, between 60% and 90% of text samples can be manipulated in a way that changes the victim classifier's decision on their credibility. However, many of these changes require numerous queries or harm semantic similarity, which points to possible defence techniques. More worryingly, classifiers built by fine-tuning modern large language models are not more robust than their simpler predecessors. Our experiments were the first to systematically analyse this problem.
2) Preparing adversarial attacks specifically tuned for misinformation detection tasks.
The InCrediblAE shared task attracted six teams from universities around the world. Their solutions were more effective than generic approaches both in confusing the victim classifier and in finding small changes to make, establishing a new state of the art for this task. However, manual annotation revealed that solutions that scored low in automatic evaluation can also be highly effective in the eyes of humans, who are unlikely to notice character-level changes. Overall, the shared task framework allowed us to expand the reach of the project by introducing these problems to new researchers.
3) Exploring the adaptive generation of adversarial examples using reinforcement learning.
The solution prepared in this work, XARELLO, allowed us to discover new vulnerabilities of text classifiers deployed in content filtering applications. We have shown that if an attacker is allowed to interact with the victim model (adaptation phase), they can use the information gathered to find adversarial examples (attack phase) that are of higher quality and require fewer attempts. For example, in the fact-checking task, where the previously best approach would require querying the victim model 130-150 times to generate one modification, XARELLO finds a successful example in 5-7 queries.
4) Using large language models to improve meaning preservation of adversarial examples.
To improve meaning preservation, we tested a variety of large generative models and prompts. Moreover, to make the evaluation scenarios more realistic, the attackers were allowed to issue only a limited number of queries, in line with the policies of major social media platforms. The results indicated that our solution (TREPAT) excelled in this setup, especially when the input text is too long to allow an exhaustive search of word replacements.
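To make the notion of a query-limited attack concrete, the sketch below shows one simple way such a budget could be enforced. It is an illustration only and not the TREPAT or BODEGA code: the keyword-based `toy_victim`, the `BudgetedVictim` wrapper, the `greedy_attack` search and the replacement table are all hypothetical names introduced here, and the greedy word substitution merely stands in for whatever search strategy an attacker might use.

```python
from typing import Callable, Dict, List, Optional


class QueryBudgetExceeded(Exception):
    """Raised when the attacker asks for more queries than allowed."""


class BudgetedVictim:
    """Wraps a classifier and refuses to answer once the budget is spent."""

    def __init__(self, classify: Callable[[str], int], budget: int) -> None:
        self.classify = classify
        self.budget = budget
        self.queries = 0

    def __call__(self, text: str) -> int:
        if self.queries >= self.budget:
            raise QueryBudgetExceeded()
        self.queries += 1
        return self.classify(text)


def toy_victim(text: str) -> int:
    """Hypothetical stand-in for a credibility classifier.

    Returns 1 (non-credible) if the text contains the word "shocking",
    and 0 (credible) otherwise.
    """
    return 1 if "shocking" in text.lower().split() else 0


def greedy_attack(
    text: str,
    victim: BudgetedVictim,
    replacements: Dict[str, List[str]],
) -> Optional[str]:
    """Try single-word replacements until the victim's decision changes.

    Returns the first modified text that flips the label, or None if no
    candidate works or the query budget runs out first.
    """
    words = text.split()
    try:
        original_label = victim(text)  # the first query also counts
        for i, word in enumerate(words):
            for candidate in replacements.get(word.lower(), []):
                modified = " ".join(words[:i] + [candidate] + words[i + 1:])
                if victim(modified) != original_label:
                    return modified
    except QueryBudgetExceeded:
        pass
    return None
```

With a text such as "A shocking claim about vaccines" and a table offering replacements for "a" and "shocking", this search needs four queries to succeed, so it fails under a budget of three. The longer the input, the more candidates an exhaustive search has to try, which is why a tight budget favours methods that choose their modifications more carefully.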
Taken together, our work has revealed various weaknesses of text classification approaches and highlighted possible attack scenarios. We hope it will aid in building more robust and reliable solutions for managing user-generated content.