Evaluating the Robustness of Non-Credible Text Identification by Anticipating Adversarial Actions

Project information

ERINIA

Grant agreement ID: 101060930

DOI

10.3030/101060930

Closed project

EC signature date 8 July 2022

Start date 1 November 2022

End date 31 October 2024

Funded under

Marie Skłodowska-Curie Actions (MSCA)

Total cost

No data

EU contribution

€ 165 312.96

Coordinated by

UNIVERSIDAD POMPEU FABRA
Spain

Periodic Reporting for period 1 - ERINIA (Evaluating the Robustness of Non-Credible Text Identification by Anticipating Adversarial Actions)

Reporting period: 2022-11-01 to 2024-10-31

As challenges posed by misinformation become apparent in the modern digital society, state-of-the-art methods of Artificial Intelligence, especially Natural Language Processing (NLP) and Machine Learning, are considered as countermeasures. Indeed, previous research has shown that NLP solutions can detect phenomena such as fake news, social media bots or usage of propaganda techniques. However, little attention has been given to the robustness of these approaches, which is especially important in the case of deliberate misinformation, whose authors would likely attempt to deceive any automatic filtering algorithm to achieve their goals.

The goal of the ERINIA project is to explore the robustness of text classifiers in this application area by investigating methods for generating adversarial examples. Such methods aim to apply small perturbations to a given piece of text, so that its meaning is preserved but the output of the investigated classifier is reversed. To that end, previously unexplored directions will be pursued, including training reinforcement learning solutions and leveraging research on simplification and style transfer. Finally, the developed tools will be used to check the robustness of the current state-of-the-art misinformation detection solutions.
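To make the notion of an adversarial example concrete, the following is a minimal sketch, not a method used in the project. It assumes a toy keyword-based victim classifier (`toy_victim`) and a hypothetical substitution table (`SYNONYMS`); both are invented for illustration. The search replaces words one by one and stops as soon as the victim's decision changes.

```python
from typing import Callable, Dict, Optional

# Hypothetical substitution table, for illustration only. A real attack
# would draw replacements from a language model or a thesaurus.
SYNONYMS: Dict[str, str] = {
    "shocking": "surprising",
    "secret": "undisclosed",
    "miracle": "remarkable",
}


def toy_victim(text: str) -> int:
    """Toy stand-in for a credibility classifier: 1 = non-credible, 0 = credible."""
    triggers = {"shocking", "secret", "miracle"}
    hits = sum(word in triggers for word in text.lower().split())
    return 1 if hits >= 2 else 0


def find_adversarial(text: str, victim: Callable[[str], int]) -> Optional[str]:
    """Replace words left to right, keeping each change, until the label flips."""
    original_label = victim(text)
    words = text.split()
    for i in range(len(words)):
        replacement = SYNONYMS.get(words[i].lower())
        if replacement is None:
            continue
        words[i] = replacement  # keep the change and query the victim
        candidate = " ".join(words)
        if victim(candidate) != original_label:
            return candidate
    return None  # no label flip found with the available substitutions


if __name__ == "__main__":
    print(find_adversarial("shocking secret cure found", toy_victim))
```

In this toy setting a single substitution is enough to reverse the decision, while the text stays close to the original. Real attacks additionally have to measure how well the meaning is preserved.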

The project includes a range of training activities for the researcher and a plan for dissemination of the obtained results to various research communities. It also takes into account the society at large, as the project outcomes can inform further discussion on whether automatic content filtering is a viable solution to the misinformation problem.

The work performed in the project was organised into four stages:

1) Assessing the vulnerability of common text classification approaches to generic adversarial attacks.

This task was based on the creation of BODEGA, an evaluation framework covering various text classification approaches, misinformation detection tasks and adversarial example generators. It was used to perform a systematic analysis of the vulnerabilities that currently exist, as well as to set up an environment for testing future solutions in this area.

2) Preparing adversarial attacks specifically tuned for misinformation detection tasks.

This direction was explored through the organisation of InCrediblAE, a shared task within the CheckThat! evaluation lab at the CLEF 2024 conference. Each participating team had to generate adversarial examples in an environment based on BODEGA, with two extensions: an additional dataset (COVID-19 misinformation) and an additional victim classifier (adversarially fine-tuned RoBERTa). The participants prepared solutions specifically designed for the challenge and easily outperformed the generic approaches tested in BODEGA.

3) Exploring the adaptive generation of adversarial examples using reinforcement learning.

This task is motivated by an observation about the misinformation landscape: much of the misleading information found on social media does not come from individual confused users, but from professional institutions specialised in spreading misinformation. Such actors are able to accumulate knowledge about weaknesses in content filtering approaches and use it to perform increasingly targeted attacks. In order to test robustness against such attacks, we created XARELLO: a generator of adversarial examples that uses reinforcement learning to observe successful attacks and tune its output to a particular victim classifier.

4) Using large language models to improve meaning preservation of adversarial examples.

The results of the initial work revealed that some models struggle to generate adversarial examples that preserve the meaning of the original text. In this stream of work we focused on improving this aspect by using large generative models, such as LLAMA or GEMMA, to provide initial paraphrases. These are then decomposed into small changes, which are subsequently applied to the original text until an adversarial example is found.
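The decomposition step can be sketched as follows. This is an illustrative approximation only: it assumes that the paraphrase is already available as a string (in practice it would come from a generative model), that changes are found with a word-level diff, and that they are applied cumulatively from left to right. The function name `apply_changes_until_flip` and the toy victim are hypothetical, and the project's actual procedure may differ.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional, Tuple


def apply_changes_until_flip(
    original: str, paraphrase: str, victim: Callable[[str], int]
) -> Optional[Tuple[str, int]]:
    """Apply paraphrase edits one at a time; return (text, edits used) on a label flip."""
    src, tgt = original.split(), paraphrase.split()
    # Word-level diff: each non-"equal" opcode is treated as one small change.
    opcodes = SequenceMatcher(a=src, b=tgt, autojunk=False).get_opcodes()
    n_changes = sum(tag != "equal" for tag, *_ in opcodes)
    original_label = victim(original)

    for k in range(1, n_changes + 1):
        out, applied = [], 0
        for tag, i1, i2, j1, j2 in opcodes:
            if tag != "equal" and applied < k:
                out.extend(tgt[j1:j2])  # take the paraphrased span
                applied += 1
            else:
                out.extend(src[i1:i2])  # keep the original span
        candidate = " ".join(out)
        if victim(candidate) != original_label:
            return candidate, k
    return None  # even the full paraphrase does not change the decision


def toy_victim(text: str) -> int:
    """Toy classifier (assumption): flags any text containing the word 'hoax'."""
    return 1 if "hoax" in text.lower().split() else 0


if __name__ == "__main__":
    print(apply_changes_until_flip(
        "the vaccine is a hoax say experts",
        "this vaccine is a fraud say specialists",
        toy_victim,
    ))
```

Stopping at the first label flip keeps the adversarial example closer to the original than the full paraphrase would be, which is the point of applying the changes incrementally.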

The results of the work performed in the four stages described above are as follows:

1) Assessing the vulnerability of common text classification approaches to generic adversarial attacks.

The experiments with BODEGA revealed that popular text classification approaches are indeed very vulnerable to adversarial examples. Depending on the particular credibility assessment task, between 60% and 90% of text samples can be manipulated in a way that changes the victim classifier's decision on their credibility. However, many of these changes require numerous queries or harm semantic similarity, which points to possible defence techniques. More worryingly, classifiers built by fine-tuning modern large language models are not more robust than their simpler predecessors. Our experiments were the first to systematically analyse this problem.

2) Preparing adversarial attacks specifically tuned for misinformation detection tasks.

The InCrediblAE shared task attracted six teams from universities around the world. Their solutions were more effective than generic approaches both in confusing the victim classifier and in finding small changes to make, establishing a new state of the art for this task. However, manual annotation revealed that solutions that scored low on automatic evaluation can also be highly effective in the eyes of humans, who are unlikely to notice changes to individual characters. Overall, the shared task framework allowed us to expand the reach of the project by introducing these problems to new researchers.

3) Exploring the adaptive generation of adversarial examples using reinforcement learning.

The solution prepared in this work, XARELLO, allowed us to discover new vulnerabilities of text classifiers deployed in content filtering applications. We have shown that if an attacker is allowed to interact with the victim model (adaptation phase), they can use the information gathered to find adversarial examples (attack phase) that are of higher quality and require fewer attempts. For example, in the fact-checking task, where the previously best approach would require querying the victim model 130-150 times to generate one modification, XARELLO finds a successful example in 5-7 queries.
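The effect of an adaptation phase on the number of queries can be illustrated with a deliberately simplified sketch. It is not XARELLO and does not use reinforcement learning: it only counts which substitutions succeeded against the victim during adaptation and tries those first during the attack. The victim, the substitution list and the function names are all hypothetical.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

Substitution = Tuple[str, str]  # (word to replace, replacement)


def attack(
    text: str, victim: Callable[[str], int], ordered: List[Substitution]
) -> Tuple[Optional[Substitution], int]:
    """Try substitutions in order; return the successful one and the queries spent."""
    label = victim(text)  # the original decision, not counted as an attack query
    words = text.split()
    queries = 0
    for old, new in ordered:
        if old not in words:
            continue
        candidate = " ".join(new if w == old else w for w in words)
        queries += 1
        if victim(candidate) != label:
            return (old, new), queries
    return None, queries


def adapt(
    texts: List[str], victim: Callable[[str], int], subs: List[Substitution]
) -> List[Substitution]:
    """Adaptation phase: reorder substitutions by how often they fooled the victim."""
    wins: Counter = Counter()
    for text in texts:
        winner, _ = attack(text, victim, subs)
        if winner is not None:
            wins[winner] += 1
    return sorted(subs, key=lambda s: -wins[s])  # stable sort keeps ties in order


def toy_victim(text: str) -> int:
    """Toy classifier (assumption): flags any text containing the word 'hoax'."""
    return 1 if "hoax" in text.split() else 0


SUBS: List[Substitution] = [
    ("the", "this"), ("is", "was"), ("a", "one"), ("hoax", "fraud"),
]

if __name__ == "__main__":
    adapted = adapt(["the cure is a hoax", "a hoax is spreading"], toy_victim, SUBS)
    print(attack("the election is a hoax", toy_victim, SUBS))     # unadapted order
    print(attack("the election is a hoax", toy_victim, adapted))  # adapted order
```

In this toy setting the unadapted attacker needs four queries and the adapted one needs a single query, which mirrors, in miniature, the kind of reduction reported above.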

4) Using large language models to improve meaning preservation of adversarial examples.

In order to improve meaning preservation, we tested a variety of large generative models and prompts. Moreover, to make the evaluation scenarios more realistic, the attackers were allowed to issue only a limited number of queries, in line with the policies of major social media platforms. The results indicated that our solution (TREPAT) excelled in this setup, especially when the input text is too long to allow an exhaustive search of word replacements.

Taken together, our work has shown various weaknesses of text classification approaches and highlighted possible attack scenarios. We hope this work will aid in building more robust and reliable solutions for managing user-generated content.
