Large-scale pre-trained language models have revolutionized the field of natural language processing (NLP). They hold great promise for many real-life applications such as translation, search, and question answering. Recently, a new generation of language models, so-called Large Language Models (LLMs), has been developed and released in applications like ChatGPT, which reached 100 million active users within two months; by comparison, Google Translate took 78 months to reach that threshold. It is therefore important to gain a deeper understanding of how these models work and, in particular, in which scenarios they might not work as expected.

Machine learning-based models such as LLMs rely heavily on predefined training data to solve particular tasks. In order to generalize well, i.e. to perform on a wide range of unseen data, a large amount of training data is needed. One problem that arises with this training procedure is that the datasets are too large to be curated, and no one knows their contents in detail. Furthermore, decisions made by these models are often difficult to trace back, so they largely function as black boxes, and explaining their decisions is not trivial. Biases such as stereotypes about certain demographics that appear in the training data are thus passed on to the models and influence their decisions. It is therefore of critical importance to thoroughly understand these models in order to prevent them from harming certain demographics, often those who already suffer from implicit biases in society. Language models not only need to be correct, they need to be “right for the right reasons”.
As these models are meant to interact with humans and to base their decisions on human-like reasoning, I argue that understanding them in depth requires investigating how well they align with human behaviour across different demographics. Therefore, I want to investigate:
a) the reasoning behind a decision, i.e. align attention- and gradient-based importance attributions produced by models with human fixation patterns obtained through eye-tracking, and
b) whether task performance aligns differently between models and humans for different demographics and languages,
c) how a model’s decisions can be explained with gradient-based explainability methods to further open the black box and make models more transparent; this way, we can also trace a model’s decision back to the input, which reveals what part of the data is responsible for a certain outcome (a minimal sketch of such an attribution method follows this list). Finally, this research needs
d) to be carried out in a multilingual setting and extended to languages other than English.
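To make c) concrete, the following is a minimal sketch of one gradient-based attribution method (gradient × input) using the Hugging Face `transformers` library, together with a rank correlation against human fixation data as envisaged in a). The model name, example sentence, and fixation values are illustrative placeholders, not the actual experimental setup; in practice the comparison would be computed on gaze-annotated corpora.

```python
# Sketch: gradient x input token attribution and a rank correlation with
# (placeholder) human fixation durations. Model name, sentence, and fixation
# values are illustrative assumptions only.
import torch
from scipy.stats import spearmanr
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

sentence = "The movie was surprisingly good."
inputs = tokenizer(sentence, return_tensors="pt")

# Embed the tokens explicitly so gradients can be taken w.r.t. the embeddings.
embeddings = model.get_input_embeddings()(inputs["input_ids"]).detach()
embeddings.requires_grad_(True)

outputs = model(inputs_embeds=embeddings, attention_mask=inputs["attention_mask"])
predicted_class = outputs.logits.argmax(dim=-1).item()
outputs.logits[0, predicted_class].backward()

# Per-token importance: L2 norm of gradient * embedding.
saliency = (embeddings.grad * embeddings).norm(dim=-1).squeeze(0).detach()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, score in zip(tokens, saliency.tolist()):
    print(f"{token:>15s}  {score:.4f}")

# Placeholder "human fixation durations" (random here); in the proposed work
# these would come from eye-tracking corpora aligned to the same tokens.
human_fixation = torch.rand(len(tokens)).tolist()
rho, p = spearmanr(saliency.tolist(), human_fixation)
print(f"Spearman correlation with human fixations: rho={rho:.2f} (p={p:.3f})")
```

Gradient × input is only one candidate attribution method; attention weights (available via `output_attentions=True`) or integrated gradients could be substituted into the same model-human comparison.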