I performed an eye-tracking experiment in which I presented the picture books to individuals in either a coherent or a shuffled order while measuring their eye movements during free viewing. I then used GPT-2, a large language model, to quantify, from separately obtained narratives, which objects in the images were semantically salient for understanding the narrative. In parallel, I used a state-of-the-art model of visual salience (DeepGaze-II) to assess what was salient from a purely visual standpoint, and then examined how each measure related to participants' gaze behaviour. I found that my new semantic salience model could account for differences in eye movement behaviour between presentation conditions: when the images formed a coherent sequence, participants looked both relatively more often and relatively longer at objects that were semantically salient. Visual salience, by contrast, did not capture these between-condition differences in gaze behaviour. This shows that state-of-the-art models of visual salience lack some explanatory power when accounting for gaze behaviour, especially in more naturalistic, meaningful settings. Language models, in turn, could offer a useful avenue for capturing cognitive influences on visual processing more generally.
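To make the GPT-2 step more concrete, below is a minimal sketch of one way a language model can assign word-level scores to a narrative: computing per-token surprisal (negative log probability) with the Hugging Face transformers library and GPT-2 small. This is an illustrative assumption rather than the exact scoring pipeline used in the study; the function name `token_surprisals`, the example sentence, and the use of surprisal as a proxy for semantic salience are all hypothetical.

```python
# Illustrative sketch only: per-token surprisal under GPT-2 (Hugging Face
# transformers). Not the study's actual semantic-salience pipeline.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def token_surprisals(text):
    """Return (token, surprisal) pairs, where surprisal = -log p(token | context)."""
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits
    # Shift so that each token is scored from the preceding context only.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    target_ids = input_ids[:, 1:]
    surprisal = -log_probs.gather(2, target_ids.unsqueeze(-1)).squeeze(-1)
    tokens = tokenizer.convert_ids_to_tokens(target_ids[0])
    return list(zip(tokens, surprisal[0].tolist()))

# Hypothetical usage: object words with low surprisal are more expected given
# the narrative so far, which is one possible ingredient of a semantic
# salience measure that can then be mapped onto the corresponding objects
# in the images.
for tok, s in token_surprisals("The girl opened the box and found a kitten."):
    print(f"{tok:>12s}  {s:5.2f}")
```

Whatever the precise formulation, the key point is that such word-level scores, derived from narratives collected independently of the eye-tracking session, can be linked back to the depicted objects and compared against fixation counts and dwell times in each presentation condition.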