Periodic Reporting for period 3 - M and M (Generalization in Mind and Machine)
Reporting period: 2020-09-01 to 2022-02-28
The impressive success of deep networks has led to billions of euros of investment by a wide range of companies attempting to solve important societal tasks, ranging from medical diagnosis and language translation to self-driving cars. It is also important to better understand how these models work, and whether they are informing us about how the human brain works. A better understanding of the relationship between these models and humans not only holds the promise of improving the models, but also of advancing our knowledge in psychology and neuroscience, all important issues for society.
The overall objectives are:
1) Compare the performance of networks and humans in a range of cognitive domains in order to assess whether these models solve problems in a human-like way.
2) Compare the internal representations that networks and humans use to solve various tasks.
3) Carry out behavioural studies that carefully assess human generalization in the domains of vision and language, and compare how well neural networks account for human performance.
4) When models fail to capture human performance, add various cognitive and biological constraints to the models to make their performance more human-like.
5) By adding cognitive and biological constraints to models we hope to improve their performance. This will not only be relevant to understanding humans, but also useful for engineers and computer scientists who are concerned only with network performance (regardless of whether the models are similar to humans).
6) Develop new benchmark tests ("mind score") to assess how well models explain the psychology of cognition. These will be made available to other research teams so they can easily assess the psychological plausibility of their models.
In the domain of vision, we have carried out a number of projects, including:
a) We have shown that the human visual system supports much greater “on-line” translation invariance than previously claimed. That is, once a human can identify a novel object projected to one retinal location, the person can identify this object at a wide range of retinal locations. We have also shown that standard deep convolutional networks fail to support human-like translation invariance unless trained on the relevant transformations. We have shown that DNNs can support translation, scale, rotation in the picture plane, and rotation in the depth plane following appropriate training (Biscione & Bowers, 2021, 2022); a sketch of this kind of translation probe appears after this list.
b) We have shown that deep neural network models often recognize objects in ways very different from humans. In one series of studies we showed that models will readily identify objects on the basis of a single diagnostic pixel rather than relying on the shape of an object (Malhotra et al., 2020); the kind of stimuli involved is sketched after this list. In a follow-up study we directly compared the features that humans and DNNs rely on when learning to classify novel objects that can be classified based on shape or a range of non-shape features. Again, the models did not perform like humans (Malhotra et al., in press).
c) We have shown that deep neural networks of vision do not encode objects by their parts and the relations between those parts. This again contrasts with humans, who explicitly encode object parts and their relations (Malhotra et al., 2022).
d) When we add a biological constraint to convolutional neural networks (adding ‘edge detectors’ much like simple cells in visual cortex; see the sketch after this list), they behave in a more human-like way and no longer pick up on single pixels to identify objects. The models appear to rely more on shape, given that they are better at identifying line drawings and silhouettes that contain only shape information. However, they still fail to show a shape bias in many contexts (Evans et al., 2022; Malhotra et al., 2020).
e) We have shown that the adversarial images that fool networks do not fool humans in a similar way. This challenges the findings of a recent paper published in Nature Communications (Dujmović et al., 2020).
f) We have found that deep networks have super-human capacities to identify unstructured data (e.g. they can learn to categorize patterns that look like TV static to humans), a finding that also highlights the disconnect between human and machine vision. We have added some biological constraints to standard networks (e.g. adding noise to the activations of hidden layers, adding resource bottlenecks; sketched after this list) to reduce the models’ capacity to learn unstructured data while still identifying structured data such as photographs (Tsvetkov et al., 2022).
g) We have shown that the internal representations in deep networks of object identification are less similar to those of humans than is commonly claimed based on Representational Similarity Analysis (RSA; sketched after this list). Indeed, we show conditions in which networks designed to classify objects in a qualitatively different way than humans nevertheless show high RSA scores with brain activations (Dujmović et al., 2020).
h) We have found that DNNs do not support a range of Gestalt organizational principles that are central to human perception and object recognition (Biscione & Bowers, 2022).
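The following is a minimal sketch of the kind of translation probe described in (a), assuming a trained PyTorch classifier. The toy model, image size, and shift values are illustrative assumptions, not the networks or stimuli used in the project.

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained DNN classifier (hypothetical, untrained here).
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)
model.eval()

image = torch.zeros(1, 1, 64, 64)
image[..., 28:36, 28:36] = 1.0  # a novel "object" at the trained location

with torch.no_grad():
    base_pred = model(image).argmax(dim=1)
    for dx in (-16, -8, 8, 16):  # test untrained retinal locations
        shifted = torch.roll(image, shifts=dx, dims=-1)
        pred = model(shifted).argmax(dim=1)
        print(f"shift {dx:+d}px: same response as trained location? "
              f"{bool((pred == base_pred).item())}")
```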
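A hypothetical sketch of the kind of stimuli behind the single-pixel result in (b): each class has a consistent shape, but one pixel's value is perfectly diagnostic of the class label, so a shortcut-learning model can ignore shape entirely. Sizes, positions, and values are illustrative assumptions, not the project's exact stimuli.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_stimulus(label, size=32):
    img = rng.normal(0.0, 0.05, (size, size))  # background noise
    if label == 0:
        img[8:24, 14:18] += 1.0   # class-0 shape: vertical bar
    else:
        img[14:18, 8:24] += 1.0   # class-1 shape: horizontal bar
    img[0, label] = 5.0           # single pixel perfectly diagnostic of class
    return img

dataset = [(make_stimulus(y), y) for y in rng.integers(0, 2, 100)]
```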
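A minimal sketch of the biological constraint described in (d): a fixed bank of Gabor ‘edge detector’ filters serving as the first convolutional layer, with the rest of the network trained on top as usual. Filter sizes and parameter values are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

def gabor_kernel(theta, size=11, sigma=2.5, wavelength=5.0):
    # Oriented Gabor filter, loosely analogous to a simple-cell receptive field.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return (np.exp(-(xr**2 + yr**2) / (2 * sigma**2))
            * np.cos(2 * np.pi * xr / wavelength))

thetas = np.linspace(0, np.pi, 8, endpoint=False)            # 8 orientations
bank = np.stack([gabor_kernel(t) for t in thetas])[:, None]  # (8, 1, 11, 11)

gabor_layer = nn.Conv2d(1, 8, kernel_size=11, padding=5, bias=False)
gabor_layer.weight.data = torch.tensor(bank, dtype=torch.float32)
gabor_layer.weight.requires_grad = False  # fixed front end, not learned

# The remaining layers of the network are trained as usual on top of this.
```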
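The two constraints mentioned in (f) can be illustrated as a small PyTorch module with Gaussian noise on hidden activations and a narrow bottleneck layer; the layer widths and noise level are illustrative assumptions, not the project's models.

```python
import torch
import torch.nn as nn

class NoisyBottleneckNet(nn.Module):
    def __init__(self, in_dim=784, hidden=256, bottleneck=16,
                 n_classes=10, noise_sd=0.5):
        super().__init__()
        self.noise_sd = noise_sd
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.bottleneck = nn.Linear(hidden, bottleneck)  # resource limit
        self.head = nn.Linear(bottleneck, n_classes)

    def forward(self, x):
        h = self.encoder(x)
        if self.training:  # noisy activations during learning
            h = h + self.noise_sd * torch.randn_like(h)
        return self.head(torch.relu(self.bottleneck(h)))
```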
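For (g), this is a minimal sketch of how an RSA score is computed: build a representational dissimilarity matrix (RDM) for each system over the same stimuli, then rank-correlate the RDMs' upper triangles. The random arrays below stand in for network activations and brain recordings.

```python
import numpy as np
from scipy.stats import spearmanr

def rdm(activations):
    """Pairwise (1 - Pearson r) dissimilarities across stimuli (rows)."""
    return 1.0 - np.corrcoef(activations)

def rsa_score(rdm_a, rdm_b):
    iu = np.triu_indices_from(rdm_a, k=1)  # compare upper triangles only
    rho, _ = spearmanr(rdm_a[iu], rdm_b[iu])
    return rho

net_acts = np.random.randn(20, 100)    # 20 stimuli x 100 model units
brain_acts = np.random.randn(20, 500)  # 20 stimuli x 500 voxels
print(rsa_score(rdm(net_acts), rdm(brain_acts)))
```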
In the domain of language and reasoning, we have carried out a number of projects, including:
a) We have developed a new empirical method to assess how well models of word naming generalize to novel words. This method will make it much easier to carry out such studies and to evaluate computational models in the future (Gubian et al., in press).
b) We have shown that recurrent networks do not capture some fundamental syntactic generalizations that humans do, including the well-known ‘Principle C’ constraint (Mitchell & Bowers, 2019), and that they can learn unstructured languages that are unlike any human language and that would be difficult if not impossible for humans to learn (Mitchell & Bowers, 2020).
c) We have shown that standard DNNs that have been claimed to support same/different visual reasoning in fact fail when tested appropriately (Puebla & Bowers, 2021), and that even models designed to solve relational reasoning fail (Puebla & Bowers, in preparation).
d) We have developed new network architectures that support more widespread generalization than standard models, including solving tasks that Gary Marcus claims require symbols. In these models we used convolutions in a novel way (Evans & Bowers, 2021).
e) We have introduced a new way of representing symbols in conventional network architectures (so-called VARS representations) that allows networks to better support combinatorial generalization in a simple memory task and in a visual reasoning task (Vankov & Bowers, 2019).
f) We have found that networks that learn disentangled representations continue to fail on combinatorial generalization tasks (Montero et al., 2021, 2022); the kind of held-out-combination test involved is sketched at the end of this list.
g) We have found that models of spoken word identification show some similarities to human speech perception, but they also differ in fundamental ways, suggesting that current models are missing key features of the human system (Adolfi et al., 2022).
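The combinatorial-generalization test referred to in (f) can be illustrated with a held-out-combination split: the model sees every value of every generative factor during training, but one combination of values is withheld and used only at test. The factor names and values below are illustrative assumptions, not the project's datasets.

```python
import itertools

shapes = ["square", "circle", "triangle"]
colors = ["red", "green", "blue"]
held_out = {("square", "red")}        # combination never seen in training

all_combos = set(itertools.product(shapes, colors))
train_combos = all_combos - held_out  # every factor value appears here...
test_combos = held_out                # ...but must be recombined to pass

print(sorted(train_combos))
print(sorted(test_combos))
```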