Dynamic Network Reconstruction of Human Perceptual and Reward Learning via Multimodal Data Fusion

Periodic Reporting for period 2 - DyNeRfusion (Dynamic Network Reconstruction of Human Perceptual and Reward Learning via Multimodal Data Fusion)

Reporting period: 2022-12-01 to 2024-05-31

The overarching aim of this work is to understand the extent to which perceptual and reward (or value-based) learning and decision-making share common computational and neurobiological principles (and thus whether the human brain operates on domain-general mechanisms to efficiently facilitate adaptive actions). More specifically, we formulated a framework under which the influence of learning on both the perceptual and reward systems could be understood in a common “reward maximization” framework, whereby external rewards (or punishments) and internal self-belief or motivational signals (e.g. metacognitive signals such as confidence) guide future actions and adaptive behavior. We set out to test this framework during two (temporally) different phases of the decision process, (a) the choice phase, where decision alternatives are evaluated and compared to guide action and (b) the outcome phase, where expected outcomes and expected risk signals associated are computed and used to shape future choices. We planned to investigate these two phases across different domains (i.e. perceptual vs value learning) as well as in different contexts (i.e. reward vs punishment based learning).

Another critical component of this ERC work is the fusion of EEG and high resolution fMRI to offer a comprehensive spatiotemporal characterization of the neural system under consideration. Machine learning plays a key role in how EEG and fMRI data are being fused. In recent years, we have been at the forefront of these efforts, and we continue to develop the necessary methodology to fuse EEG and fMRI data during this grant. Specifically, the novelty of our approach relies on extracting reliable endogenous neuronal variability and exploiting trial-to-trial fluctuations in EEG-derived temporal components, to tease apart cascades of relevant neural processes in the fMRI. This new approach allows us to perform dynamic network reconstruction in the human brain, ultimately empowering a level of neuronal understanding that extends beyond what could be inferred with either modality alone. As technology improves and methods for fusion become more sophisticated, the future of EEG/fMRI for non-invasive measurement of brain dynamics looks bright and includes, among other things, meso-scale mapping at ultra-high MR fields, targeted perturbation-based neuroimaging, and the use of more sophisticated methods (e.g. deep learning) to uncover non-linear representations that link the electrophysiological and hemodynamic measurements.

While this grant is focused mainly on basic discovery science, we have already provided evidence of the importance of using bio(neural) markers related to decision making and learning to enable exciting new avenues for diagnostic as well as prognostic stratification of mental health patient groups (e.g. depression, psychosis etc). Our findings offer a paradigm shift in Psychiatry and enable a potentially significant and far-reaching societal impact on the way patients are triaged and treated for mental health problems. Our approach has the potential to predict the efficacy of different treatment protocols prior to the deployment of treatment via the use of predictive imaging biomarkers. Critically, these biomarkers offer additional predictive power, over and above what could be inferred by standard clinical assessment tools alone. As such, our approach can optimise both the outcome and time-to-outcome, while at the same time minimize the use of valuable resources from the national health services. Finally, since our approach allows us to model the mechanisms behind treatment response, it can lead to the identification of novel target sites for future drug development.

In this grant, we have been investigating the role of different brainstem pathways during learning in reward versus punishment contexts. We showed that there are distinct pathways involved in encoding outcome value in each context. Specifically, during reward learning the Substantia Nigra (SN)/Ventral Tegmental Area (VTA) dopaminergic complex and the noradrenergic Locus Coeruleus (LC) are initially activated following negative outcomes, while the VTA subsequently re-engages exhibiting greater responses for positive than negative outcomes, consistent with our previously reported early arousal response and a later value-updating process, respectively. In contrast, during punishment learning, we show that distinct serotonergic Raphe Nuclei (RN) and dopaminergic SN subregions are activated only by negative outcomes with a sustained post-outcome activity across the outcome period, supporting the involvement of these brainstem subregions in avoidance behavior. While we identified these distinct subcortical pathways, we also showed that these pathways converge onto largely overlapping cortical outcome valence signals for both reward and punishment learning. These findings are consistent with context-general higher order cortical representations involved in updating beliefs and guiding future actions, in line with a common-currency account of human decision making and learning. Critically, these insights are enabled by our EEG-informed fMRI fusion approach, which assigned temporal order to the relevant pathways and further revealed latent brainstem activations not observed with stand-alone fMRI.

Inspired by these findings we also investigated how the early motivational salience signals – which emerge in both punishment and reward learning contexts – could differentiate interindividual learning propensities across the two contexts. Specifically, we showed that activity of a post-feedback cortical signature of salience is highly separable across the two contexts independently for positive and negative outcomes and that the degree of separability scaled with interindividual differences in learning accuracy. Moreover, the phasic pupil responses to feedback (proxy for brainstem noradrenergic LC activity) were significantly amplified in the punishing context compared to the rewarding context, the magnitude of which also predicted performance differences, with a significant mediation effect of the downstream cortical signal amplitude on this relationship. These results could point to a general salience mechanism, compatible with the main theories of reward and punishment encoding, that forms a crucial initial stage of reinforcement learning in the brain.

In the work highlighted above we focused primarily on outcome (feedback-locked) neural representations that are involved in reinforcement learning. At the same time, we are also interested in the way decisions are implemented during learning. In other words, we want to characterize the neurocomputational principles driving the decision itself, during the period in which participants make a choice. Perceptual decisions (based on ambiguous sensory information) are typically characterized both computationally and experimentally in terms of an integrative mechanism whereby information supporting different decision alternatives accumulates over time until an internal decision boundary is reached. Here we aimed to test the domain-generality of this accumulation-to-bound mechanism by testing whether it is also at play for value (preference-based) choices that involve integration of both socially relevant evidence as well as non-social (i.e. purely probabilistic) cues.

We offered support that, computationally, this mechanism provides a reliable account of value-based choices and showed that a region of the posterior-medial frontal cortex (pMFC) uniquely explains trial-wise dynamics in the process of evidence accumulation in both social and nonsocial contexts. We further demonstrated a task-dependent coupling between the pMFC and regions of the human valuation system in dorsomedial and ventro-medial prefrontal cortex across both contexts. Finally, we revealed domain-specific representations in regions known to encode the early decision evidence for each context. Critically, these results are suggestive of a domain-general decision-making architecture, whereupon domain-specific information is likely converted into a “common currency” in medial prefrontal cortex and accumulated for the decision in the pMFC.

A separate major objective of this grant is to understand the extent to which perceptual (or sensorimotor) learning could also be realized within a reinforcement learning framework. More specifically, whether the influence of learning on both the perceptual and reward systems could be understood in a common “reward maximization” framework, whereby explicit rewards (or punishments) and internal self-belief or motivational signals (e.g. metacognitive signals such as confidence) guide future actions and adaptive behavior. The influence of explicit feedback on learning has been reliably characterized in a reinforcement learning framework, which traditionally described how one learns to select actions that maximise future external rewards in value-based learning environments.

In this context, reinforcement-like learning can occur based on explicit feedback information without external rewards or purely based on the observer’s internal estimate of confidence as a proxy for feedback. This suggests that the computational description of learning behavior could be generalised to incorporate different feedback signals. However, it is unclear whether the same neural processes used for explicit feedback are flexibly appropriated to implement learning from implicit signals, such as confidence. To date, there has been no direct comparison of the neural mechanisms for learning from confidence to those of learning from explicit feedback within the same experimental task. To address this question, we examined how humans might learn from this implicit feedback in direct comparison with that of explicit feedback, using simultaneous EEG-fMRI.

At the behavioral level we observed no-feedback trials to be modulated by confidence in a similar manner to explicit feedback trials, and a reinforcement learning computational modelling analysis suggested learning integrated confidence even on explicit feedback trials. In the neural data, distinct implicit/explicit sources of value information modulated striatal responses along a dorsal-ventral spatial gradient. We saw evidence that implicit and explicit striatal value signals were integrated in the external globus pallidus (GPe), which was significantly modulated by confidence in the decision-window, and by explicit outcome value in the feedback-window. Stronger connectivity between the external globus pallidus and the thalamus, insular and frontal cortex predicted the interaction between response perseveration and implicit/explicit feedback sign, supporting the role of GPe in modulating learning via information flow in the basal ganglia.

The work highlighted above – especially the role of the basal ganglia and the thalamus in orchestrating external and internal learning signals to shape sequential choice behavior – has critical implications on the way we think about decision making more broadly. More specifically, our findings suggest a more active role of motor processes in decision formation, rather than just a passive involvement in merely externalising the decision by controlling the relevant motor effector structures. Similarly, a recent body of other evidence appears to contradict the strict temporal dichotomy between decisional and motor processes and suggests that the (pre)motor system might play a more direct and causal role in decision-making. These recent findings suggest that neural activity related to motor preparation, starts its build-up before the sensory evidence integration completes, effectively lagging the primary process of evidence accumulation. This would allow the amount of sensory evidence to have a direct impact on motor planning, which is not accounted for in traditional evidence accumulation models that do not explicitly capture the entanglement of sensory evidence accumulation with motor preparation.

Here, we propose a novel alternative computational framework which models this entanglement by introducing a secondary, motor-related, leaky integration process that receives the integrated evidence of the primary decision process as a continuous input, and triggers the actual response when it reaches its own threshold. In other words, the primary evidence accumulator relinquishes control of the eventual choice (and hence the strict requirement of an evidence-independent decision boundary) by passing the integrated evidence along to the motor system. The motor leak adapts the ‘memory’ with which the secondary accumulator re-integrates the primary accumulated sensory evidence, thus adjusting the temporal smoothing in the motor evidence and correspondingly the lag between the primary and motor accumulators. We showed that this alternative theoretical account offers a far superior fit to behavior than conventional single-integration models. We also used electrophysiology to derive neural response profiles predicted that are consistent with the model, thereby offering neurobiological validation to our proposed framework.

Learning to seek rewards and avoid punishments, based on positive and negative choice outcomes, is essential for human survival. Yet, the neural underpinnings of outcome valence in the human brainstem and the extent to which they differ in reward and punishment learning contexts remain largely elusive. In this work, we have employed state-of-the-art multimodal brain imaging to identify outcome (feedback) learning representations in distinct brainstem pathways associated with reward and punishments, that subsequently converge onto common cortical representations to update future beliefs and drive learning.

We also developed a new double-integration framework to study the decision process itself and in doing so we opened up a host of new possibilities for modelling more advanced features of decision behavior, such as adaptation to dynamic changes in the incoming evidence, change-of-mind decisions, post-decision confidence reporting, etc. Within this framework we also offered novel evidence consistent with the ubiquitous use of confidence as implicit teaching signal for driving perceptual learning within a reinforcement-guided framework, similar to the one used for value-based learning. Our analyses offer previously unknown insights in perceptual learning, whereby the value of implicit feedback is encoded in a distinct manner to that of explicit feedback. This highlights that even when external feedback is available, metacognitive estimates of confidence could provide us with additional nuanced information to update our internal processes for improving behaviour.

Moving ahead we plan to tackle an influential theoretical framework for reinforcement-guided learning posits that dopaminergic midbrain neurons form an aggregate representation of spatially and temporally overlapping signals of the physical salience of a reward predicting cue and the actual reward associated with that cue. Due to this spatiotemporal overlap the two individual signals are difficult to decouple, even at the level of single-neuron recordings. In this work, we are using a novel imaging protocol with simultaneous EEG and ultra-high resolution fMRI to offer neurobiological validation for this framework in humans. Moreover, we are investigating how the relevant pathways are affected during rewarding and punishing contexts as we have previously identified behavioral asymmetries in choice accuracy across the two contexts (i.e. accuracy higher for reward than punishment contexts). This asymmetry raises the possibility that there exist interesting interactions across contexts, and we will be aiming to offer a formal account of the neural mechanism(s) driving these interactions.

During reinforcement-guided learning, our expectations are thought to be updated every time we encounter a new experience, based on the relative discrepancy between experienced and expected rewards. This discrepancy is formally referred to as a reward prediction error signal (RePEs), the neural correlates of which have been studied extensively. Most everyday choices, however, are not easy to weigh against previous experiences as they pose varying levels of uncertainty. As such, the choice is not one between known outcomes but rather based on the tolerable degree of variation around an average reward, also known as “risk”. It follows that learning about and updating our representations of the degree of uncertainty associated with rewards can add a behavioral advantage in novel or volatile environments. Just like errors in estimating expected rewards, there can also be errors in the estimated variation around these rewards or risk prediction errors (RiPEs). Unlike RePEs, the neural correlates of RiPEs are less well understood and we will therefore aim to offer a comprehensive neurobiological account of these signals across different domains (social vs non-social decisions).

In terms of decision dynamics, humans and other animals display remarkable efficiency when interacting with the world around them. One example is in sports, where fast dynamic sensory information is translated to precise movements, often within timeframes of a few hundred milliseconds. Making fast decisions to inform behavior comes with a cost to accuracy, known as the speed-accuracy trade-off. While considerable advances have been made in understanding how we balance this trade-off across different decisions, little is known about how we dynamically modulate decision-processes online within individual decisions to flexibly achieve behavioral efficiency. In this work we propose that achieving this behavioral efficiency involves the online coordination of perceptual, cognitive, motor, and critically, metacognitive processes. A large focus of the neuroscience of decision-making has been on the process of accumulating evidence for the decision, while motor processes for externalising decisions, and post-decision appraisal (evaluations of confidence) are often considered tangential to the decision.

Here, we will test a framework in which motor systems take an active role in guiding decisions, while feelings of confidence manifest as a central control mechanism for moderating behavioral efficiency, that is, maximising precision given constraints of time and effort. This work will promote several shifts in current paradigms of decision-making and learning, most notably, the active involvement of both metacognitive and motor systems in these processes. Critically this work also moves beyond the study of individual processes to examine how they operate collectively as a system. Rather than the precision of individual processes, we argue that it is the coordination of the integrative system of perceptual, inference, metacognitive and motor processes that optimises behavior. In this way, our work seeks to bridge several fields in the study of human behavior that have traditionally been studied in isolation.

Dynamic network reconstruction in reward and punishment learning

Periodic Reporting for period 2 - DyNeRfusion (Dynamic Network Reconstruction of Human Perceptual and Reward Learning via Multimodal Data Fusion)

Share this page Share this page on social networks

Download Download the content of the page