We designed a deterministic, two-option reversal learning task in which head-restrained mice learn a reinforced, self-initiated lever-pressing preference without any external cue indicating the rewarded action. The task requires mice to actively engage both forepaws in making choices, with each decision involving a specific action: a press with either the right or the left forepaw. Unlike recent studies of M2 in which animals lick or press a single lever to make a choice, mice in our task could select L1, L2, or both simultaneously (L1&L2). Notably, the frequency of simultaneous presses on both levers increased when the mice explored both levers equally. Yet a key aspect of our task is that pressing L1&L2 never results in a reward and, according to basic reinforcement learning (RL) principles, should not be reinforced. To account for this, we developed a computational "race" model between actions, in which the delay in executing each individual action is linked to its value. Our model accurately captured the animals' lever-pressing strategy and reward rate, and provided access to the lever-pressing values Q according to Rescorla-Wagner-type equations. The prediction of the observed delays without fitting them, together with the observation that the delay scales inversely with the sum of the lever-pressing values, supports the robustness of our model and extends its validity beyond a simple regression capability. Using two-photon calcium imaging of M2 neuronal population activity combined with behavioral modeling and optogenetics, we showed that M2 encodes information about decision-values through persistent population activity, which could be used as a signal to dictate the probability of taking each action. By recording the same neurons throughout the learning process, from naïve to expert stages, we observed that persistent coding evolves gradually from trial to trial, reflecting how the decision-value is updated after each action-outcome pair.
This, in turn, determines the rate at which learning occurs and is reversed when the reward contingency changes unexpectedly. These results highlight the use of decision-values by M2 to adapt choice during initial learning without instructive cues.
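To make the logic of a value-dependent race concrete, the following is a minimal sketch and not the fitted model. It assumes that each press latency is drawn from an exponential distribution whose rate equals the lever value Q, that presses closer together than a hypothetical coincidence `window` count as L1&L2, and that every pressed lever is updated with a Rescorla-Wagner-type rule using a hypothetical learning rate `alpha`. The function names (`rw_update`, `race_trial`, `simulate`) and all parameter values are illustrative assumptions.

```python
import random


def rw_update(q, reward, alpha):
    """Rescorla-Wagner-type update: move q toward the obtained reward."""
    return q + alpha * (reward - q)


def race_trial(q1, q2, rng, window=0.05, rate_floor=1e-3):
    """Simulate one trial of a race between the two presses.

    Assumption: each latency is exponential with rate equal to the lever
    value (floored to avoid a zero rate). Returns (choice, delay), where
    delay is the latency of the first press.
    """
    t1 = rng.expovariate(max(q1, rate_floor))
    t2 = rng.expovariate(max(q2, rate_floor))
    delay = min(t1, t2)
    if abs(t1 - t2) < window:  # near-simultaneous presses count as L1&L2
        return "L1&L2", delay
    return ("L1" if t1 < t2 else "L2"), delay


def simulate(n_trials=400, rewarded="L1", reversal_at=200,
             alpha=0.1, q0=0.5, seed=0):
    """Deterministic two-option task with one unsignaled reversal."""
    rng = random.Random(seed)
    q = {"L1": q0, "L2": q0}
    history = []
    for t in range(n_trials):
        if t == reversal_at:  # reward contingency switches without a cue
            rewarded = "L2" if rewarded == "L1" else "L1"
        choice, delay = race_trial(q["L1"], q["L2"], rng)
        # Only a single press on the rewarded lever pays; L1&L2 never does.
        reward = 1.0 if choice == rewarded else 0.0
        pressed = ("L1", "L2") if choice == "L1&L2" else (choice,)
        for lever in pressed:  # assumption: every pressed lever is updated
            q[lever] = rw_update(q[lever], reward, alpha)
        history.append((choice, delay, reward, q["L1"], q["L2"]))
    return history


if __name__ == "__main__":
    hist = simulate()
    last = hist[-50:]
    print("reward rate, last 50 trials:", sum(h[2] for h in last) / len(last))
```

Under the exponential assumption used here, the expected delay of the first press is 1/(Q1 + Q2), so the sketch reproduces an inverse scaling of delay with the summed values by construction. For a fixed sum of values, near-simultaneous presses are also most likely when Q1 and Q2 are similar, even though L1&L2 is never rewarded. Both properties follow from the assumed latency distribution rather than from the data.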
How decision-value is converted into a binary motor command remains an open and critical question. It is possible that this conversion arises through long-range loops between multiple brain regions such as the thalamus, midbrain, cerebellum, and basal ganglia, as in memory-guided licking tasks in mice. Here we explored whether the basolateral amygdala (BLA), traditionally recognized for its role in associative fear learning, also contributes to self-initiated, incentive-motivated behaviors. The rules by which the BLA contributes to learning to initiate initially neutral actions for a positive outcome remain unclear. In particular, although the mouse secondary motor cortex (M2), a key region involved in spontaneous action initiation, is a major target of BLA glutamatergic outputs, it is unknown whether and how BLA-to-M2 communication participates in the self-initiation of incentive-motivated actions. To address these questions, we trained head-fixed mice to press two initially neutral levers to obtain a water reward. Suppression of BLA-to-M2 synaptic signals by tetanus toxin expression revealed that they are key to rapid behavioral learning. Consistent with this, we used two-photon microscopy of BLA axonal boutons in M2 during the task to characterize how this synaptic communication scales with learning speed. Imaging experiments also revealed functional assemblies of boutons activated at distinct steps of the behavior, suggesting well-separated roles: 1) controlling press initiation, 2) discriminative reporting of lever pressing, and 3) reporting licking. Longitudinal imaging of the same axons revealed that single-bouton activity was stable for more than two weeks.
Finally, when we devalued the preferred lever, animals learnt to reverse their lever preference, and the level of preparatory activity for a press scaled with the preference for the chosen lever. This suggests that BLA-to-M2 communication participates in value-based action selection in addition to the initial incentive-motivated behavioral learning.