Part Of: Neuroeconomics sequence
Content Summary: 8min reading time, 800 words
Reward Prediction Error
An efficient way to learn about the world and its effect on the organism is to use a reward prediction error (RPE) signal, defined as:

δ(t) = r(t) + γ·V(s(t+1)) − V(s(t))

where r(t) is the reward received at time t, V(s) is the predicted value of state s, and γ is a discount factor.
The RPE is derived from the Bellman equation, and captures changes in valuation across time. It is thus an error term, a measure of surprise; such surprise signals are the lifeblood of learning processes.
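As a concrete sketch (the function and variable names below are my own illustration, not from the text), the RPE is the gap between what happens and what was predicted:

```python
# Minimal sketch of the reward prediction error (temporal-difference error).
# Names and numeric values are illustrative.

def rpe(reward, v_current, v_next, gamma=0.9):
    """RPE = received reward + discounted future value - current prediction."""
    return reward + gamma * v_next - v_current

# A fully predicted reward generates no surprise...
print(rpe(reward=1.0, v_current=1.0, v_next=0.0))  # → 0.0
# ...an unexpected reward generates a positive RPE...
print(rpe(reward=1.0, v_current=0.0, v_next=0.0))  # → 1.0
# ...and an omitted reward generates a negative RPE.
print(rpe(reward=0.0, v_current=1.0, v_next=0.0))  # → -1.0
```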
Phasic dopamine bursts are the vehicle for the RPE signal.
During behavioral conditioning, an animal learns that a cue is predictive of reward. In such a learning environment, we can see the RPE migrate earlier in time, trial by trial, moving back from reward delivery until it aligns with cue onset.
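This migration can be reproduced with a toy TD(0) simulation (the setup is my own illustration: five timesteps per trial, cue at t = 0, reward at t = 4). Early in training, the RPE fires at reward delivery; as value estimates build backward toward the cue, the surprise at the reward step vanishes:

```python
# TD(0) toy simulation of conditioning. All parameters are illustrative.
T = 5                   # timesteps per trial: cue at t = 0, reward at t = 4
gamma, alpha = 1.0, 0.2
V = [0.0] * (T + 1)     # value estimate per timestep; V[T] is post-reward

def run_trial():
    """Run one conditioning trial; return the RPE at each timestep."""
    deltas = []
    for t in range(T):
        r = 1.0 if t == T - 1 else 0.0       # reward only at trial end
        delta = r + gamma * V[t + 1] - V[t]  # TD error (the RPE)
        V[t] += alpha * delta
        deltas.append(delta)
    return deltas

history = [run_trial() for _ in range(200)]
print("trial   1 RPEs:", [round(d, 2) for d in history[0]])   # [0.0, 0.0, 0.0, 0.0, 1.0]
print("trial   2 RPEs:", [round(d, 2) for d in history[1]])   # [0.0, 0.0, 0.0, 0.2, 0.8]
print("trial 200 RPEs:", [round(d, 2) for d in history[-1]])  # all near 0: reward predicted
```

Note that this toy model omits the surprise of the cue's own arrival, which is what produces the dopamine burst at cue onset after learning; it shows only how the error propagates backward from the reward.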
Actors and Critics
The RPE signal is used to update the following structures:
- A policy 𝝅 which maps states to actions, S → A.
- A value function V(s) which captures expected future reward, given the current state.
These functions can be learned separately. We call the process that updates the policy the actor, and the process that updates the value function the critic.
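A minimal actor-critic sketch (my own illustration, not a claim about any particular implementation): both structures are trained by the same RPE, with the critic adjusting its value estimate and the actor adjusting its action preferences.

```python
import math
import random

# Critic: a value table V(s). Actor: action preferences H(s, a), turned
# into a policy pi(a|s) by softmax. Both learn from the same RPE.
states, actions = range(2), range(2)
V = [0.0 for _ in states]
H = [[0.0 for _ in actions] for _ in states]

def policy(s):
    """Softmax over preferences: pi(a|s)."""
    z = [math.exp(h) for h in H[s]]
    return [x / sum(z) for x in z]

def update(s, a, r, s_next, gamma=0.9, alpha=0.1):
    """One experience drives both critic and actor."""
    delta = r + gamma * V[s_next] - V[s]  # the RPE
    V[s] += alpha * delta                 # critic: refine value estimate
    H[s][a] += alpha * delta              # actor: shift preference for a

# Toy environment (illustrative): in state 0, action 1 is rewarded.
random.seed(0)
for _ in range(500):
    a = 1 if random.random() < policy(0)[1] else 0
    update(s=0, a=a, r=float(a == 1), s_next=1)

print(policy(0))  # preference has shifted strongly toward action 1
```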
In fact, actors come in two flavors:
- Model-based actors, which build models of how the world works (specifically, of the reward function R and the transition function T).
- Model-free actors, which compute the policy directly, without relying on declarative knowledge.
Model-based approaches to reinforcement learning are outcome-directed, and encode Action-Outcome (AO) Learning. In contrast, model-free approaches correspond to psychological notions of habit, and behaviorist notions of Stimulus-Response (SR) Learning.
If an animal is using an AO Actor, then when it sees the reward being moved, it immediately updates its model and moves toward the new location. In contrast, an SR Actor learns much more slowly, requiring several failed attempts at the old location before its cached values update. Animals show evidence of both behaviors.
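Here is an illustrative sketch of that contrast (the environment and numbers are mine, chosen only to make the difference visible): when the reward moves, the model-based (AO) actor updates its reward model and replans in one step, while the model-free (SR) actor must unlearn its cached values through repeated failure.

```python
import random

# Two options; the reward has just moved from option 0 to option 1.
true_reward = [0.0, 1.0]

# AO (model-based) actor: holds an explicit reward model, so observing
# the change lets it replan immediately.
reward_model = list(true_reward)  # model revised upon observation
ao_choice = reward_model.index(max(reward_model))
print("AO actor switches immediately to option", ao_choice)  # → 1

# SR (model-free) actor: cached action values learned under the old
# reward location, updated only through direct experience.
q = [1.0, 0.0]
alpha = 0.2
random.seed(1)
trials = 0
while q[0] >= q[1] and trials < 10_000:
    # epsilon-greedy: mostly the old habit, with occasional exploration
    choice = random.randrange(2) if random.random() < 0.1 else q.index(max(q))
    q[choice] += alpha * (true_reward[choice] - q[choice])
    trials += 1
print("SR actor needed", trials, "trials to switch")
```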
The above structures are directly implemented in the three loops of the basal ganglia. Specifically, the AO Actor, SR Actor, and Critic are identified as the Associative, Sensorimotor, and Limbic loops, respectively.
We might define habit formation as the transfer of decisions once handled by the AO Actor to the SR Actor. Correspondingly, when brains learn a habit, we see neural activity transition from the Associative to the Sensorimotor loop.
Wanting and Liking
But there is more to reward than learning. Reward also relates to two other processes: wanting (motivation) and liking (hedonics).
Wanting can be measured by response rate. Strong evidence identifies response vigor (incentive salience) with tonic dopamine levels within the basal ganglia Limbic Loop (VTA to NAc). High tonic dopamine is associated with subjective feelings of enthusiasm, whereas low levels induce apathy. Pathologically high levels of tonic DA are expressed in schizophrenic delirium, pathologically low levels in Parkinson’s disease (disinterest in movement, thought, etc).
Wanting is the substrate of arousal, or motivation; its purpose is to control metabolic expenditure. We can see evidence for this in adjunctive behaviors. A severely hungry rat is highly aroused: if food is out of reach, it will still engage in ritualistic behaviors, such as pacing, gnawing wood, or running excessively. Since the animal is highly aroused and consummatory behavior is impossible, this “energy” spills out into unrelated behaviors.
Pleasure and displeasure reactions can be measured by unique facial expressions. Strong evidence identifies liking systems with opioid neurochemistry, as expressed by hedonic hotspots and coldspots in the nucleus accumbens (NAc). This system produces subjective feelings of pleasure and displeasure. Pathologically high levels of opioids (morphine-like substances) result in mania; pathologically low levels are comorbid with anhedonia.
We can say that opioid signaling collates information about hunger, thirst, pain, etc. into a summary statistic of body state.
Reinforcement learning predicts the existence of three learning structures: an SR Actor that behaves habitually, an AO Actor that behaves in accordance with a model, and a Critic that performs outcome valuation. These three structures are implemented as the three reentrant loops of the basal ganglia.
Besides the directive effects of learning, reward also stimulates wanting (i.e., arousal) and liking (i.e., valence). These three functions are implemented by three distinct neurochemical mechanisms: phasic dopamine, tonic dopamine, and opioids, respectively.
I highly recommend the following papers, which motivate our discussion of reentrant loops and neurochemistry, respectively.
- Maia (2009). Reinforcement learning, conditioning, and the brain: Successes and challenges.
- Berridge et al (2009). Dissecting components of reward: liking, wanting, and learning.
You might also explore the following, for a contrary opinion:
- Bromberg-Martin et al (2010). Dopamine in motivational control: rewarding, aversive, and alerting.