Codes and Communication

Part Of: Information Theory sequence
Content Summary: 1000 words, 10 min read

History of Communication Systems

Arguably, three pillars of modernity are: industrialization, democratic government, and communication technology. Today, we examine the latter.

Before 1860, long-distance communication required travel. This made communication across large nations quite challenging. Consider, for example, the continental United States. In 1841, it took four months for the news of the death of President Harrison to reach Los Angeles.

The Pony Express (a mail service built on horsepower) improved wait times to ten days. But it was the telegraph that changed the game. The key idea was to send messages not on paper, but rather through voltage spikes in electric cables. Electrical pulses travel at near the speed of light.

In 1861, the first transcontinental cable was complete, and instantaneous communication became possible. The Pony Express closed its doors two days later.

It is hard to overstate the impact of this technology. These advances greatly promoted information sharing, economic development, and improved governance.

By 1891, thousands of miles of cable had been laid underwater. These pipelines have only become more numerous and powerful over the years. Without them, the Internet would simply be impossible.

communication-undersea-pipelines

Today, we strive to understand the maths of communication. 

Understanding Communication

We start with the basics.

What is communication? The transmission of linguistic information.  

What is language? A shared system of reference communicated through symbols.

References (e.g., words) are functions that map symbols to aspects of the physical world. References can denote both objects and actions.

Consider the power set of symbols (all possible combinations of letters). Words represent a subset of this object (a family of sets over an alphabet).

Symbol recognition is medium independent. For example, a word can be expressed either through writing (graphemes) or spoken language (phonemes).

communication-language-overview-1

References are the basis of memory. Together, they build representations of the physical world.

All complex nervous systems construct references. Some animals can communicate (share references). Only humans do so robustly, via syntax.

Semantic interpretations are not restricted to biology. Computers can refer as well. Reference is made possible by symbol grounding.

As the substrate of reference, symbols are the basis of computation. All answerable questions can be solved by a Turing machine.

Semantic aspects of communication are irrelevant to the engineering problem. Coding theory studies symbol sets (alphabets) directly.

Comparing Alphabets

How to compare languages? Let’s find out!

There are 26 symbols in the English alphabet. How many possible three-letter words are there? The answer is 26^3 = 17,576 possible words. More generally:

Possible Messages (M) = Alphabet Size (a) ^ Number of Symbols (X)

M = a^X

\log_a(M) = X

Information is the selection of specific words (“red”) from the space of possible words.

We might be tempted to associate information with M. But we desire information to scale linearly with length. Two books should contain twice as much information as one. So we say information is log(M).

I(X, a) = \log_a(M) = X

Alphabet size (the logarithmic base) is not very important in this function. Suppose we choose some other base b instead. We can compare alphabets by converting logarithmic base.

Base Conversion: \log_b(M) = \log_a(M) / \log_a(b)

I(X, b) = \log_b(M) = \log_b(a) \cdot \log_a(M) = \log_b(a) \cdot X

I(X) = KX, where K = \log_b(a)

I(X) is known as Shannon information.

We can compare the expressive power of different alphabets. The modern Hawaiian alphabet, for example, has 13 letters. So there are only 13^3 = 2,197 possible three-letter Hawaiian words. The information provided by these respective languages is:

I(X_{hawaiian}) = \log_{13}(13) \cdot X = X

I(X_{english}) = \log_{13}(26) \cdot X

I(X_{english}) / I(X_{hawaiian}) = \log_{13}(26) = 1.270238

We expect English letters to carry 27% more information than Hawaiian letters, on average. And indeed, this is precisely what we find:

With 3 English letters: 26^3 = 17,576 possible words
With 3.81 Hawaiian letters: 13^(3*1.270238) = 17,576 possible words
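These figures are easy to check directly. Here is a quick Python sketch, using only the alphabet sizes given in the text:

```python
import math

# Per-letter information of English, measured in base-13 (Hawaiian) units
K = math.log(26, 13)             # log_13(26) = 1.270238...
print(K)

# Equal expressive power: 3 English letters vs 3*K Hawaiian letters
print(26 ** 3)                   # 17576 possible words
print(13 ** (3 * K))             # ~17576 possible words
```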

Translating Between Codes

How does one translate between languages? Consider the word “red”. In Hawaiian, this word is “ula’ula”. We might construct the following function:

  • r → ula’
  • e → ul
  • d → a

But this fails to generalize. The Hawaiian word for rice is “laiki”, which does not begin with a ‘u’.

In general, for natural languages, any such function f: A_E → A_H is impossible. Why? Because words (references) map to physical reality in arbitrary ways. Two natural languages are too semantically constrained to afford a simple alphabet-based translation.

communication-translation-and-semantic-constraint

Alphabet-based translations are possible, however, if you use a thin language. Thin languages only refer when converted back into their host language. Binary is a classic example of a thin language. It has the smallest possible alphabet (size two).

communication-thin-languages

An encoding is a function of type f: A_E → A_B, mapping a host alphabet into a thin alphabet such as binary. As an example, consider ASCII. This simple encoding is at the root of most modern technologies (including UTF-8, which you are using to view this webpage):

communication-ascii-example
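To make encoding concrete, here is a minimal Python sketch. It uses the language's built-in ASCII codec to realize a function from English characters into binary strings:

```python
# Encode the word "red" from the English alphabet into binary via ASCII
word = "red"
byte_form = word.encode("ascii")              # b'red' (one byte per symbol)
bits = " ".join(f"{b:08b}" for b in byte_form)
print(bits)                                   # 01110010 01100101 01100100

# Decoding reverses the function, recovering the host-language symbols
print(byte_form.decode("ascii"))              # red
```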

Noise and Discriminability

A communication system has five components: source, transmitter, channel, receiver, and destination.

Source and destination typically share a common system of reference. Imagine two people with the same interpretation of the word “red”, or two computers with the same interpretation of the instruction “lb” (load byte).

Transmitter and receiver also tend to play reciprocal roles. Information is exchanged through the channel (e.g., sound waves, cable).

communication-general-communication-architecture

Receivers reconstruct symbols from the physical medium. Noise causes decoding errors.

How can the transmitter protect the message from error? By maximizing the physical differences between symbols. This is the discriminability principle.

communication-discriminability

This principle explains why binary is employed by computers and telecommunications. A smaller alphabet improves symbol discriminability, which combats the effect of noise.

Takeaways

  • Language is a shared system of reference communicated through symbols 
  • References are functions that map symbols to aspects of the physical world.
  • Symbol recognition is medium independent
  • Alphabet size determines expressive power (how many messages are possible)
  • An encoding lets you alter (often reduce) language’s alphabet.
  • Such encodings are often desirable because they protect messages from noise.

Entropy as Belief Uncertainty

Part Of: Information Theory sequence
Content Summary: 900 words, 9 min read

Motivations

What do probabilities mean?

A frequentist believes that they represent frequencies. P(snow) = 10% means that out of 100 days just like this one, 10 will have snow.

A Bayesian, on the other hand, views probability as degree of belief. P(snow) = 10% means that you believe there is a 10% chance it will snow today.

This subjective approach views reasoning as probability (degree of belief) spread over possibility. On this view, Bayes Theorem provides a complete theory of inference:

bayes-updating-theory-3

From this equation, we see how information updates our belief probabilities. Bayesian updating describes this transition from prior to posterior, P(H) → P(H|E).

As evidence accumulates, one’s “belief distributions” tend to become sharply peaked. Here, we see degree of belief in a hockey goalie’s skill, as we observe him play. (Image credit Greater Than Plus Minus):

bayesian_updating

What does it mean for a distribution to be uncertain? We would like to say that our certainty grows as the distribution sharpens. Unfortunately, probability theory provides no language to quantify this intuition.

This is where information theory comes to the rescue. In 1948 Claude Shannon discovered a unique, unambiguous way to measure probabilistic uncertainty. 

What is this function? And how did he discover it? Let’s find out.  

Desiderata For An Uncertainty Measure

We desire some quantity H(p) which measures the uncertainty of a distribution.  

To derive H, we must specify its desiderata, or what we want it to do. This task may feel daunting. But in fact, very simple conditions already determine H to within a constant factor. 

We require H to meet the following conditions:

  1. Continuous. H(p) is a continuous function.
  2. Monotonic. H(p) for an equiprobable distribution (that is, A(n) = H(1/n, ..., 1/n)) is a monotonic increasing function of n.
  3. Compositionally Invariant. If we reorganize X by bundling individual outcomes into single variables (b: X → W), H is unchanged, H(X) = H(W).

Let’s explore compositional invariance in more detail.

Deriving H

Let us consider some variable X that can assume discrete values (x_1, ..., x_n). Our partial understanding of the processes which determine X is expressed by the probabilities (p_1, ..., p_n). We would like to find some H(p_1, ..., p_n) which measures the uncertainty of this distribution.

Suppose X has three possible outcomes. We can derive W by combining events x_2 and x_3:

entropy-uncertainty-composition-variables

The uncertainty of X must be invariant to such bundling. So we have that:

entropy-uncertainty-composition-example-v1

The right tree has two distributions p(W) and p(X|W). The uncertainty of two distributions is the sum of each individual uncertainty. Thus we add H(⅔, ⅓). But this distribution is reached only ½ of the time, so we multiply by 0.5.

How does composition affect equiprobable distributions A(n)? Consider a new X with 12 possible outcomes, each equally likely to occur. The uncertainty H(X) = A(12), by definition. Suppose we choose to bundle these branches by (3,5,4). Then we have:

entropy-uncertainty-composition-example-v2

But suppose we choose a different bundling function (4,4,4). This simplifies things:

entropy-uncertainty-composition-example-v3-3

For what function A does A(mn) = A(m) + A(n) hold? There is only one solution, as shown in Shannon’s paper:

A(n) = K \log(n)

K varies with logarithmic base (bits, trits, nats, etc). With this solution we can derive a general formula for entropy H.

Recall,

X = (x_1, ..., x_n), with P(X) = (p_1, ..., p_n), where p_i = n_i / \sum_j n_j

A(n) = K \log(n) ← Found by uniform bundling (e.g., 4,4,4)

A(\sum_i n_i) = H(X) + \sum_i p_i A(n_i) ← Found by arbitrary bundling (e.g., 3,5,4)

Hence,

K \log(\sum_j n_j) = H(X) + K \sum_i p_i \log(n_i)

H(X) = K \left[ \sum_i p_i \log(\sum_j n_j) - \sum_i p_i \log(n_i) \right]

H = -K \sum_i p_i \log\left(\frac{n_i}{\sum_j n_j}\right)

We have arrived at our definition of uncertainty, the entropy H(X):

H(X) = -K \sum{p_i \log(p_i)}

To illustrate, consider a coin with bias p.  Our uncertainty is maximized for a fair coin, p = 0.5, and smallest at p = 0.0 (certain tails) or 1.0 (certain heads).

entropy-uncertainty-example-graph-of-h-1
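A few lines of Python reproduce this curve, taking K = 1 and base-2 logarithms (that is, measuring uncertainty in bits):

```python
import math

def entropy(ps):
    """H = -K * sum(p * log p), with K = 1 and log base 2 (bits)."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

for bias in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(bias, round(entropy((bias, 1 - bias)), 3))
# 0.0 -> 0 bits (certain), 0.5 -> 1 bit (maximal uncertainty), 1.0 -> 0 bits

# Entropy is also the expected surprisal E[-log2 p], anticipating the next section:
ps = (0.5, 0.25, 0.25)
print(math.isclose(sum(p * -math.log2(p) for p in ps), entropy(ps)))   # True
```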

Entropy vs Information

What is the relationship between uncertainty and information? To answer this, we must first understand information.

Consider the number of possible sentences in a book. Is this information? Notice that two books contain not twice, but multiplicatively more possible sentences than one book (the counts multiply: W × W).

When we speak of information, we desire it to scale linearly with its length. Two books should contain approximately twice as much information.

If we take the logarithm of the possible messages W, we can preserve this intuition:

I(X) = K \log(W)

Recall that,

H(X) = -K \sum_i p_i \log(p_i)

From here, we can show that entropy is expected information:

H(X) = \sum_i p_i I(x_i), where I(x_i) = -K \log(p_i)

H = E\langle I \rangle

What does this discovery mean, though?

Imagine a device that produces 3 symbols, A, B, or C. As we wait for the next symbol, we are uncertain which symbol comes next. Once a symbol appears our uncertainty decreases, because we have received more information. Information is a decrease in entropy.

If A, B, and C occur at the same frequency, we should not be surprised to see any one letter. But if P(A) approaches 0, then we will be very surprised to see it appear; the formula says I(A) approaches ∞. For the receiver of a message, information represents surprisal.

On this interpretation, the above formula becomes clear. Uncertainty is anticipated surprise. If our knowledge is incomplete, we expect surprise. But confident knowledge is “surprised by surprise”. 

Conclusions

The great contribution of information theory lies in a measure for probabilistic uncertainty.

We desire this measure to be continuous, monotonic, and compositionally invariant. There is only one such function, the entropy H:

H(X) = -K \sum{p_i \log(p_i)}

This explains why a broad distribution is more uncertain than one that is narrow.

Henceforth, we will view the words “entropy” and “uncertainty” as synonymous.

Related Works

  • Shannon (1948). A Mathematical Theory of Communication
  • Jaynes (1957). Information Theory and Statistical Mechanics
  • Schneider (1995). Information theory primer

An Introduction To Energy

Part Of: Demystifying Physics sequence
Content Summary: 700 words, 7 min read

Energy As Universal Currency

Why does burning gasoline allow a car to move? Chemical reactions and kinetic propulsion seem quite distinct.

How does a magnet pull a nail from the ground? What relation exists between magnetism and gravitational pull?

What must occur for a nuclear reactor to illuminate a light bulb? What connection is there between nuclear physics and light waves?

Energy is the hypothesis of a hidden commonality among the above phenomena. There are many forms of energy: kinetic, electric, chemical, gravitational, magnetic, radiant. But these forms are expressions of a single underlying phenomenon.

A single object may possess many different energy forms simultaneously:

A block of wood thrown into the air will possess kinetic energy because of its motion, gravitational potential energy because of its height above the ground, chemical energy in the wood (which can be burned), heat energy depending on its temperature, and nuclear energy in its atoms (this form is not readily available from our block of wood, but the other forms may be).

Non-physicists worry that physics involves memorizing a giant catalogue of phenomena, each discovered by some guy who got an equation named after him. Energy is the reason why physics is not “stamp collecting”. It allows us to seamlessly switch between different phenomena.

Change As Energy Transformation

The only thing that is constant is change.
        -Heraclitus

Certain Greek philosophers were obsessed with change. Physics gives us a language that formalizes these intuitions. Consider the following.

  • Energy is the capacity to do work; that is, initiate processes.
  • Processes (i.e., work) involve continuous and controlled actions, or changes.

A confusing diversity of phenomena can produce force (the ability to accelerate things), but they all share the same blood. Change is energy transformation.

We can play “where does the energy come from” game on literally any event in the physical universe:

energy-processes-as-energy-transmutation-1

A Worked Example

Let’s get concrete, and go through a simple illustration of work as energy transformation. 

Our example involves a ball sliding down an incline.

energy-incline-motion-example-problem-3

To get our bearings, let’s calculate the acceleration experienced by the ball:

energy-incline-motion-example-acceleration-3

To demonstrate conservation of energy, we first need to solve for the ball’s final velocity (v) at the bottom of the ramp. Recall the lesson of kinematic calculus, that displacement, velocity, and acceleration are intimately related:

x(t) = \int v(t) = \int \int a(t)

a(t) = v'(t) = x''(t)

We can use these formulae to calculate final velocity (for a tutorial see here).

energy-incline-example-kinematic-solution-6

So the ball’s final velocity will be 6.26 meters per second (that is, \sqrt{4g} m/s). However, recall the classical definitions of kinetic and potential (gravitational) energy, which are KE = \frac{1}{2}mv^2 and PE = mgh.

If conservation of energy is true, then we should expect the following to hold:

(KE + PE)_{final} - (KE + PE)_{initial} = 0

Is this in fact the case? Yes!

m\left[(0 + 2g) - \left(\frac{(\sqrt{4g})^2}{2} + 0\right)\right] = m[2g - 2g] = 0

Total energy of the ball stays the same across these two points. Conservation of energy can also be demonstrated at any other time in the ball’s journey. In fact, we can show that potential energy is smoothly converted to kinetic energy (image credit Physics Classroom).

conservation_energy
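A numerical sketch confirms the bookkeeping. It assumes the 2 m drop implied by the final speed \sqrt{4g} m/s, and an arbitrary 1 kg mass (the mass cancels in the comparison anyway):

```python
import math

g = 9.8      # m/s^2
m = 1.0      # kg (arbitrary; mass cancels in the comparison)
h0 = 2.0     # m: the drop height implied by v_final = sqrt(4g)

def total_energy(h, v):
    """E = KE + PE = (1/2)mv^2 + mgh."""
    return 0.5 * m * v**2 + m * g * h

v_final = math.sqrt(4 * g)             # ~6.26 m/s
print(total_energy(h0, 0.0))           # 19.6 J at the top (all potential)
print(total_energy(0.0, v_final))      # 19.6 J at the bottom (all kinetic)
```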

Why is the ball transforming its energy PE \rightarrow KE? Because it is experiencing a force.

Cosmological Interpretation

Just as space and time are expressions of a single spacetime fabric, Einstein also demonstrated mass-energy equivalence. Mass is simply a condensed form of energy, capable of being released in e.g., nuclear explosions. Do not confuse mass and matter. Mass and energy are (interchangeable) properties of matter.

So there are two components of reality: energymass and spacetime. Spacetime bends for energymass. Energy is conserved, but has many faces. The laws of physics (quantum field theory and general relativity) describe the interactions between energymass and spacetime. And since we know that energy is conserved, all there is today was all there was at the beginning of time.

If the amount of energy contained in our universe doesn’t change, what is this quantity? How much energy is there? One strong candidate theory is zero. The flat universe hypothesis rests on the realization that gravitational energy is negative, and claims that it counterbalances all other forms of energy.

energy-flat-universe-hypothesis

Next time, we will explore the distinction between usable vs inaccessible energy. Until then.

Two Cybernetic Loops

Part Of: Neuroanatomy sequence
Content Summary: 800 words, 8 min read

What Is Perception About?

Consider Aristotle’s five senses: vision, hearing, smell, touch, and taste. We know that senses are windows into physical reality. But what aspects of reality do these represent?

Vision and hearing have a special property: despite receptors being located within the body (proximal), they carry information about phenomena outside of the body (distal). They carry information about the world. In contrast, smell, touch, and taste only represent events close to the body; these encode the interaction between body and world.

This distinction is a neural primitive: the brain encodes World and Interaction in extrapersonal and peripersonal space, respectively.

However, there is a significant lacuna within this binary system: none of these concern the body. Body sensation is a crucial “sixth sense”:

twoloops-body-vs-world

Making Sense of Anatomy

We spend a lot of time discussing the nervous system. But the body houses eight other anatomical systems: reproductive, integumentary (skin), muscular, skeletal, endocrine (hormones), digestive (incl. urinary and excretory subsystems), circulatory (incl. immune and lymphatic subsystems), and respiratory.

To regulate these systems, your brain recruits the following peripheral nervous systems:

  1. Somatic, which contains spinal nerves and cranial nerves
  2. Autonomic, incl. the sympathetic “fight/flight” and parasympathetic “rest/digest” systems
  3. Neuroendocrine, incl. the HPA, HPG, HPT, and Neurohypophyseal axes
  4. Enteric, also called the “second brain”, a large mass of digestion-oriented neurons
  5. Neuroenteric, which connects the brain to the enteric nervous system via the microbiome-gut-brain axis
  6. Neuroimmune, recently discovered, primarily mediated by glial cells
  7. Glymphatic, recently discovered, which removes metabolites via CSF during sleep
  8. Neurogaseous, recently discovered, mediated by gasotransmission

The CNS must coordinate all of these to respond to sense data and regulate anatomical systems. A complex undertaking. How might we understand such a process?

With the above trichotomy { world, interaction, body }, anatomical and sensory systems can be organized into meaningful categories:

twoloops-two-kinds-of-neural-phenomena-1

The Interlocking Loop Hypothesis posits the existence of two perception-action loops, inhabiting a gradient of abstraction:

  1. The somatic “cold” loop, world- and interaction-oriented, from exteroception to movement.
  2. The visceral “hot” loop, body-oriented, from interoception to body regulation.

Loops As Organizing Principle

Evidence for the Interlocking Loop Hypothesis comes from two anatomical principles of organization:

First, the Bell-Magendie Law is based on the observation that, in all chordates, sensory information is processed at the back of the brain, and behavioral processes at the front (“posterior perception, anterior action”):

cybernetics-posterior-perception-anterior-action

Second, the Medial Viscera Principle is the observation that visceral processes tend to reside in the center of the brain (medial regions):

two-loops-medial-viscera-principle-4

Thus we can see our loops clustering at different levels of the abstraction hierarchy.

We can also see our loops’ primary site of convergence:

Anatomically, the two loops converge on the basal ganglia, in which both somatic and visceral processes are blended to yield coherent behavior.

two-loops-intersection-at-basal-ganglia

The above quote & image are from Panksepp (1998), Affective Neuroscience.

The Basis of Motivation

Why should our two loops converge on the basal ganglia? The basal ganglia is the substrate of motivation, or “wanting”. It also participates in reinforcement learning, which has a mathematical interpretation in terms of Markov Decision Processes (MDPs).

Historically, the reward function in MDPs has proven difficult to interpret biologically; however, this task becomes straightforward on the Interlocking Loop Hypothesis. Of course the cold loop would tune its behavior to promote the hot loop’s efforts to keep the organism alive.

two-loops-motivation

The Basis of Consciousness

In Can Consciousness Be Explained?, I wrote:

Let me put forward a metaphor. Consciousness feels like the movies. More specifically, it comprises:

  1. The Mental Movie. What is the content of the movie? It includes data captured by your eyes, ears, and other senses.
  2. The Mental Subject. Who watches the movie? Only one person, with your goals and your memories – you!

On this view, to explain consciousness one must explain the origins, mechanics, and output of both Movie and Subject. (Of course, one must be careful that the Subject is not a homunculus, on pain of recursion!)

The Interlocking Loop hypothesis offers an obvious foothold in the science of consciousness:

  • The world-centric cold loop generates the Mental Movie (“a world appears”). 
  • The body-centric hot loop creates the Subject (“narrative center of gravity”)

Thus, we are no longer surprised that opioid anomalies (a visceral loop instrument) are linked to depersonalization disorders; whereas dopamine (the promoter of somatic behavior) is associated with subjective time dilation effects.

Takeaways

First, we introduced the Interlocking Loop Hypothesis:

  • Some perceptions are about the world, others are about the body.
  • The CNS comprises a visceral, body-centric hot loop and a somatic, world-centric cold loop
  • Bell-Magendie Law: perception for both loops is posterior, action is anterior.
  • Medial Viscera Principle: hot loop is located medially, while cold loop is more lateral.

Then, we examined its implications:

  • Motivation, as generated by the basal ganglia, is loop communication software; it allows the hot loop to influence cold loop behavior.
  • Consciousness has two components: the Mental Movie and Mental Subject. These are supported by cold and hot loops, respectively.

Until next time.

Relevant Materials

  • Northoff & Panksepp (2008). The trans-species concept of self and the subcortical–cortical midline system

 

The Function Of The Basal Ganglia

Part Of: Neuroeconomics sequence
Content Summary: 800 words, 8 min read

Reward Prediction Error

An efficient way to learn about the world and its effect on the organism is by utilizing a reward prediction error (RPE) signal, defined as:

\Delta_t = \left[ r_t(s) + \gamma \sum_{s'} P(s'|s)V_{t+1}(s') \right] - V_t(s)

The RPE is derived from the Bellman equation, and captures changes in valuation across time. It is thus an error term, a measure of surprise; these are the lifeblood of learning processes.

Phasic dopamine bursts are the vehicle for the RPE signal.

rewardsubstrate-phasic-dopamine-rpe-1

During behavioral conditioning, an animal learns that a behavior is predictive of reward. In such a learning environment, we can see the RPE “travelling forward” in time, until it aligns with cue onset.
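To see this travel concretely, here is a minimal temporal-difference simulation sketch in Python. The trial structure, learning rate, and trial counts are illustrative assumptions, not experimental values:

```python
# TD(0) learning over a single conditioning trial: timesteps 0..T-1,
# cue at t=0, reward delivered at the final timestep.
T, alpha, gamma = 5, 0.2, 1.0
V = [0.0] * (T + 1)          # value estimate per timestep (V[T] = 0, post-trial)

for trial in range(200):
    deltas = []
    for t in range(T):
        r = 1.0 if t == T - 1 else 0.0         # reward at the final timestep
        delta = r + gamma * V[t + 1] - V[t]    # reward prediction error (RPE)
        V[t] += alpha * delta
        deltas.append(round(delta, 2))
    if trial in (0, 20, 199):
        print(trial, deltas)
# Early trials: the RPE spikes at reward delivery. Over training, the error
# travels backward toward cue onset, then fades as predictions become accurate.
```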

Actors and Critics

The RPE signal is used to update the following structures:

  • A policy 𝝅 which maps states to actions, S → A.
  • A value function V(s) which captures expected future reward, given the current state.

These functions can be computed separately. We call actor the process that updates the policy, and critic the process that updates the value function.

In fact, actors come in different flavors:

  • Model-based actors which create models of how the world works (specifically, models of reward function R and transition function T).
  • Model-free actors, which compute policy functions directly, without relying on declarative knowledge.

Model-based approaches to reinforcement learning are outcome-directed, and encode Action-Outcome (AO) Learning. In contrast, model-free approaches correspond to psychological notions of habit, and behaviorist notions of Stimulus-Response (SR) Learning.

If an animal is using an AO Actor, when they see a reward being moved, they immediately update their model and move towards the new location. In contrast, an SR Actor will learn much more slowly, and require several failed attempts at the old solution before updating its reward topography. Animals show evidence for both behaviors.

The above structures are directly implemented in the three loops of the basal ganglia. Specifically, the AO Actor, SR Actor, and Critic are identified as the Associative, Sensorimotor, and Limbic loops, respectively.

We might define habituation as decisions once handled by the AO Actor being transferred to the SR Actor. Correspondingly, when brains learn a habit, we see neural activity transition from the Associative to the Sensorimotor loop.

Wanting and Liking

But there is more to reward than learning. Reward also relates to two other processes: wanting (motivation) and liking (hedonics).

Wanting can be measured by response rate. Strong evidence identifies response vigor (incentive salience) with tonic dopamine levels within the basal ganglia Limbic Loop (VTA to NAc). High tonic dopamine is associated with subjective feelings of enthusiasm, whereas low levels induce apathy. Pathologically high levels of tonic DA are expressed in schizophrenic delirium, pathologically low levels in Parkinson’s disease (disinterest in movement, thought, etc).

Wanting is the substrate of arousal, or motivation. Its purpose is to control metabolic expenditure. We can see evidence for this in adjunctive behaviors: a severely hungry rat is highly aroused; if food is out of reach, it will still engage in ritualistic behaviors, such as pacing, gnawing wood, or running excessively. Since it is highly aroused, and consummatory behavior is impossible, this “energy” spills out in unrelated behaviors.

Pleasure and displeasure reactions can be measured by unique facial expressions. Strong evidence identifies liking systems with opioid neurochemistry, as expressed by hot/cold spots in the nucleus accumbens (NAc). This system produces subjective feelings of pleasure and displeasure. Pathologically high levels of opioids (morphine-like substances) result in mania; the converse is comorbid with anhedonia.

We can say that opioids collate information about hunger, thirst, pain, etc. into a summary statistic of body state.

rewardsubstrate-wanting-vs-liking

Takeaways

Reinforcement learning predicts the existence of three learning structures: an SR Actor which behaves habitually, an AO Actor which behaves in accordance with a model, and a Critic that performs outcome valuation. These three structures are implemented as the three reentrant loops in the basal ganglia.

Besides the directive effects of learning, reward also stimulates wanting (i.e., arousal) and liking (i.e., valence). These functions are implemented as three distinct neurochemical mechanisms.

rewardsubstrate-construct-biology-mapping

Related Works

I highly recommend the following papers, which motivate our discussion of reentrant loops and neurochemistry, respectively.

  • Maia (2009). Reinforcement learning, conditioning, and the brain: Successes and challenges.
  • Berridge et al (2009). Dissecting components of reward: liking, wanting, and learning.

You might also explore the following, for a contrary opinion:

  • Bromberg-Martin et al (2010). Dopamine in motivational control: rewarding, aversive, and alerting

Evolution of the Basal Ganglia

Part Of: Neuroeconomics sequence
Followup To: An Introduction to the Basal Ganglia

Natural History

The Earth accreted from a protoplanetary disc 4.5 billion years ago (Ga). Geologists break up Earth’s history into four eons: the Hadean, Archaean, Proterozoic, and Phanerozoic.

At 3.8 Ga, abiogenesis occurred, and the sea was awash with bacteria. Since then, there have been five major events in the history of life.

  1. At 1.85 Ga, bacterial merging (symbiogenesis) led to the advent of eukaryotes, whose organelles improved cellular flexibility
  2. At 800 Ma, the advent of multicellularity: some eukaryotes discovered ways to act meaningfully in groups.
  3. At 580 Ma, animal-like adaptations, such as motility and ability to consume other living matter (heterotrophy), set off the Cambrian Explosion.
  4. At 380 Ma, some animals developed four limbs (tetrapods) and the ability to become terrestrial animals.
  5. At 320 Ma, some terrestrial animals developed mammary glands, and saw the spark of the mammals.

common-descent-natural-history

We can use the tree of life to better understand these anatomical milestones. Since all life on this planet is related (common descent), we can represent familial relations just as you would on ancestry.com. Key innovations in organism body-plans can be embedded in such graphics, as follows:

common-descent-phylogeny-milestones-2

When confronted with some biological structure, we can employ comparative anatomy to discover its origin. If an adaptation is shared across multiple species, we can infer either homology (the innovation of some common ancestor) or homoplasy (an adaptation appearing independently, a.k.a “convergent evolution”).  

For example, the spine is a homology; whereas homeothermy (warm-bloodedness) and multicellularity are homoplasies. 

Full Circuit in Vertebrates

Last time, we discussed the basal ganglia, a brain structure that is intimately involved in motivation and behavior. Here, we use comparative anatomy to discover the evolutionary origin of the basal ganglia. By dissecting brains from eight representative species, we can infer that the basal ganglia dates back to the origin of vertebrates:

bg-evolution-identifying-homology

Specifically, here are the frontal sections of the eight species. By employing sophisticated histochemistry techniques such as TH-immunostaining, we are able to directly visualize the striatal and pallidal regions of the representative basal ganglia.

bg-evolution-frontal-sections-representative-species-1

This investigation was conducted by Anton Reiner in his aptly-titled 2009 paper, You cannot have a vertebrate brain without a basal ganglia. The basal ganglia is not the “reptile brain”, contra the triune brain hypothesis. It is, in fact, much older.

Ancient Subcortical Loops

One of the key structures in the midbrain is the corpora quadrigemina (Latin for “four bodies”). It is composed of bilateral expressions of the superior colliculus (SC), and the inferior colliculus (IC). Anatomically, these structures are four bumps at the posterior of the midbrain; for this reason, the corpora quadrigemina is also called the tectum (Latin for “roof”).

bg-evolution-corpora-quadrigemina

The SC receives input from the retina, and projects to the LGN nucleus of the thalamus. The IC receives input from the auditory system, and projects to the MGN nucleus of the thalamus. For this reason, it is easy to describe these structures as a vision center and an audio center, respectively.

However, there is more to the story. SC and IC represent space topographically, and densely innervate one another. They seem to participate in coordinate transformations, which integrate multimodal sensory information. The SC and IC are actually composed of distinct anatomical regions, each of which perform specialized tasks. Importantly, the SC Deep Layer functions as a control center: basically, a predecessor of the motor cortex.

bg-evolution-corpora-quadrigemina-tectum-layers

We have seen the basal ganglia processing information from the neocortex. But the neocortex is a mammalian innovation. What did the basal ganglia do before the invention of the neocortex? If you look carefully at the basal ganglia, you can actually see projections from the GPi / SNr / VP node into the superior colliculus (SC). It turns out that the SC drives its own loop through the basal ganglia:

basal-ganglia-subcortical-loops

The basal ganglia evolved as a general-purpose reinforcement learning device, assisting the behavioral computations of the superior colliculus. As motor cortex M1 began to complement and compete with the SC for motor control, it too was built on top of basal ganglia loops.

For more details, see McHaffie et al (2005). Subcortical loops through the basal ganglia.

Simplified Circuit in Arthropods

Arthropods (including insects) arose long before vertebrates, evolving around the Cambrian period. Arthropods have a nerve cord: a predecessor of the spinal cord. Each segment of the body corresponds with a nerve bundle called a ganglion. The head segment of insects, called the cephalon, is particularly important insofar as its associated ganglion (the cerebral ganglion) is the direct predecessor of the brain.

Within the cerebral ganglion, we find structures called neuropils (analogous to modern-day nuclei) which perform specific functions:

bg-evolution-arthropod-nervous-system

One such structure (located in the protocerebrum), is the central complex (in above diagram, called the central body, CB). The central complex contains a fan-shaped body which strikingly resembles the mammalian striatum:

bg-evolution-stiatum-topography

The similarities do not stop there. The basal ganglia and central complex share homologous circuitry, and are even created by the same genetic material. In fact, we can conclude that they are the same structure, with different names. 

bg-evolution-deep-homology

Recall that the basal ganglia contains two pathways: direct and indirect. The central complex does not have an indirect pathway! This suggests that the indirect pathway evolved later, as an elaboration of more primitive motivation circuitry.

For more information, see Strausfeld & Hirth (2013). Deep Homology of Arthropod Central Complex and Vertebrate Basal Ganglia. See this response, however, for a critique.

The Evolution of Dopamine

Dopamine plays a key role in behavioral readiness. The basal ganglia contains ten times more dopamine receptors than any other brain area. When did dopamine evolve? Recall that, as a catecholamine, dopamine (DA) is heavily related to norepinephrine (NE) and epinephrine (EPI):

dopamine-catecholamine-synthesis-3

In order for these neurotransmitters to influence the nervous system, neurons must have receptors responsive to the aforementioned chemicals. Our question becomes, when did these receptors evolve?

By genomic analysis, we can confirm that DA transporters (DAT) came into existence with the invention of bilateral symmetry. This basal bilaterian also contained transporters for serotonin (SERT) and a highly flexible transporter for monoamines (MAT).

In protostomes, the MAT gene was destroyed via mutation, and replaced with the octopamine transporter (OAT). Let me repeat that. Dopamine is not used by insects and their kin: instead, the related chemicals tyramine and octopamine (bolded above) are used in its place.

History was not much kinder to the deuterostomes, whose dopamine transporter was destroyed. However, this clade duplicated the MAT gene to resurrect dopamine receptivity in subsequent chordates (cDAT).

bg-evolution-biogenic-amine-evolution-1

The above analysis clearly demonstrates the volatility of natural selection, and how natural selection uses the resources at its disposal to construct neurotransmitter systems like dopamine. For more information, see Caveney et al (2006). Ancestry of neuronal monoamine transporters in the Metazoa.

Summary

  • Comparative anatomy dates the emergence of the basal ganglia to at least as early as the vertebrate clade.
  • The basal ganglia also supports the “control center” of the Deep Layer of the SC, which predates its support of the neocortex.
  • Incredibly, the basal ganglia predates the brain, originating prior to arthropods (insects)! The central complex is the arthropod expression of the basal ganglia.
  • The arthropod version of the basal ganglia does not include an indirect pathway. This innovation happened later.
  • Dopamine assumed its role in promoting behavior even before the creation of the basal ganglia, near the invention of the core animal body-plan.

We will close by condensing these discoveries into a single graphic:

brain-evolution-milestones-4

Until next time.

An Introduction To Behaviorism

Part Of: Neuroeconomics sequence
Content Summary: 1100 words, 11 min read

Historical Context

William James (1842-1910), the “father of American psychology”, was also a world-renowned philosopher who, together with CS Peirce and John Dewey, founded the philosophical school of pragmatism. This illustrates that, in its early days, psychology (along with many other sciences) was more closely interwoven with philosophy.

This connection can be seen into the 1930s, when two related movements conquered the intellectual zeitgeist. In analytic philosophy, logical positivism from the Vienna Circle quickly gained traction. Logical positivism partitioned language into two components: synthetic (“this leaf is green”) and analytic statements (“all bachelors are men”). Further, it relied on the verification principle, that concepts are only meaningful by virtue of the operations used to measure them.

Concurrently, BF Skinner inaugurated the research tradition of behaviorism, which focused on relationships between Stimulus and Response (SR associations). Influenced by positivism, Skinner also promoted radical behaviorism, which insisted that talk about subjective experience, and all mental phenomena, had no meaning.

In 1951, Quine penned his Two Dogmas of Empiricism, which marked the death knell of logical positivism. And in 1959, Noam Chomsky wrote his Review of Skinner’s Verbal Behavior, a devastating takedown of radical behaviorism. Chomsky helped usher in the cognitive revolution, with its key metaphor of brain as computer. Importantly, just as we can inspect the inner workings of a computer, we can also hope to learn about events that transpire between our ears.

I disagree with the philosophy of radical behaviorism. But the empirical results of behaviorism persist, and demand explanation. So today, let’s explore what this research programme learned some sixty years ago.

Classical Conditioning

The most important result from behaviorist experiments is conditioning, a robust form of associative learning. It comes in two flavors:

  1. Classical conditioning is about learning in behavior-irrelevant situations (where e.g., a rat’s behavior doesn’t affect foot shock).
  2. Instrumental conditioning is about learning in behavior-relevant situations (where e.g., a rat’s behavior affects foot shock).

To illustrate the former, let’s turn to a famous experiment by Pavlov. Some behaviors seem innate: you don’t need to teach a rat to dislike pain, and you don’t need to teach a dog to salivate when presented with food. Call these innate associations unconditioned stimulus US and unconditioned response UR. In contrast, neutral stimuli fail to elicit meaningful behavior.

behaviorism-innate-stimuli-1

Pavlov’s insight: if you consistently ring a bell before providing food, the food-salivation association will change! The salivation reflex will travel forward in time, towards the conditioned stimulus CS (in this case, the bell).

behaviorism-classical-conditioning-2

Note the CS bell serves as a predictor of reward. Since salivation readies the mouth for digestion, it makes sense for the brain to initiate this preparation as soon as it learns of an imminent meal.
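This forward travel can be modeled. The sketch below uses the Rescorla-Wagner learning rule (a later formalization of conditioning, not part of Pavlov's own work), with illustrative parameters:

```python
# Rescorla-Wagner: the bell's associative strength V grows toward the maximum
# supportable by the food (lambda), driven by prediction error.
alpha, lam = 0.3, 1.0        # learning rate and asymptote (illustrative)
V_bell = 0.0

for trial in range(1, 11):
    error = lam - V_bell     # surprise: food obtained minus food predicted
    V_bell += alpha * error  # the bell comes to predict food, eliciting salivation
    print(trial, round(V_bell, 3))
```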

On Carrots And Sticks

Have you ever heard the expression “should I use a carrot or stick”? This idiom derives from a cart driver dangling a carrot in front of a mule and holding a stick behind it. The claim is that there are basically two routes to alter behavior: positive and negative feedback.

But this idiom is incomplete. We must also consider the effect of removing carrots & sticks. So there are four ways to alter behavior:

behaviorism-carrot-vs-stick

Let us reorganize this taxonomy, by grouping actions that increase or decrease behavior, respectively:

behaviorism-reinforcement-vs-punishment

Instrumental Conditioning

Of course, proverbs can be wrong! For example, the “sugar high” myth remains alive and well, at least in my social circles. Do “carrots and sticks” really alter behavior?

Back in 1898, Thorndike conducted his Puzzle Box experiment, which trapped a cat in a small space, with a hidden lever that facilitated escape. On the first trial, the cat relied heavily on its innate escape behaviors: scratching at bars, pushing at the ceiling, etc. However, the only behavior that facilitated escape was pressing the lever. After repeating this experiment several times, the cat learned not to waste time scratching: it pressed the lever immediately.

behaviorism-law-of-effect-3

This is proof of negative reinforcement: after pressing a lever, an unpleasant state (confinement) was removed, and frequency of lever-pressing subsequently increased.

By now, strong evidence supports instrumental learning from all four kinds of reinforcement and punishment. The Law of Effect expresses this succinctly:

Responses that produce a satisfying effect in a particular situation become more likely to occur again in that situation, and responses that produce a discomforting effect become less likely to occur again in that situation.

This effect occurs in nearly all biological organisms! This suggests that the brain mechanisms underlying this ability are highly conserved across species. But I will leave my remarks on the biological substrate of conditioning for another post. 🙂

Shaping

Reinforcement and punishment are powerful learning tools. One successful research programme of behaviorism is behavioral control: engineering the right sequence of positive & negative outcomes to dictate an organism’s behavior.

Shaping is an important vehicle for behavioral control. If you desire an animal to exhibit some behavior (even one that would never occur naturally), you simply apply differential reinforcement of successive approximations. Take, for example, rat basketball:

Animal training relies heavily on shaping techniques. In the words of BF Skinner,

By reinforcing a series of successive approximations, we bring a rare response to a very high probability in a short time. … The total act of turning toward the spot from any point in the box, walking toward it, raising the head, and striking the spot may seem to be a functionally coherent unit of behavior; but it is constructed by a continual process of differential reinforcement from undifferentiated behavior, just as the sculptor shapes his figure from a lump of clay.

The topic of behavioral control is often met with discomfort: don’t such findings empower manipulative people? I want to add two comments here:

  • You don’t see primates shaping each other’s behaviors, despite its obvious adaptive value. Why? I suspect primates like ourselves possess emotional software that detects and punishes social manipulation. Specifically, I suspect our intuitions about personal autonomy and moral inflexibility evolved for precisely this purpose.
  • Shaping, and associative learning, are not unlimited in scope. You can shape a rat to play basketball, but shaping will completely fail to produce e.g., self-starvation. The brain is not a tabula rasa, and it cannot be stretched beyond its biological constraints.

Takeaways

  • Philosophically, radical behaviorism collapsed in the 1970s. But it left behind important empirical results.
  • A wide swathe of animals exhibit classical conditioning, which is learning to associate innate responses with (previously meaningless) predictors.
  • Extensive evidence also suggests the brain can perform instrumental conditioning, a more behaviorally-relevant form of learning.
  • Specifically, the Law of Effect states that satisfying outcomes increase the preceding behavior, and vice versa.
  • The instrumental conditioning technique of shaping is still used today by animal trainers to install utterly novel behaviors in animals, such as rats playing basketball. 🙂

For another look at conditioning, I recommend this video.

Until next time.

An Introduction to Prospect Theory

Part Of: Neuroeconomics sequence
Content Summary: 1500 words, 15 min read

Preliminaries

Decisions are bridges between perception and action. Not all decisions are cognitive. Instead, they occur at all levels of the abstraction hierarchy, and include things like reflexes. 

Theories of decision tend to constrain themselves to cognitive phenomena. They come in two flavors: descriptive (“how does it happen”) and normative (“how should it happen”).

Decision making often occurs in the context of imperfect knowledge. We may use probability theory as a language to reason about uncertainty. 

Let risk denote variance in the probability distribution of possible outcomes. Risk can exist regardless of whether a potential loss is involved. For example, a prospect that offers a 50-50 chance of paying $100 or nothing is more risky than a prospect that offers $50 for sure – even though the risky prospect entails no possibility of losing money.

Today, we will explore the history of decision theory, and the emergence of prospect theory. As the cornerstone of behavioral economics, prospect theory provides an important theoretical surface to the emerging discipline of neuroeconomics.

Maximizing Profit with Expected Value

Decision theories date back to the 17th century, and a correspondence between Pascal and Fermat. There, consumers were expected to maximize expected value (EV), which is defined as probability p multiplied by outcome value x.

EV = px

To illustrate, consider the following lottery tickets:

prospect-theory-interchangeable-expected-value-options-2

Suppose each ticket costs 50 cents, and you have one million dollars to spend. Crucially, it doesn’t matter which ticket you buy! Each of these tickets has the same expected value: $1. Thus, it doesn’t matter if you spend the million dollars on A, B, or C – each leads to the same amount of profit.

The above tickets have equal expected value, but they do not have equal risk. We call people who prefer choice A risk averse, whereas someone who prefers C is risk seeking.

Introducing Expected Utility

Economic transactions can be difficult to evaluate. When trading an apple for an orange, which is more valuable? That depends on a person’s unique tastes. In other words, value is subjective.

Let utility represent subjective value. We can treat utility as a function u() that operates on objective outcome x. Expected utility, then, is highly analogous to expected value:

EU = pu(x)

Most economists treat utility functions as abstractions: people act as if motivated by a utility function. Neuroeconomic research, however, suggests that utility functions are physically constructed by the brain.

Every person’s utility function may be different. If a person’s utility curve is linear, then expected utility converges onto expected value:

EU \rightarrow EV \text{ when } u(x) = x

Recall, in the above lottery, the behavioral distinction between risk aversion (preferring ticket A) and risk seeking (preferring C). In practice, most people prefer A. Why?

We can explain this behavior by appealing to the shape of the utility curve! Utility concavity produces risk aversion:

The first $50 (first vertical line) produces more utility (first horizontal line) than the second $50.

Intuitively, the first $50 is needed more than the second $50. The larger your wealth, the less your need. This phenomenon is known as diminishing marginal returns.
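A short sketch makes the point explicit, using the 50-50 prospect described earlier and a square-root utility curve (an illustrative concave choice, not anyone's measured preferences):

```python
import math

u = math.sqrt    # an illustrative concave utility function

# Risky prospect: 50% chance of $100, else nothing (expected value $50)
eu_risky = 0.5 * u(100) + 0.5 * u(0)   # = 5.0
# Safe prospect: $50 for sure (same expected value)
eu_safe = u(50)                         # ~7.07

print(eu_risky < eu_safe)               # True: concavity implies risk aversion
```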

Neoclassical Economics

In 1944, von Neumann and Morgenstern formulated a set of axioms that are both necessary and sufficient for representing a decision-maker’s choices by the maximization of expected utility.

Specifically, if you assume an agent’s preference set accommodates these axioms…

1. Completeness. People have preferences over all lotteries.

\forall L_1, L_2 \in \mathcal{L}: either L_1 \preceq L_2 or L_1 \succeq L_2 (or both)

2. Transitivity. Preferences are expressed consistently.

\forall L_1, L_2, L_3 \in \mathcal{L}: if L_1 \preceq L_2 and L_2 \preceq L_3 then L_1 \preceq L_3

3. Continuity. No lottery is infinitely better or worse than another: any intermediate lottery can be matched by some mixture of the extremes.

\forall L_1 \succeq L_2 \succeq L_3: \exists \alpha, \beta \in (0,1) s.t. \alpha L_1 + (1-\alpha)L_3 \succeq L_2 \succeq \beta L_1 + (1-\beta)L_3

4. Substitution. If you prefer (or are indifferent to) lottery L_1 over L_2, mixing both with the same third lottery L_3 in the same proportion α must not reverse that preference: adding identical “padding” is irrelevant to the choice.

\forall\,L_1,L_2,L_3\in\mathcal{L},\;\forall\,\alpha\in(0,1):\; L_1\succeq L_2 \;\Rightarrow\; \alpha L_1 + (1-\alpha)L_3 \succeq \alpha L_2 + (1-\alpha)L_3

The above axioms constitute expected utility theory, and form the cornerstone for neoclassical economics.  Expected utility theory bills itself as both a normative and descriptive theory: that we understand human decision making, and have a language to explain why it is correct.

Challenges To Substitution Axiom

In the 1970s, expected utility theory came under heavy fire for failing to predict human behavior. The emerging school of behavioral economics gathered empirical evidence that von Neumann-Morgenstern axioms were routinely violated in practice, especially the substitution axiom.

For example, the Allais paradox asks our preferences for the following choices:

prospect-theory-allais-paradox-4

Most people prefer A (“certain win”) and D (“bigger number”). But these preferences are inconsistent, because C = 0.01A and D = 0.01B. The substitution axiom instead predicts that A ≽ B if and only if C ≽ D.

The Decoy effect contradicts the Independence of Irrelevant Alternatives (IIA). I find it to be best illustrated with popcorn:

decoy2

Towards a Value Function

Concurrently to these criticisms of the substitution axiom, the heuristics and biases literature (led by Kahneman and Tversky) began to discover new behaviors that demanded explanation:

  • Risk Aversion. In most decisions, people tend to prefer smaller variance in outcomes.
  • Loss Aversion. Everyone prefers gains over losses, of course; loss aversion reflects that losses are felt more intensely than gains of equal magnitude.
  • The Endowment Effect. Things you own are intrinsically valued more highly. Framing decisions as gains or as losses affects choice behavior.

prospect-theory-behavioral-effects-economic-biases

Each of these behavioral findings violate the substitution axiom, and cumulatively demanded a new theory. And in 1979, Kahneman and Tversky put forward prospect theory to explain all of the above effects.

Their biggest innovation was to rethink the utility function. Do you recall how neoclassical economics appealed to u(x) concavity to explain risk aversion? Prospect theory takes this approach yet further, and seeks to explain all of the above behaviors using a more complex shape of the utility function. 

Let value function \textbf{v(x)} represent our updated notion of utility. We can define the expected prospect \textbf{EP} of a gamble as probability multiplied by the value function:

EP = pv(x)

Terminology aside, each theory only differs in the shape of its outcome function.

prospect-theory-evolution-of-utility-function

Let us now look closer at the shape of v(x):

prospect-theory-value-function

This shape allows us to explain the above behaviors:

The endowment effect captures the fact that we value things we own more highly. The reference point in v(x), where x = 0, captures the status quo. Thus, the reference point allows us to differentiate gains and losses, thereby producing the endowment effect.

Loss aversion captures the fact that losses are felt more strongly than gains.  The magnitude of v(x) is larger in the losses dimension. This asymmetry explains loss aversion.

We have already explained risk aversion by concavity of the utility function u(x). v(x) retains concavity for material gains. Thus, we have retained our ability to explain risk aversion in situations of possible gains. For losses, v(x) convexity predicts risk seeking.

Towards a Weight Function

Another behavioral discovery, however, immediately put prospect theory in doubt:

  • The Fourfold Pattern. For situations that involve very high or very low probabilities, participants often switch their approaches to risk.

To be specific, here are the four situations and their resultant behaviors:

  1. Fear of Disappointment. With a 95% chance to win $100, most people are risk averse.
  2. Hope To Avoid Loss. With a 95% chance to lose $100, most people are risk seeking.
  3. Hope Of Large Gain. With a 5% chance to win $100, most people are risk seeking.
  4. Fear of Large Loss. With a 5% chance to lose $100, most people are risk averse.

Crucially, v(x) fails to predict this behavior. As we saw in the previous section, it predicts risk aversion for gains, and risk seeking for losses:

prospect-theory-fourfold-pattern-actual-vs-expected

Failed predictions are not a death knell to a theory. Under certain conditions, they can inspire a theory to become stronger!

Prospect theory was improved by incorporating a probability-weighting function.

EP = pv(x) \rightarrow EP = w(p)v(x)

Where w(p) has the following shape:

prospect-theory-weight-function

There are in fact two probability-weighting functions (Hertwig & Erev, 2009):

  1. Explicit weights represent probabilities learned through language; e.g., when reading the sentence “there is a 5% chance of reward”.
  2. Implicit weights represent probabilities learned through experience, e.g., when the last 5 out of 100 trials yielded a reward.
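For concreteness, here is a sketch of v(x) and w(p) using the functional forms and parameter fits from Tversky & Kahneman's (1992) cumulative prospect theory; both the forms and the numbers are imported from that paper, not derived in this post:

```python
def v(x, alpha=0.88, lam=2.25):
    """Value function: concave for gains, convex and steeper for losses."""
    return x ** alpha if x >= 0 else -lam * ((-x) ** alpha)

def w(p, gamma=0.61):
    """Probability weighting: overweights small p, underweights large p."""
    return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)

# Two rows of the fourfold pattern:
print(w(0.95) * v(100), v(95))  # ~45.6 < ~55.0: risk averse for near-certain gains
print(w(0.05) * v(100), v(5))   # ~7.6 > ~4.1: risk seeking for long-shot gains
```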

This change adds some mathematical muscle to the ancient proverb:

Humans don’t handle extreme probabilities well.

And indeed, the explicit probability-weighting function successfully recovers the fourfold pattern:

fourfold_pattern

Takeaways

Today we have reviewed theories of expected value, expected utility (neoclassical economics), and prospect theory. Each theory corresponds to a particular set of conceptual commitments, as well as a particular formula:

EV = px

EU = pu(x)

EP = w(p)v(x)

However, we can unify these into a single value formula V:

V = w(p)v(x)

In this light, EV and EU have the same structure as prospect theory. Prospect theory distinguishes itself by using empirically motivated shapes:

prospect-theory-evolution-of-both-functions

With these tools, prospect theory successfully recovers a wide swathe of economic behaviors.

prospect-theory-behavioral-explananda-2

Until next time.

Works Cited

  • Hertwig & Erev (2009). The description–experience gap in risky choice

Markov Decision Processes

Part Of: Reinforcement Learning sequence
Followup To: An Introduction To Markov Chains
Content Summary: 900 words, 9 min read

Motivations

Today, we turn our gaze to Markov Decision Processes (MDPs), a decision-making environment which supports our propensity to learn from good and bad outcomes. We represent outcome desirability with a single number, R. This value is used to refine action selection: given a particular situation, what action will maximize expected reward?

In biology, we can describe the primary work performed by an organism as maintaining homeostasis: preserving metabolic energy reserves, body temperature, etc. in a widely varying world.

Cybernetics provides a clear way of conceptualizing biological reward. In Neuroendocrine Integration, we discussed how brains must respond both to internal and external changes. This dichotomy expresses itself as two perception-action loops: a visceral body-oriented loop, and a cognitive world-centered one.

Rewards are computed by the visceral loop. To a first approximation, rewards encode progress towards homeostasis. Food is perceived as more rewarding when the body is hungry; this is known as alliesthesia. Reward information is delivered to the cognitive loop, which helps refine its decision making.

Reinforcement Learning- Reward As Visceral Efferent

Extending Markov Chains

Recall that a Markov Chain contains a set of states S, and a transition model P. A Markov Decision Process (MDP) extends this device, by adding three new elements.

Specifically, an MDP is a 5-tuple (S, P, A, R, ɣ):

  • A set of states s ∈ S
  • A transition model Pa(s’ | s).
  • A set of actions a ∈ A
  • A reward function R(s, s’)
  • A discount factor ɣ

To illustrate, consider GridWorld. In this example, every location in this two-dimensional grid is a state, for example (1,0). State (3,0) is a desirable location: R(s(3,0)) = +1.0, but state (3,1) is undesirable, R(s(3,1)) = -1.0. All other states are neutral.

Gridworld supports four actions, or movements: up, down, left, and right.  However, locomotion is imperfect: if Up is selected, the agent will only move up with 80% probability: 20% of the time it will go left or right instead. Finally, attempting to move into a forbidden square will simply return the agent to its original location (“hitting the wall”).

Reinforcement Learning- Example MDP Gridworld

The core problem of MDPs is to find a policy (π), a function that specifies the agent’s response to all possible states. In general, policies should strive to maximize reward, e.g., something like this:

Reinforcement Learning- Example MDP Policy

Why is the policy at (2,2) Left instead of Up? Because (2,1) is dangerous: despite selecting Up, there is a 10% chance that the agent will accidentally move Right, and be punished.
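A sketch of GridWorld's noisy transition model makes the 80/10/10 split concrete. The grid dimensions and wall behavior below are simplified assumptions relative to the figure:

```python
# Noisy GridWorld movement: the chosen action succeeds 80% of the time;
# otherwise the agent slips to one of the two perpendicular directions.
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
SLIPS = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}

def transition_model(state, action):
    """Return {next_state: probability} on a 4x3 grid with bordering walls."""
    def step(s, a):
        dx, dy = MOVES[a]
        nx, ny = s[0] + dx, s[1] + dy
        # Attempting to move into a forbidden square returns the agent home
        return (nx, ny) if 0 <= nx < 4 and 0 <= ny < 3 else s

    probs = {}
    for a, p in [(action, 0.8), (SLIPS[action][0], 0.1), (SLIPS[action][1], 0.1)]:
        s2 = step(state, a)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs

print(transition_model((2, 2), "up"))   # the 80/10/10 split over next squares
```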

Let’s now consider an environment with only three states A, B, and C.  First, notice how different policies change the resultant Markov Chain:

reinforcement-learning-policy-vs-markov-chain-1

This observation is important. Policy determines the transition model.

Towards Policy Valuation V(s)

An agent seeks to maximize reward. But what does that mean, exactly?

Imagine an agent selects 𝝅1. Given the resultant Markov Chain, we already know how to use matrix multiplication to predict future locations S_t. The predicted reward P_t is simply the dot product of the expected location and the reward function.

P_t = S_t \cdot R

markov-chains-linear-algebra-expected-value

We might be tempted to define the value function V(S) as the sum of all predicted future rewards:

V_0(S) = P_0 + P_1 + P_2 + P_3 + \dots = \sum{P_k}

However, this approach is flawed.  Animals value temporal proximity: all else equal, we prefer to obtain rewards quickly. This is temporal discounting: as rewards are further removed from the present, their value is discounted. 

In reinforcement learning, we implement temporal discounting with the gamma parameter: rewards that are k timesteps away are multiplied by the exponential discount factor \gamma^k. The value function becomes:

V_0(S) = P_0 + \gamma P_1 + \gamma^2 P_2 + \gamma^3 P_3 + \dots = \sum{\gamma^k P_k}

Without temporal discounting, V(s) can approach infinity. But exponential discounting ensures V(s) equals a finite value. Finite valuations promote easier computation and comparison of state evaluations. For more on temporal discounting, and an alternative to the RL approach, see An Introduction to Hyperbolic Discounting.
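Here is a small Python sketch of this computation for the three-state example above. The transition probabilities are read off the example figures (an assumption, since the exact numbers live in the images): P(B|A) = 0.8, P(A|A) = 0.2, P(C|B) = 0.8, P(B|B) = 0.2, with C absorbing and a reward of 1 for occupying state C:

```python
import numpy as np

P = np.array([[0.2, 0.8, 0.0],     # A -> {A, B}
              [0.0, 0.2, 0.8],     # B -> {B, C}
              [0.0, 0.0, 1.0]])    # C absorbing
R = np.array([0.0, 0.0, 1.0])      # reward for occupying state C
gamma = 0.9                        # illustrative discount factor

S = np.array([1.0, 0.0, 0.0])      # start in state A
V, coeffs = 0.0, []
for k in range(50):
    P_k = S @ R                    # predicted reward at time k (dot product)
    coeffs.append(round(P_k, 3))
    V += gamma**k * P_k
    S = S @ P                      # propagate expected occupancy one step
print(coeffs[:4])                  # [0.0, 0.0, 0.64, 0.896], the P_k used below
print(V)                           # V_0(A): finite, thanks to discounting
```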

Intertemporal Consistency

In our example, at time zero our agent starts in state A. We have already used linear algebra to compute our P_k predictions. To calculate value, we simply compute \sum{\gamma^k P_k}:

V_0(A) = 0 + 0 + 0.64 \gamma^2 + 0.896 \gamma^3

Agents compute V(s) at every time step. At t=1, two valuations are relevant:

V_1(A) = 0 + 0 + 0.64 \gamma^2 + \dots

V_1(B) = 0 + 0.8 \gamma + 0.96 \gamma^2 + \dots

mdp-value-function-timeslice

What is the relationship between the value functions at t=0 and t=1? To answer this, we need to multiply each term by \gamma P(X|A), where X is the state being considered at the next time step.

W_1(A) \triangleq \gamma P(A|A) V_1(A) = 0.2 \gamma V_1(A)

W_1(A) = 0 + 0 + (0.2)(0.64)\gamma^3 + \dots

Similarly,

W_1(B) \triangleq \gamma P(B|A)V_1(B) = \gamma 0.8 V_1(B)

W_1(B) = 0 + (0.8)(0.8) \gamma^2 + (0.8)(0.96) \gamma^3 + \dots

Critically, consider the sum X_0 = r_0(A) + W_1(A) + W_1(B):

X_0 = 0 + 0 + 0.64 \gamma^2 + 0.896 \gamma^3 + \dots

mdp-intertemporal-consistency

Does X_0 look familiar? That’s because it equals V_0(A)! In this way, we have a way of equating a valuation at t=0 and t=1. This property is known as intertemporal consistency.

Bellman Equation

We have seen that V_0(A) = X_0. Let’s flesh out this equation, and generalize to time t.

V_t(s) = r_t(s) + \gamma \sum_{s'}{P(s'|s)V_{t+1}(s')}

This is the Bellman Equation, and it is a central fixture in control systems. At its heart, we define value in terms of both immediate reward and future predicted value. We thereby break up a complex problem into small subproblems, a key optimization technique that can be approached with dynamic programming.
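As a sketch of that dynamic-programming idea, the loop below applies the Bellman equation as an update rule (iterative policy evaluation) on the same assumed three-state chain from earlier, sweeping until the values converge:

```python
# Iterative policy evaluation: use the Bellman equation as an update rule.
P = {"A": {"A": 0.2, "B": 0.8}, "B": {"B": 0.2, "C": 0.8}, "C": {"C": 1.0}}
R = {"A": 0.0, "B": 0.0, "C": 1.0}
gamma = 0.9

V = {s: 0.0 for s in P}
for sweep in range(200):
    # Synchronous sweep: each new V(s) uses the previous sweep's values
    V = {s: R[s] + gamma * sum(p * V[s2] for s2, p in P[s].items()) for s in P}
print(V)   # fixed point of the Bellman equation; V(C) = 1 / (1 - gamma) = 10
```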

Next time, we will explore how reinforcement learning uses the Bellman Equation to learn strategies with which to engage its environment (the optimal policy 𝝅). See you then!