An Introduction to Language Models

Part Of: Language sequence
Content Summary: 1500 words, 15 min read

Why Language Models?

In the English language, ‘e’ appears more frequently than ‘z’. Similarly,  “the” occurs more frequently than “octopus”. By examining large volumes of text, we can learn the probability distributions of characters and words.
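As a minimal sketch of what "examining text" means in practice, the snippet below counts letters and words in a tiny sample string. The function name `letter_and_word_counts` and the sample sentence are made up for illustration; a real estimate would need a far larger corpus.

```python
from collections import Counter

def letter_and_word_counts(text):
    """Count letters and whitespace-separated words, ignoring case."""
    lowered = text.lower()
    letters = Counter(ch for ch in lowered if ch.isalpha())
    words = Counter(lowered.split())
    return letters, words

# Tiny made-up sample; real frequency tables come from large corpora.
sample = "the cat and the dog saw the zebra"
letters, words = letter_and_word_counts(sample)
print(letters.most_common(3))
print(words.most_common(1))
```

Dividing each count by the total turns these tallies into the probability distributions discussed here.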

Language Models_ Letter and Word Frequency

Roughly speaking, statistical structure is distance from maximal entropy. The fact that the above distributions are non-uniform means that English is internally recoverable: if noise corrupts part of a message, the surrounding context can be used to recover the original signal. Statistical structure is also used to reverse engineer secret codes such as the Caesar cipher.

We can illustrate the predictability of English by generating text based on the above probability distributions. As you factor in more of the surrounding context, the utterances begin to sound less alien, and more like natural language.

Language Model_ Structure of English

A language model exploits the statistical structure of a language to express the following:

  • Assign a probability to a sentence P(w_1, w_2, w_3, \ldots w_N)
  • Assign probability of an upcoming word P(w_4 \mid w_1, w_2, w_3)

Language models are particularly useful in language perception, because they can help interpret ambiguous utterances. Three such applications might be,

  • Machine Translation: P(\text{high winds tonight}) > P(\text{large winds tonight})
  • Spelling correction: P(\text{fifteen minutes from}) > P(\text{fifteen minuets from})
  • Speech Recognition: P(\text{I saw a van}) > P(\text{eyes awe of an})

Language models can also aid in language production. One example of this is autocomplete-based typing assistants, commonly displayed within text messaging applications. 

Towards N-Grams

A sentence is a sequence of words \textbf{w} = (w_1, w_2, \ldots, w_N). To model the joint probability over this sequence, we use the chain rule:

p(\text{this is the house})

= p(\text{this})p(\text{is}\mid\text{this})p(\text{the}\mid\text{this is})p(\text{house}\mid\text{this is the})

As the number of words grows, the size of our conditional probability tables (CPTs) quickly becomes intractable. What is to be done? Well, recall the Markov assumption we introduced in Markov chains.


The Markov assumption constrains the size of our CPTs. However, sometimes we want to condition on more (or less!) than just one previous word. Let v denote how many variables we admit in our context. A variable order Markov model (VOM) allows v elements in its context: p(s_{t+1} | s_{t-v+1}, \ldots, s_{t}). Each CPT entry then spans n=v+1 variables, because we must take the predicted variable into account. Thus an N-gram is defined as an (N-1)-order Markov model. By far, the most common choices are trigrams, bigrams, and unigrams:

Language Models_ Ngram comparison (1)

We have already discussed Markov Decision Processes, used in reinforcement learning applications.  We haven’t yet discussed MRFs and HMMs. VOMs represent a fourth extension: the formalization of N-grams. Hopefully you are starting to appreciate the  richness of this “formalism family”. 🙂

Language Model_ Markov Formalisms (1)

Estimation and Generation

How can we estimate these probabilities? By counting!


Let’s consider a simple bigram language model. Imagine training on this corpus:

This is the cheese.

That lay in the house that Alice built.

Suppose our trained LM encounters the new sentence “this is the house”. It estimates its probability as:

p(\text{this is the house})

= p(\text{this})p(\text{is} \mid \text{this})p(\text{the} \mid \text{is})p(\text{house} \mid \text{the}) 

= \dfrac{1}{12} * 1 * 1 * \dfrac{1}{2} = \dfrac{1}{24}
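The counting behind this estimate can be sketched in a few lines. This is only an illustration on the toy corpus above (lowercased, punctuation removed); the helper name `bigram_probability` is hypothetical, and the sketch does not handle words that never appear as a context.

```python
from collections import Counter
from fractions import Fraction

# The toy corpus from the text, lowercased with punctuation removed.
corpus = [
    "this is the cheese".split(),
    "that lay in the house that alice built".split(),
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
total_words = sum(unigrams.values())

def bigram_probability(sentence):
    """p(w_1) * prod_i p(w_i | w_{i-1}), estimated by relative counts."""
    words = sentence.split()
    prob = Fraction(unigrams[words[0]], total_words)
    for prev, cur in zip(words, words[1:]):
        # Raises ZeroDivisionError if `prev` was never seen (not handled here).
        prob *= Fraction(bigrams[(prev, cur)], unigrams[prev])
    return prob

print(bigram_probability("this is the house"))  # prints 1/24
```

Exact fractions are used so the result can be compared directly against the hand calculation.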

How many problems do you see with this model? Let me discuss two.

First, we have estimated that p(\text{this}) = \dfrac{1}{12}. And it is true that “this” occurs only once among the twelve words of our toy corpus above. But out of two sentences, “this” leads half of them. We can express this fact by adding a special START token into our vocabulary.

Second, recall what happens when language models generate speech. Once they begin a sentence, they are unable to end it! Adding a new END token will allow our model to terminate a sentence, and begin a new one.

With these new tokens in hand, we update our products as follows:

Language Models_ Sentence Estimation (1)

A couple other “bug fixes” I’ll mention in passing:

  • Out-of-vocabulary words are given zero probability. It helps to add an unknown  (UNK) pseudoword and assign it some probability mass.
  • LMs prefer very short sentences (sequential multiplication is monotonically decreasing). We can address this by, e.g., normalizing by sentence length.


In the last sentence in the image above, we estimate p(END|house) = 0, because we have no instances of this two-word sequence in our toy corpus. But this causes our language model to fail catastrophically: the sentence is deemed impossible (0% probability).

This problem of zero probability increases as we increase the complexity of our N-grams. Trigram models are more accurate than bigrams, but produce more p=0 events. You’ll notice echoes of the bias-variance (accuracy-generalization) tradeoff.

How can we remove zero counts? Why not add one to every count? Of course, we’d then need to increase the size of our denominator (by the vocabulary size), to ensure the probabilities still sum to one. This is Laplace smoothing.

Language Model_ Laplace Smoothing
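Here is a rough sketch of add-one smoothing on the toy corpus, assuming the usual formula p(w | prev) = (count(prev, w) + 1) / (count(prev) + V), where V is the number of distinct words that can be predicted. The token spellings `<START>` and `<END>` and the helper name `laplace` are my own choices for illustration.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "this is the cheese".split(),
    "that lay in the house that alice built".split(),
]
# Pad each sentence with START and END tokens.
padded = [["<START>"] + sent + ["<END>"] for sent in corpus]

# How often each token appears as a context (everything except the final END).
context_counts = Counter(w for sent in padded for w in sent[:-1])
bigram_counts = Counter(pair for sent in padded for pair in zip(sent, sent[1:]))
# Words that can be predicted: every token except START.
vocab = {w for sent in padded for w in sent[1:]}

def laplace(prev, cur):
    """Add-one estimate of p(cur | prev)."""
    return Fraction(bigram_counts[(prev, cur)] + 1,
                    context_counts[prev] + len(vocab))

print(laplace("house", "<END>"))                 # no longer zero
print(sum(laplace("house", w) for w in vocab))   # still a valid distribution
```

The unseen pair ("house", END) now receives a small but non-zero probability, at the cost of shaving mass off the pairs we did observe.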

In a later post, we will explore how (in a Bayesian framework) such smoothing algorithms can be interpreted as a form of regularization (MAP vs MLE).

Due to its simplicity, Laplace smoothing is well-known. But several algorithms achieve better performance. How do they approach smoothing?

Recall that an event with zero count in an N-gram model is less likely to have zero count in an (N-1)-gram model. For example, it is very possible that the phrase “dancing were thought” hasn’t been seen before. 

Language Model_ Backoff Smoothing

While a trigram model may balk at the above sentence, we can fall back on the bigram and/or unigram models. This technique underlies the Stupid Backoff algorithm.

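As another variant on this theme, some smoothing algorithms train multiple N-grams, and essentially use interpolation as an ensembling method. Interpolated Kneser-Ney is a well-known example. Good-Turing takes a different route: it discounts observed counts to reserve probability mass for unseen events.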

Beam Search

We have so far seen examples of language perception, which assigns probabilities to text. Let us now consider language production, which generates text from the probabilistic model. Consider machine translation. For a French sentence \textbf{x}, we want to produce the English sentence \textbf{y} such that y^* = \text{argmax } p(y\mid x).  

This seemingly innocent expression conceals a truly monstrous search space. Deterministic (exhaustive) search has us examine every possible English sentence. For a vocabulary size V, there are V^2 possible two-word sentences. For sentences of length n, the time complexity of our brute force algorithm is O(V^n).

Since deterministic search is so costly, we might consider greedy search instead. Consider an example French sentence \textbf{x} “Jane visite l’Afrique en Septembre”. Three candidate translations might be,

  • y^A: Jane is visiting Africa in September
  • y^B: Jane is going to Africa in September
  • y^C: In September, Jane went to Africa

Of these, y^A is the best translation: p(y^A|x) is the highest of the three. We would like greedy search to recover it.

Greedy search generates the English translation, one word at a time. If “Jane” is the most probable first word \text{argmax } p(w_1 \mid x), then the next word generated is \text{argmax } p(w_2 \mid \text{Jane}, x). However, it is not difficult to contemplate p(\text{going}\mid\text{Jane is}) > p(\text{visiting}\mid\text{Jane is}), since the word “going” is used so much more frequently in everyday conversation. These problems of local optima happen surprisingly often.

The deterministic search space is too large, and greedy search is too confining. Let’s look for a middle ground.

Beam search resembles greedy search in that it generates words sequentially. Whereas greedy search only drills one such path in the search tree, beam search drills a fixed number of paths in parallel. Consider the following example with beamwidth b=3:


As you can see, beam search elects to explore y^A as a “second rate” translation candidate despite y^B initially receiving the most probability mass. Only later in the sentence does the language model discover the virtues of the y^A translation. 🙂
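To make the procedure concrete, here is a small sketch of beam search over a hand-built table of next-word probabilities. Everything in the `NEXT` table is invented for illustration (it stands in for p(word | prefix, x) for one fixed source sentence), and `beam_search` is a hypothetical helper, not a reference implementation.

```python
import math

# Invented next-word probabilities p(word | prefix, x) for one source sentence.
NEXT = {
    (): {"jane": 0.9, "in": 0.1},
    ("jane",): {"is": 1.0},
    ("jane", "is"): {"going": 0.6, "visiting": 0.4},
    ("jane", "is", "going"): {"to": 0.5, "home": 0.5},
    ("jane", "is", "going", "to"): {"<END>": 1.0},
    ("jane", "is", "going", "home"): {"<END>": 1.0},
    ("jane", "is", "visiting"): {"africa": 1.0},
    ("jane", "is", "visiting", "africa"): {"<END>": 1.0},
    ("in",): {"september": 1.0},
    ("in", "september"): {"<END>": 1.0},
}

def beam_search(next_probs, beam_width, max_len=10):
    """Return the most probable finished sentence found by the beam."""
    beams = [((), 0.0)]          # (prefix, log probability)
    finished = []
    for _ in range(max_len):
        # Extend every surviving prefix by every possible next word.
        candidates = [
            (prefix + (word,), logp + math.log(p))
            for prefix, logp in beams
            for word, p in next_probs[prefix].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        # Keep only the best `beam_width` candidates.
        for prefix, logp in candidates[:beam_width]:
            if prefix[-1] == "<END>":
                finished.append((prefix, logp))
            else:
                beams.append((prefix, logp))
        if not beams:
            break
    return max(finished, key=lambda c: c[1])[0]

print(beam_search(NEXT, beam_width=1))  # greedy search
print(beam_search(NEXT, beam_width=3))
```

With a beam width of one this reduces to greedy search, which commits to "going" and ends up with a less probable sentence. With a width of three, the "visiting" path survives long enough to win.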

Strengths and Weaknesses

Language models have three very significant weaknesses.

First, language models are blind to syntax. They don’t even have a concept of nouns vs. verbs!  You have to look elsewhere to find representations of pretty much any latent structure discovered by linguistic and psycholinguistic research.

Second, language models are blind to semantics and pragmatics. This is particularly evident in the case of language production: try having your SMS autocomplete write out an entire sentence for you. In the real world, communication is more constrained: we choose the most likely word given the semantic content we wish to express right now.

Third, the Markov assumption is problematic due to long-distance dependencies. Compare the phrase “dog runs” vs “dogs run”. Clearly, the verb suffix depends on the noun suffix (and vice versa). Trigram models are able to capture this dependency. However, if you center-embed a relative clause, e.g., “dog/s that live on my street and bark incessantly at night run/s”, N-grams fail to capture this dependency.

Despite these limitations, language models “just work” in a surprising diversity of applications. These models are particularly relevant today because it turns out that Deep Learning sequence models like LSTMs share much in common with VOMs. But that is a story we shall have to take up next time.

Until then.



An Introduction to Generative Syntax

Part Of: Language sequence
Content Summary: 900 words, 9 min read

Syntax vs Semantics

In language, we distinguish between syntax (structure) and semantics (meaning).

Compare the following:

  • “Colorless green ideas sleep furiously”
  • “Sleep ideas colorless green furiously”

Both sentences are nonsensical (a semantic transgression). But the first is grammatically correct, whereas the second is malformed.

The brain responds differently to errors of syntax and semantics, as measured by an EEG machine. Semantic errors produce a negative voltage after 400 milliseconds (“N400”); syntactic errors produce a positive voltage after 600 milliseconds (“P600”):

Syntax- Linguistic ERPs (1)

Parts of Speech

To understand syntax more precisely, we must differentiate parts of speech. Consider the following categories:

  • Noun (N).  cat, book, computer, peace, …
  • Verb (V). jump, chase, eat, sleep, …
  • Adjective (A). long, purple, young, old, …
  • Determiner (D). the, this, many, all, …
  • Preposition (P). in, on, to, for, with, …

Nouns and verbs correspond to perception- and action- representations, respectively. They are an expression of the perception-action cycle. But to study syntax, it helps to put aside semantic context, and explore how parts of speech relate to one another.

Phrases as Color Patterns

To understand syntax intuitively, start by adding color to sentences.  Then try to find patterns of color unique to well-formed sentences.

Let’s get started!

Syntax- Noun Phrase Abstraction (3)

“Noun-like” groups of words appear on either side of the verb. Let noun phrase (NP) denote such a group. Optional parts of speech are indicated by the parentheses. Thus, our grammar contains the following rules:

  1. S → NP V NP
  2. NP → (D) (A) N

These rules explain why the following sentences feel malformed:

  • “Chase dogs cats” (violates rule 1)
  • “Old some dogs chase cats” (violates rule 2)

But these rules don’t capture regularities in how verbs are expressed. Consider the following sentences:

Syntax- Verb Phrase Abstraction (1)

A verb phrase contains a verb, optionally followed by a noun phrase and/or a preposition with its own noun phrase.

  1. S → NP VP
  2. NP → (D) (A) N
  3. VP → V (NP) (P NP)

This is better. Did you notice how we improved our sentence (S) rule? 🙂 Subject-only sentences (e.g. “She ran”) are now recognized as legal.

Prepositions are not limited to verb phrases, though. They also occur in noun phrases. Consider the following:

Syntax- Prepositional Phrase Abstraction

Prepositions are sometimes “attached to” a noun phrase. We express these as a prepositional phrase, which includes a preposition (e.g. “on”) and an optional noun phrase (e.g. “the table”).

  1. S → NP VP
  2. NP → (D) (A) N (PP)
  3. VP → V (NP) (PP)
  4. PP → P (NP)

Notice how we cleaned up the VP rule, and improved the NP rule.
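The four rules can be turned into a tiny recognizer. The sketch below works on sequences of part-of-speech tags rather than words, and each helper returns every position where a phrase could end, so that optional elements are handled by trying both choices. All function names are hypothetical.

```python
def require(tags, positions, tag):
    """Positions reached after consuming exactly one `tag`."""
    return {i + 1 for i in positions if i < len(tags) and tags[i] == tag}

def skip_optional(tags, positions, tag):
    """Positions reached after consuming zero or one `tag`."""
    return positions | require(tags, positions, tag)

def np_ends(tags, start):
    """NP -> (D) (A) N (PP)"""
    pos = skip_optional(tags, {start}, "D")
    pos = skip_optional(tags, pos, "A")
    pos = require(tags, pos, "N")
    return pos | {end for i in pos for end in pp_ends(tags, i)}

def pp_ends(tags, start):
    """PP -> P (NP)"""
    pos = require(tags, {start}, "P")
    return pos | {end for i in pos for end in np_ends(tags, i)}

def vp_ends(tags, start):
    """VP -> V (NP) (PP)"""
    pos = require(tags, {start}, "V")
    pos = pos | {end for i in pos for end in np_ends(tags, i)}
    return pos | {end for i in pos for end in pp_ends(tags, i)}

def is_sentence(tags):
    """S -> NP VP, and the whole tag sequence must be consumed."""
    return any(len(tags) in vp_ends(tags, i) for i in np_ends(tags, 0))

print(is_sentence(["N", "V", "N"]))            # "Dogs chase cats"
print(is_sentence(["V", "N", "N"]))            # "Chase dogs cats"
print(is_sentence(["A", "D", "N", "V", "N"]))  # "Old some dogs chase cats"
```

The recursion always consumes a tag before calling another rule, so it terminates even though NP and PP refer to each other.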

Congratulations! You have discovered the rules of English. Of course, a perfectly complete grammar must include pronouns (e.g., “yours”), conjunctions (e.g., “and”), and interjections (e.g., “wow!”). But these are fairly straightforward extensions to the above system.

These grammatical rules need not only interest English speakers. As we will see later, a variant of these rules appears in all known human languages. This remarkable finding is known as universal grammar. Language acquisition is not about reconstructing syntax rules from scratch. Rather, it is about learning the parameters by which your particular natural language (English, Chinese, Egyptian) varies from the universal script.

From Rules to Trees

Our four rules are polymorphic: they permit more than one kind of structure. Unique rule sets are easier to analyze, so let’s translate our rules into this format:

Syntax- Compressed vs Unique Ruleset (1)


Importantly, we can conceive of these unique rules as directions to construct a tree. We can conceive of the sentence “Dogs chase cats” as:

Syntax- Simple Tree (1)

Sentences are trees. These trees are not merely used to verify grammatical correctness. They also play a role in speech production, which transforms the language of thought (Mentalese) into natural language (e.g., English). For more on this, see my discussion of the Tripartite Mind.

How can (massively parallel) conscious thought be made into (painfully serial) speech utterances? With syntax! Simply take the concepts you desire to communicate, and construct a tree based on (a common set of) syntactical rules.


Tree construction provides much more clarity on the phenomena of wordplay (linguistic ambiguity). Consider the sentence “I shot a wolf in my pajamas”. Was the gun fired while you were wearing pajamas? Or was the wolf dressed in pajamas?

Syntax- Multiple Interpretation Ambiguity

Both interpretations agree on parts of speech (colors). It is the higher-order structure that admits multiple choices. In practice, semantics constrain syntax: we tend to select the interpretation that feels the most intuitive.

The Sociology of Linguistics

The above presentation uses a simple grammar, for pedagogic reasons. I will at some point explain the popular X’ theory (pronounced “X bar”), which explores similarities between different phrase structures (e.g., NP vs PP). Indeed, there is a wide swathe of possible grammars that we will explore.

Syntax- Sociology of Linguistic Research

Generative grammar is part of the Symbolist tribe of machine learning. As such, this field has rich connections with algebra, production systems, and logic. For example, propositional logic was designed as the logic of sentences; predicate logic is the logic of phrases.

Other tribes besides the Symbolists care about language and grammar, of course. Natural Language Processing (NLP) and computational linguistics have been heavily influenced by the Bayesian tribe, and use probabilistic grammars (e.g., PCFGs).

More recently, the Connectionist tribe (and deep learning technologies) are taking a swing at producing language. In fact, I suspect neural network interpretability will only be achieved once a Connectionist account of language production has matured.


  • Language can be understood via syntax (structure) and semantics (meaning).
  • Syntax requires delineating parts of speech (e.g., nouns vs verbs).
  • Parts of speech occur in patterns called phrases. We can express these patterns as the rules of syntax.
  • Sentences are trees. Syntax rules are instructions for tree construction.
  • Sentence-trees provide insight into problems like sentence ambiguity.

For more resources on syntax trees, I recommend this lecture, this website, and this Youtube channel.

Until next time.

An Introduction to Probability Theory

Part Of: Statistics sequence
Related To: An Introduction to Set Theory
Content Summary: 400 words, 4 min read.

“Probability theory is nothing but common sense reduced to calculation.” – Laplace

Introducing Probability Theory

Probability theory, as formulated by Andrey Kolmogorov in 1933, has two ingredients:

  1. A space which defines the mathematical objects (“the nouns”)
  2. Axioms which define the mathematical operations (“the verbs”)

A probability space is a 3-tuple (Ω,𝓕,P):

  1. Sample Space (Ω): A set of possible outcomes, from one or more events. Outcomes in Ω must be mutually exclusive and collectively exhaustive.
  2. σ-Algebra (𝓕). A collection of event groupings, or subsets. If Ω is countable, this can simply be the power set, otherwise a Borel algebra is often used.
  3. Probability Measure Function (P). A real-valued function P: 𝓕 → [0, 1] which maps from events to real numbers.

The Kolmogorov axioms provide “rules of behavior” for the residents of probability space:

  1. Non-negativity: probabilities can never be negative, P(E) ≥ 0 for every event E.
  2. Unitarity: the probability of the whole sample space is 1.0, P(Ω) = 1 (“something has to happen”).
  3. Sigma Additivity: the probability of a countable union of mutually exclusive events equals the sum of their individual probabilities.
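As a sanity check, we can build a small probability space and test the axioms by brute force. The sketch below assumes a single roll of a fair die, uses the power set as the σ-algebra, and only checks additivity for pairs of disjoint events (the finite case of sigma additivity). The names `power_set` and `P` are illustrative.

```python
from fractions import Fraction
from itertools import chain, combinations

omega = frozenset(range(1, 7))   # sample space: one roll of a die

def power_set(s):
    """All subsets of s, each as a frozenset."""
    items = list(s)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))
    return [frozenset(c) for c in subsets]

events = power_set(omega)        # sigma-algebra: all 2^6 subsets

def P(event):
    """Uniform probability measure for a fair die."""
    return Fraction(len(event), len(omega))

non_negative = all(P(e) >= 0 for e in events)
unitary = P(omega) == 1
# Additivity, checked on every pair of disjoint events.
additive = all(P(a | b) == P(a) + P(b)
               for a in events for b in events if not (a & b))
print(non_negative, unitary, additive)
```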


Random Variables

A random variable is a real-valued function X: Ω → ℝ. A random variable is a function, but not a probability function. Rather, instantiating a random variable X = x defines the subset of outcomes ω ∈ Ω such that X(ω) = x. Thus X = x picks out the preimage of x under X. Variable instantiation thus provides a language to select groups of outcomes from Ω.

Random variables with discrete outcomes (finite or countably infinite Ω) are known as discrete random variables. We can define probability mass functions (PMFs) such that

f_X(x) = P(X=x) = P( { \omega \in \Omega : X(\omega) = x } )
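The definition reads almost directly as code. As an illustrative example (my choice, not from the text above), take Ω to be the 36 equally likely outcomes of rolling two dice, and let X be their sum.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes

def X(outcome):
    """Random variable: the sum of the two dice."""
    return sum(outcome)

def pmf(x):
    """f_X(x) = P({w in omega : X(w) = x})."""
    preimage = [w for w in omega if X(w) == x]
    return Fraction(len(preimage), len(omega))

print(pmf(7))                                  # prints 1/6
print(sum(pmf(x) for x in range(2, 13)))       # prints 1
```

Note that `pmf` never looks at probabilities of individual sums directly; it only measures the size of the preimage that X = x selects.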

In contrast, continuous random variables have continuous outcomes (uncountable Ω). For this class of variable, the probability of any particular outcome is zero. Instead, we must define probabilities against a particular interval. The probability of 5.0000000… inches of snow is 0%; it is more meaningful to discuss the probability of 5 ± 0.5 inches of snowfall. Thus, we define probability density functions (PDFs) such that:

P[a \leq X \leq b] = \int_a^b f_X(x) dx

We can summarize discrete PMFs and continuous PDFs in the following graphic:


Marginal Probabilities

Consider two events, A and B ⊆ Ω. Several operators may act on these events, which parallel similar devices in Boolean algebra and set theory.


Suppose we want to know the probability of either A or B occurring. For this, we rely on the Set Combination Theorem:


Union involves subtracting the intersection; else the purple region is counted twice. In our post on set theory, we saw this same idea expressed as the inclusion-exclusion principle (Definition 13).
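In symbols, P(A ∪ B) = P(A) + P(B) − P(A ∩ B). A quick check on a fair die, with two events chosen purely for illustration:

```python
from fractions import Fraction

omega = frozenset(range(1, 7))    # one roll of a fair die
A = frozenset({2, 4, 6})          # "the roll is even"
B = frozenset({4, 5, 6})          # "the roll is greater than three"

def P(event):
    """Uniform probability measure on the die."""
    return Fraction(len(event), len(omega))

lhs = P(A | B)                    # probability of the union
rhs = P(A) + P(B) - P(A & B)      # inclusion-exclusion
print(lhs, rhs)                   # prints 2/3 2/3
```

Without the subtraction, the outcomes 4 and 6 (which lie in both events) would be counted twice.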


This first post in a two-part series explored the first six concepts of probability theory. Next time, we will learn about concepts 7-12.


These definitions and theorems are the cornerstone upon which much reasoning is built. It pays to learn them well.


Codes and Communication

Part Of: Information Theory sequence
Content Summary: 1000 words, 10 min read

History of Communication Systems

Arguably, three pillars of modernity are: industrialization, democratic government, and communication technology. Today, we examine the last of these.

Before 1860, long-distance communication required travel. This made communication across large nations quite challenging. Consider, for example, the continental United States. In 1841, it took four months for the news of the death of President Harrison to reach Los Angeles.

The Pony Express (a mail service built on horsepower) improved wait times to ten days. But it was the telegraph that changed the game. The key idea was to send messages not on paper, but rather through voltage spikes in electric cables. Electrical pulses travel at near the speed of light.

In 1861, the first transcontinental cable was completed, and near-instantaneous communication became possible. The Pony Express closed its doors two days later.

It is hard to overstate the impact of this technology. These advances greatly promoted information sharing, economic development, and improved governance.

By 1891, thousands of miles of cable had been laid underwater. These pipelines have only become more numerous and powerful over the years. Without them, the Internet would simply be impossible.


Today, we strive to understand the maths of communication. 

Understanding Communication

We start with the basics.

What is communication? The transmission of linguistic information.  

What is language? A shared system of reference communicated through symbols.

References (e.g., words) are functions that map a symbol to an aspect of the physical world. References can denote both objects and actions.

Consider the set of all finite strings over an alphabet (all possible sequences of letters). Words form a subset of this set (a formal language over the alphabet).

Symbol recognition is medium independent. For example, a word can be expressed either through writing (graphemes) or spoken language (phonemes).


References are the basis of memory. They together build representations of the physical world.

All complex nervous systems construct references. Some animals can communicate (share references). Only humans do so robustly, via syntax.

Semantic interpretations are not restricted to biology. Computers can refer as well. Reference is made possible by symbol grounding.

As the substrate of reference, symbols are the basis of computation. All answerable questions can be solved by a Turing machine.

Semantic aspects of communication are irrelevant to the engineering problem. Coding theory studies symbol sets (alphabets) directly.

Comparing Alphabets

How to compare languages? Let’s find out!

There are 26 symbols in the English alphabet. How many possible three-letter words are there? The answer is 26^3 = 17,576 possible words. More generally:

Possible Messages (M) = Alphabet Size (a) ^ Number of Symbols (X)

M = a^X

\log(M) = X \log(a)

Information is the selection of specific words (“red”) from the space of possible words.

We might be tempted to associate information with M. But we desire information to scale linearly with length. Two books should contain twice as much information as one. So we say information is \log(M).

I(X, a) = X \log(a)

The logarithmic base is not very important in this function. Suppose we choose some base b. Different alphabets can then be compared by measuring them all in that same base.

Base Conversion: \log_b(a) = \log(a) / \log(b)

I(X, a) = X \log_b(a)

I(X) = K X where K equals \log_b(a)

I(X) is known as Hartley information, the special case of Shannon information in which every message is equally likely.

We can compare the expressive power of different alphabets. The modern Hawaiian alphabet, for example, has 13 letters. So there are only 13^3 = 2,197 possible three-letter Hawaiian words. Measured in base 13, the information provided by X letters of these respective languages is:

I(X_{hawaiian}) = X \log_{13}(13) = X

I(X_{english}) = X \log_{13}(26)

I(X_{english}) / I(X_{hawaiian}) = \log_{13}(26) = 1.270238

We expect an English word to carry 27% more information than a Hawaiian word of the same length, on average. And indeed, this is precisely what we find:

With 3 English letters: 26^3 = 17,576 possible words
With 3.81 Hawaiian letters: 13^(3*1.270238) = 17,576 possible words
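These numbers are easy to reproduce. The sketch below is just arithmetic; `possible_messages` is a hypothetical helper name.

```python
import math

def possible_messages(alphabet_size, length):
    """Number of distinct strings of a given length over an alphabet."""
    return alphabet_size ** length

# How many Hawaiian letters carry as much information as one English letter.
ratio = math.log(26, 13)

print(possible_messages(26, 3))    # three-letter English strings
print(possible_messages(13, 3))    # three-letter Hawaiian strings
print(round(ratio, 6))
print(round(13 ** (3 * ratio)))    # matches the English count
```

The last line confirms that roughly 3.81 Hawaiian letters span the same number of messages as 3 English letters.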

Translating Between Codes

How does one translate between languages? Consider the word “red”. In Hawaiian, this word is “ula’ula”. We might construct the following function:

  • r → ula’
  • e → ul
  • d → a

But this fails to generalize. The Hawaiian word for rice is “laiki”, which does not begin with a ‘u’.

In general, for natural languages any function f: A_E → A_H (from the English alphabet to the Hawaiian alphabet) is impossible. Why? Because words (references) map to physical reality in arbitrary ways. Two natural languages are too semantically constrained to afford a simple alphabet-based translation.


Alphabet-based translations are possible, however, if you use a thin language. Thin languages only refer when converted back into their host language. Binary is a classic example of a thin language. It has the smallest possible alphabet (size two).


An encoding is a function of type f: A_E → A_H. For example, consider ASCII. This simple encoding is at the root of most modern technologies (including UTF-8, which you are using to view this webpage):
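A short sketch of such an encoding in action, using Python's built-in ASCII codec to map letters onto the binary alphabet and back. The helper names `to_binary` and `from_binary` are made up for this example.

```python
def to_binary(text):
    """Encode ASCII text as space-separated 8-bit binary codes."""
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))

def from_binary(bits):
    """Decode space-separated 8-bit binary codes back into ASCII text."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("ascii")

encoded = to_binary("red")
print(encoded)                # prints 01110010 01100101 01100100
print(from_binary(encoded))   # prints red
```

The binary string means nothing by itself; it only refers once it is decoded back into its host language.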


Noise and Discriminability

A communication system has five components: source, transmitter, channel, receiver, and destination.

Source and destination typically share a common system of reference. Imagine two people with the same interpretation of the word “red”, or two computers with the same interpretation of the instruction “lb” (load byte).

Transmitter and receiver also tend to play reciprocal roles. Information is exchanged through the channel (e.g., sound waves, cable).


Receivers reconstruct symbols from the physical medium. Noise causes decoding errors.

How can the transmitter protect the message from error? By maximizing the physical differences between symbols. This is the discriminability principle.

Communication- Discriminability (1).png

This principle explains why binary is employed by computers and telecommunications. A smaller alphabet improves symbol discriminability, which combats the effect of noise.


  • Language is a shared system of reference communicated through symbols 
  • References are functions that map a symbol to an aspect of the physical world. 
  • Symbol recognition is medium independent
  • Alphabet size determines expressive power (how many messages are possible)
  • An encoding lets you alter (often reduce) a language’s alphabet.
  • Such encodings are often desirable because they protect messages from noise.

An Introduction To Energy

Part Of: Demystifying Physics sequence
Content Summary: 700 words, 7 min reading time.

Energy As Universal Currency

Why does burning gasoline allow a car to move? Chemical reactions and kinetic propulsion seem quite distinct.

How does a magnet pull a nail from the ground? What relation exists between magnetism and gravitational pull?

What must occur for a nuclear reactor to illuminate a light bulb? What connection is there between nuclear physics and light waves?

Energy is the hypothesis of a hidden commonality among the above phenomena. There are many forms of energy: kinetic, electric, chemical, gravitational, magnetic, radiant. But these forms are expressions of a single underlying phenomenon.

A single object may possess many different energy forms simultaneously:

A block of wood thrown into the air will possess kinetic energy because of its motion, gravitational potential energy because of its height above the ground, chemical energy in the wood (which can be burned), heat energy depending on its temperature, and nuclear energy in its atoms (this form is not readily available from our block of wood, but the other forms may be).

Non-physicists worry that physics involves memorizing a giant catalogue of phenomena, each discovered by some guy who got an equation named after him. Energy is the reason why physics is not “stamp collecting”. It allows us to seamlessly switch between different phenomena.

Change As Energy Transformation

The only thing that is constant is change.

Certain Greek philosophers were obsessed with change. Physics gives us a language that formalizes these intuitions. Consider the following.

  • Energy is the capacity to do work; that is, initiate processes.
  • Processes (i.e., work) involve continuous and controlled actions, or changes.

A confusing diversity of phenomena can produce force (the ability to accelerate things), but they all share the same blood. Change is energy transformation.

We can play the “where does the energy come from” game on literally any event in the physical universe:


A Worked Example

Let’s get concrete, and go through a simple illustration of work as energy transformation. 

Our example involves a ball sliding down an incline.


To get our bearings, let’s calculate the acceleration experienced by the ball:


To demonstrate conservation of energy, we first need to solve for the ball’s final velocity (v) at the bottom of the ramp. Recall the lesson of kinematic calculus, that displacement, velocity, and acceleration are intimately related:

x(t) = \int v(t) \, dt = \iint a(t) \, dt \, dt

a(t) = v'(t) = x''(t)

We can use these formulae to calculate final velocity (for a tutorial see here).


So the ball’s final velocity will be 6.26 meters per second (\sqrt{4g} m/s). Now, recall the classical definitions of kinetic and potential (gravitational) energy, which are KE = \frac{1}{2}mv^2 and PE = mgh.

If conservation of energy is true, then we should expect the following to hold:

(KE + PE)_{final} - (KE + PE)_{initial} = 0

Is this in fact the case? Yes!

m[(\frac{(\sqrt{4g})^2}{2} + 0) - (0 + 2g)] = m[2g - 2g] = 0

Total energy of the ball stays the same across these two points. Conservation of energy can also be demonstrated at any other time in the ball’s journey. In fact, we can show that potential energy is smoothly converted to kinetic energy (image credit Physics Classroom).
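We can tabulate this conversion numerically. The sketch below assumes a frictionless ramp of height 2 m (the height implied by a final speed of \sqrt{4g}), and ignores any rotation of the ball. The mass of 1 kg and the 30 degree incline are hypothetical values; the speed is obtained from kinematics (v^2 = 2ad along the ramp) rather than from energy conservation itself.

```python
import math

G = 9.8                    # gravitational acceleration, m/s^2
HEIGHT = 2.0               # ramp height in m, implied by v = sqrt(4g)
MASS = 1.0                 # kg, hypothetical
THETA = math.radians(30)   # incline angle, hypothetical

def energies(h):
    """Return (KE, PE) in joules when the ball is at height h."""
    a = G * math.sin(THETA)               # acceleration along the ramp
    d = (HEIGHT - h) / math.sin(THETA)    # distance travelled along the ramp
    v_squared = 2 * a * d                 # kinematics, starting from rest
    return 0.5 * MASS * v_squared, MASS * G * h

for h in (2.0, 1.5, 1.0, 0.5, 0.0):
    ke, pe = energies(h)
    print(f"h={h:.1f} m  KE={ke:6.2f} J  PE={pe:6.2f} J  total={ke + pe:6.2f} J")

final_speed = math.sqrt(2 * energies(0.0)[0] / MASS)
print(f"final speed: {final_speed:.2f} m/s")
```

The total column stays constant while energy migrates from the PE column to the KE column, and the incline angle cancels out of the result.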


Why is the ball transforming its energy PE \rightarrow KE? Because it is experiencing a force.

Cosmological Interpretation

Just as space and time are expressions of a single spacetime fabric, Einstein also demonstrated mass-energy equivalence. Mass is simply a condensed form of energy, capable of being released in e.g., nuclear explosions. Do not confuse mass and matter. Mass and energy are (interchangeable) properties of matter.

So there are two components of reality: energymass and spacetime. Spacetime bends for energymass. Energy is conserved, but has many faces. The laws of physics (quantum field theory and general relativity) describe the interactions between energymass and spacetime. And since we know that energy is conserved, all there is today was all there was at the beginning of time.

If the amount of energy contained in our universe doesn’t change, what is this quantity? How much energy is there? One strong candidate theory is zero. The zero-energy universe hypothesis rests on the realization that gravitational energy is negative, and claims that it counterbalances all other forms.


Next time, we will explore the distinction between usable vs inaccessible energy. Until then.

An Introduction to Prospect Theory

Part Of: [Neuroeconomics] sequence
Content Summary: 1500 words, 15 min reading time


Decisions are bridges between perception and action. Not all decisions are cognitive. Instead, they occur at all levels of the abstraction hierarchy, and include things like reflexes. 

Theories of decision tend to constrain themselves to cognitive phenomena. They come in two flavors: descriptive (“how does it happen”) and normative (“how should it happen”).

Decision making often occurs in the context of imperfect knowledge. We may use probability theory as a language to reason about uncertainty. 

Let risk denote variance in the probability distribution of possible outcomes. Risk can exist regardless of whether a potential loss is involved. For example, a prospect that offers a 50-50 chance of paying $100 or nothing is more risky than a prospect that offers $50 for sure – even though the risky prospect entails no possibility of losing money.

Today, we will explore the history of decision theory, and the emergence of prospect theory. As the cornerstone of behavioral economics, prospect theory provides an important theoretical surface to the emerging discipline of neuroeconomics.

Maximizing Profit with Expected Value

Decision theories date back to the 17th century, and a correspondence between Pascal and Fermat. There, consumers were expected to maximize expected value (EV), which is defined as probability p multiplied by outcome value x.

EV = px

To illustrate, consider the following lottery tickets:


Suppose each ticket costs 50 cents, and you have one million dollars to spend. Crucially, it doesn’t matter which ticket you buy! Each of these tickets has the same expected value: $1. Thus, it doesn’t matter if you spend the million dollars on A, B, or C; each leads to the same expected profit.

The above tickets have equal expected value, but they do not have equal risk. We call people who prefer choice A risk averse; whereas someone who prefers C is risk seeking.
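To make the distinction between expected value and risk concrete, here is a minimal sketch. The three tickets below are hypothetical stand-ins (the original ticket table is not reproduced here), chosen only so that each has an expected value of $1. For tickets with several outcomes, expected value is the sum of px over the outcomes.

```python
# Hypothetical tickets: each is a list of (probability, payout) pairs.
# These numbers are illustrative assumptions, not the article's original table.
TICKETS = {
    "A": [(1.0, 1.0)],                   # $1 for sure
    "B": [(0.1, 10.0), (0.9, 0.0)],      # 10% chance of $10
    "C": [(0.01, 100.0), (0.99, 0.0)],   # 1% chance of $100
}

def expected_value(lottery):
    """Sum of probability * payout over all outcomes."""
    return sum(p * x for p, x in lottery)

def variance(lottery):
    """Risk, measured as the variance of the payout distribution."""
    ev = expected_value(lottery)
    return sum(p * (x - ev) ** 2 for p, x in lottery)

for name, lottery in TICKETS.items():
    print(name, round(expected_value(lottery), 6), round(variance(lottery), 6))
```

All three tickets print the same expected value, but the variance climbs from A to C. A risk-averse buyer would pick A, a risk-seeking buyer C.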

Introducing Expected Utility

Economic transactions can be difficult to evaluate. When trading an apple for an orange, which is more valuable? That depends on a person’s unique tastes. In other words, value is subjective.

Let utility represent subjective value. We can treat utility as a function u() that operates on objective outcome x. Expected utility, then, is highly analogous to expected value:

EU = pu(x)

Most economists treat utility functions as abstractions: people act as if motivated by a utility function. Neuroeconomic research, however, suggests that utility functions are physically constructed by the brain.

Every person’s utility function may be different. If a person’s utility curve is linear, then expected utility converges onto expected value:

EU \rightarrow EV \mid u(x) = x

Recall, in the above lottery, the behavioral distinction between risk-averse (preferring ticket A) and risk-seeking (preferring C). Well, in practice most people prefer A. Why?

We can explain this behavior by appealing to the shape of the utility curve! A concave utility curve produces risk aversion:

Prospect Theory- Utility Convexity & Risk Aversion

In the above, we see the first $50 (first vertical line) produces more utility (first horizontal line) than the second $50.

Intuitively, the first $50 is needed more than the second $50. The larger your wealth, the less your need. This phenomenon is known as diminishing marginal utility.
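Here is a small numerical sketch of this argument, assuming a square-root utility function (an illustrative choice of curve, not one the text specifies). It compares the sure $50 with the 50-50 chance of $100 from the earlier risk example.

```python
import math

def u(x):
    """Assumed utility curve: each extra dollar adds less than the last."""
    return math.sqrt(x)

def expected_utility(lottery):
    """Sum of probability * u(outcome) over (probability, outcome) pairs."""
    return sum(p * u(x) for p, x in lottery)

sure_thing = [(1.0, 50.0)]
gamble = [(0.5, 100.0), (0.5, 0.0)]

# Utility added by the first $50 versus the second $50.
first_50 = u(50) - u(0)
second_50 = u(100) - u(50)

print("first $50 worth more than second:", first_50 > second_50)
print("sure thing preferred to gamble:",
      expected_utility(sure_thing) > expected_utility(gamble))
```

Both prospects have the same expected value ($50), yet under this curve the sure thing carries more expected utility, which is exactly the risk-averse preference.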

Neoclassical Economics

In 1947, von Neumann and Morgenstern formulated a set of axioms that are both necessary and sufficient for representing a decision-maker’s choices by the maximization of expected utility.

Specifically, if you assume an agent’s preference set accommodates these axioms…

1. Completeness. People have preferences over all lotteries.

\forall L_1, L_2 \in L either L_1 \leq L_2 or L_1 \geq L_2 or L_1 = L_2

2. Transitivity. Preferences are expressed consistently.

\forall L_1, L_2, L_3 \in L if L_1 \leq L_2 and L_2 \leq L_3 then L_1 \leq L_3

3. Continuity. Preferences are expressed as probabilities.

\forall L_1, L_2, L_3 \in L, \exists \alpha, \beta  s.t. L_1 \geq L_2 \geq L_3 iff \alpha L_1 + (1-\alpha)L_3 \geq L_2 \geq \beta L_1 + (1 - \beta)L_3

4. Independence of Irrelevant Alternatives (IIA). Binary preferences don’t change by injecting a third lottery.

… then those preferences always maximize expected utility.

L_1 \geq L_2 iff \sum{p_1 u(x_1)} \geq \sum{p_2 u(x_2)}

The above axioms constitute expected utility theory, and form the cornerstone for neoclassical economics.  Expected utility theory bills itself as both a normative and descriptive theory: that we understand human decision making, and have a language to explain why it is correct.

Challenges To Independence Axiom

In the 1970s, expected utility theory came under heavy fire for failing to predict human behavior. The emerging school of behavioral economics gathered empirical evidence that the von Neumann-Morgenstern axioms were routinely violated in practice, especially the Independence Axiom (IIA).

For example, the Allais paradox asks our preferences for the following choices:


Most people prefer A (“certain win”) and D (“bigger number”). But these preferences are inconsistent, because C = 0.01A and D = 0.01B. The independence axiom instead predicts that A ≽ B if and only if C ≽ D.

The Decoy effect is best illustrated with popcorn:


Towards a Value Function

Concurrently to these criticisms of the independence axiom, the heuristics and biases literature (led by Kahneman and Tversky) began to discover new behaviors that demanded explanation:

  • Risk Aversion. In most decisions, people tend to prefer smaller variance in outcomes.
  • Loss Aversion. Everyone prefers gains over losses, of course. But losses are also felt more intensely than gains of equal magnitude.
  • The Endowment Effect. Things you own are intrinsically valued more highly. Framing decisions as gains or as losses affects choice behavior.

Prospect Theory- Behavioral Effects Economic Biases (1)

Each of these behavioral findings violates the Independence Axiom (IIA), and cumulatively they demanded a new theory. And in 1979, Kahneman and Tversky put forward prospect theory to explain all of the above effects.

Their biggest innovation was to rethink the utility function. Do you recall how neoclassical economics appealed to the concavity of u(x) to explain risk aversion? Prospect theory takes this approach yet further, and seeks to explain all of the above behaviors using a more complex shape of the utility function.

Let value function \textbf{v(x)} represent our updated notion of utility.  We can define the expected prospect \textbf{EP} as probability multiplied by the value function:

EP = pv(x)

Terminology aside, the three theories differ only in the shape of their outcome function.

Prospect Theory- Evolution of Utility Function (3)

Let us now look closer at the shape of v(x):

Prospect Theory- Value Function.png

This shape allows us to explain the above behaviors:

The endowment effect captures the fact that we value things we own more highly. The reference point in v(x), where x = 0, captures the status quo. Thus, the reference point allows us to differentiate gains and losses, thereby producing the endowment effect.

Loss aversion captures the fact that losses are felt more strongly than gains.  The magnitude of v(x) is larger in the losses dimension. This asymmetry explains loss aversion.

We have already explained risk aversion by the concavity of the utility function u(x). v(x) remains concave for gains. Thus, we have retained our ability to explain risk aversion in situations of possible gains. For losses, v(x) is convex, which predicts risk seeking.
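One common way to write down a curve with these properties is the power-function form from Tversky and Kahneman's later (1992) work. The sketch below assumes that form and their published median estimates (α = 0.88 for curvature, λ = 2.25 for loss aversion). Neither appears in the text above, so treat both as illustrative assumptions.

```python
ALPHA = 0.88    # assumed curvature (diminishing sensitivity)
LAMBDA = 2.25   # assumed loss-aversion coefficient

def v(x):
    """Value of a gain or loss x, measured from the reference point x = 0."""
    if x >= 0:
        return x ** ALPHA
    return -LAMBDA * (-x) ** ALPHA

# Reference point: no change from the status quo has zero value.
print("v(0) =", v(0))

# Loss aversion: losing $100 hurts more than winning $100 pleases.
print("loss aversion:", abs(v(-100)) > v(100))

# Curvature for gains: the first $50 gained adds more value than the second.
print("first gain larger:", v(50) - v(0) > v(100) - v(50))

# Mirror-image curvature for losses: the first $50 lost hurts more than the second.
print("first loss larger:", v(-50) - v(0) < v(-100) - v(-50))
```

The single kink at x = 0 carries both the reference point and the gain/loss asymmetry. The curvature on either side is what drives risk aversion for gains and risk seeking for losses.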

Towards a Weight Function

Another behavioral discovery, however, immediately put prospect theory in doubt:

  • The Fourfold Pattern. For situations that involve very high or very low probabilities, participants often switch their approaches to risk.

To be specific, here are the four situations and their resultant behaviors:

  1. Fear of Disappointment. With a 95% chance to win $100, most people are risk averse.
  2. Hope To Avoid Loss. With a 95% chance to lose $100, most people are risk seeking.
  3. Hope Of Large Gain. With a 5% chance to win $100, most people are risk seeking.
  4. Fear of Large Loss. With a 5% chance to lose $100, most people are risk averse.

Crucially, v(x) fails to predict this behavior. As we saw in the previous section, it predicts risk aversion for gains, and risk seeking for losses:

Prospect Theory- Fourfold Pattern Actual vs Expected (2)

Failed predictions are not a death knell to a theory. Under certain conditions, they can inspire a theory to become stronger!

Prospect theory was improved by incorporating a more flexible weight function.

EP = pv(x) \rightarrow EP = w(p)v(x)

Where w(p) has the following shape:

Prospect Theory- Weight Function (1)

These are in fact two weight functions:

  1. Explicit weights represent probabilities learned through language; e.g., when reading the sentence “there is a 5% chance of reward”.
  2. Implicit weights represent probabilities learned through experience, e.g., when the last 5 out of 100 trials yielded a reward.

This change adds some mathematical muscle to the ancient proverb:

Humans don’t handle extreme probabilities well.

And indeed, the explicit weight function successfully recovers the fourfold pattern:
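The fourfold pattern can also be checked numerically. The sketch below assumes the one-parameter weighting function and parameter estimates from Tversky and Kahneman (1992), and applies a single γ = 0.61 to both gains and losses for simplicity. None of these specifics come from the text above. Each gamble is compared against a sure amount with the same expected value.

```python
ALPHA, LAMBDA, GAMMA = 0.88, 2.25, 0.61   # assumed parameters

def v(x):
    """Assumed value function: bends one way for gains, the other for losses."""
    if x >= 0:
        return x ** ALPHA
    return -LAMBDA * (-x) ** ALPHA

def w(p):
    """Assumed weight function: overweights small p, underweights large p."""
    return p ** GAMMA / (p ** GAMMA + (1 - p) ** GAMMA) ** (1 / GAMMA)

def attitude(p, x):
    """Compare a p chance of x against a sure payment of p * x."""
    gamble = w(p) * v(x)
    sure = v(p * x)
    return "risk seeking" if gamble > sure else "risk averse"

for p, x in [(0.95, 100), (0.95, -100), (0.05, 100), (0.05, -100)]:
    print(f"{p:.0%} chance of {x:+d}: {attitude(p, x)}")
```

Replacing w(p) with plain p in `attitude` reproduces the failed prediction: risk aversion for both gains and risk seeking for both losses, regardless of probability.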



Today we have reviewed theories of expected value, expected utility (neoclassical economics), and prospect theory. Each theory corresponds to a particular set of conceptual commitments, as well as a particular formula:

EV = px

EU = pu(x)

EP = w(p)v(x)

However, we can unify these into a single value formula V:

V = w(p)v(x)

In this light, EV and EU have the same structure as prospect theory. Prospect theory distinguishes itself by using empirically motivated shapes:

Prospect Theory- Evolution of Both Functions

With these tools, prospect theory successfully recovers a wide swathe of economic behaviors.


Until next time.

Markov Decision Processes

Part Of: Reinforcement Learning sequence
Followup To: An Introduction To Markov Chains
Content Summary: 900 words, 9 min read


Today, we turn our gaze to Markov Decision Processes (MDPs), a decision-making environment which supports our propensity to learn from good and bad outcomes. We represent outcome desirability with a single number, R. This value is used to refine action selection: given a particular situation, what action will maximize expected reward?

In biology, we can describe the primary work performed by an organism as maintaining homeostasis: preserving metabolic energy reserves, body temperature, etc. in a widely varying world.

Cybernetics provides a clear way of conceptualizing biological reward. In Neuroendocrine Integration, we discussed how brains must respond both to internal and external changes. This dichotomy expresses itself as two perception-action loops: a visceral body-oriented loop, and a cognitive world-centered one.

Rewards are computed by the visceral loop. To a first approximation, rewards encode progress towards homeostasis. Food is perceived as more rewarding when the body is hungry; this is known as alliesthesia. Reward information is delivered to the cognitive loop, which helps refine its decision making.

Reinforcement Learning- Reward As Visceral Efferent

Extending Markov Chains

Recall that a Markov Chain contains a set of states S, and a transition model P. A Markov Decision Process (MDP) extends this device, by adding three new elements.

Specifically, an MDP is a 5-tuple (S, P, A, R, ɣ):

  • A set of states s ∈ S
  • A transition model P_a(s’ | s)
  • A set of actions a ∈ A
  • A reward function R(s, s’)
  • A discount factor ɣ

To illustrate, consider GridWorld. In this example, every location in this two-dimensional grid is a state, for example (1,0). State (3,0) is a desirable location: R(s(3,0)) = +1.0, but state (3,1) is undesirable, R(s(3,1)) = -1.0. All other states are neutral.

GridWorld supports four actions, or movements: up, down, left, and right.  However, locomotion is imperfect: if Up is selected, the agent will only move up with 80% probability. The remaining 20% of the time it will slip left or right instead (10% each). Finally, attempting to move into a forbidden square will simply return the agent to its original location (“hitting the wall”).
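The movement rule can be sketched as a transition model. The grid dimensions and wall location below are assumptions made for illustration (a 4×3 grid with one forbidden square at (1,1), and y growing downward so that (3,0) sits above (3,1)). The text itself fixes only the slip probabilities and the wall-bounce rule.

```python
WIDTH, HEIGHT = 4, 3    # assumed grid size
WALLS = {(1, 1)}        # assumed forbidden square

# Assumes y grows downward, so "up" decreases y.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
# The two directions perpendicular to each action.
SIDEWAYS = {"up": ("left", "right"), "down": ("left", "right"),
            "left": ("up", "down"), "right": ("up", "down")}

def step(state, direction):
    """Deterministic move; stay put when hitting a wall or the grid edge."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    inside = 0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT
    return nxt if inside and nxt not in WALLS else state

def transition(state, action):
    """Return P_a(s' | s) as a dict mapping next state to probability."""
    probs = {}
    side_a, side_b = SIDEWAYS[action]
    for direction, p in [(action, 0.8), (side_a, 0.1), (side_b, 0.1)]:
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

print(transition((2, 1), "up"))
```

Under these assumptions, selecting Up from (2,1) reaches (2,0) with probability 0.8, bounces off the wall and stays put with probability 0.1, and slips right into the punished square (3,1) with probability 0.1.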

Reinforcement Learning- Example MDP Gridworld

The core problem of MDPs is to find a policy (π), a function that specifies the agent’s response to all possible states. In general, policies should strive to maximize reward, e.g., something like this:

Reinforcement Learning- Example MDP Policy

Why is the policy at (2,2) Left instead of Up? Because (2,1) is dangerous: despite selecting Up, there is a 10% chance that the agent will accidentally move Right, and be punished.

Let’s now consider an environment with only three states A, B, and C.  First, notice how different policies change the resultant Markov Chain:


This observation is important. Policy determines the transition model.

Towards Policy Valuation V(s)

An agent seeks to maximize reward. But what does that mean, exactly?

Imagine an agent selects 𝝅_1. Given the resultant Markov Chain, we already know how to use matrix multiplication to predict future locations S_t. The predicted reward P_t is simply the dot product of expected location and the reward function.

P_t = S_t \cdot R
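As a sketch of this computation, the three-state chain below uses transition probabilities inferred from the numbers quoted later in this post: A stays put with probability 0.2 and advances to B with probability 0.8, B behaves likewise towards C, and C is absorbing and is the only rewarded state. The original figure is not reproduced here, so treat the matrix as an assumption.

```python
# Rows: current state, columns: next state, in the order A, B, C.
# Inferred from the figures quoted in the text; an assumption, not the original.
T = [
    [0.2, 0.8, 0.0],   # from A
    [0.0, 0.2, 0.8],   # from B
    [0.0, 0.0, 1.0],   # from C (absorbing)
]
R = [0.0, 0.0, 1.0]    # reward for occupying A, B, C

def advance(dist, T):
    """One step of the chain: row vector times transition matrix."""
    n = len(T)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]

def predicted_rewards(start, T, R, steps):
    """Return [P_0, P_1, ...], where P_t is the dot product of S_t and R."""
    dist, out = list(start), []
    for _ in range(steps):
        out.append(sum(s * r for s, r in zip(dist, R)))
        dist = advance(dist, T)
    return out

# Start in state A with certainty.
print([round(p, 3) for p in predicted_rewards([1.0, 0.0, 0.0], T, R, 4)])
# -> [0.0, 0.0, 0.64, 0.896]
```

Starting from B instead (`[0.0, 1.0, 0.0]`) yields 0, 0.8, 0.96, the same coefficients that appear in the valuations of B further on.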


We might be tempted to define the value function V(S) as the sum of all predicted future rewards:

V_0(S) = P_0 + P_1 + P_2 + P_3 + \dots = \sum_{k=0}^{\infty}{P_k}

However, this approach is flawed.  Animals value temporal proximity: all else equal, we prefer to obtain rewards quickly. This is temporal discounting: as rewards are further removed from the present, their value is discounted. 

In reinforcement learning, we implement temporal discounting with the gamma parameter: rewards that are k timesteps away are multiplied by the exponential discount factor \gamma^k. The value function becomes:

V_0(S) = P_0 + \gamma P_1 + \gamma^2 P_2 + \gamma^3 P_3 + \dots = \sum_{k=0}^{\infty}{\gamma^k P_k}

Without temporal discounting, V(s) can approach infinity. But exponential discounting (with \gamma < 1) ensures V(s) equals a finite value. Finite valuations promote easier computation and comparison of state evaluations. For more on temporal discounting, and an alternative to the RL approach, see An Introduction to Hyperbolic Discounting.
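Here is a quick sketch of why discounting keeps valuations finite, assuming a constant predicted reward of 1 at every timestep and γ = 0.9 (both illustrative values).

```python
def value(rewards, gamma):
    """Discounted sum of a list of predicted rewards P_0, P_1, ..."""
    return sum(gamma ** k * p for k, p in enumerate(rewards))

for horizon in (10, 100, 1000):
    rewards = [1.0] * horizon
    undiscounted = value(rewards, 1.0)   # grows without bound as the horizon grows
    discounted = value(rewards, 0.9)     # approaches 1 / (1 - 0.9) = 10
    print(horizon, undiscounted, round(discounted, 4))
```

The undiscounted total simply tracks the horizon, while the discounted total levels off near 10 no matter how far ahead we look.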

Intertemporal Consistency

In our example, at time zero our agent starts in state A. We have already used linear algebra to compute our P_k predictions. To calculate value, we simply compute \sum{\gamma^k P_k}:

V_0(A) = 0 + 0 + 0.64 \gamma^2 + 0.896 \gamma^3 + \dots

Agents compute V(s) at every time step. At t=1, two valuations are relevant:

V_1(A) = 0 + 0 + 0.64 \gamma^2 + \dots

V_1(B) = 0 + 0.8 \gamma + 0.96 \gamma^2 + \dots


What is the relationship between the value functions at t=0 and t=1? To answer this, we need to multiply each term by \gamma P(X|A), where X is the state being considered at the next time step.

W_1(A) \triangleq \gamma P(A|A)V_1(A) = \gamma 0.2 V_1(A)

W_1(A) = 0 + 0 + (0.2)(0.64)\gamma^3 + \dots


W_1(B) \triangleq \gamma P(B|A)V_1(B) = \gamma 0.8 V_1(B)

W_1(B) = 0 + (0.8)(0.8) \gamma^2 + (0.8)(0.96) \gamma^3 + \dots

Critically, consider the sum X_0 = r_0(A) + W_1(A) + W_1(B):

X_0 = 0 + 0 + 0.64 \gamma^2 + 0.896 \gamma^3 + \dots

MDP- Intertemporal Consistency

Does X_0 look familiar? That’s because it equals V_0(A)! In this way, we have a way of equating a valuation at t=0 and t=1. This property is known as intertemporal consistency.

Bellman Equation

We have seen that V_0(A) = X_0. Let’s flesh out this equation, and generalize to time t.

V_t(s) = r_t(s) + \gamma \sum_{s'}{P(s'|s)V_{t+1}(s')}

This is the Bellman Equation, and it is a central fixture in control systems. At its heart, we define value in terms of both immediate reward and future predicted value. We thereby break up a complex problem into small subproblems, a key optimization technique that can be approached with dynamic programming.
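The equation can be checked on the three-state example. The sketch below reuses transition probabilities inferred from the numbers in the previous section (an assumption, since the figure is not reproduced here), picks γ = 0.9 arbitrarily, and applies the Bellman equation repeatedly until the values stop changing. Because the chain itself does not change over time, the time index on V is dropped.

```python
STATES = ["A", "B", "C"]
# P[s][s2] = P(s2 | s); inferred from the worked example, not given explicitly.
P = {
    "A": {"A": 0.2, "B": 0.8},
    "B": {"B": 0.2, "C": 0.8},
    "C": {"C": 1.0},
}
R = {"A": 0.0, "B": 0.0, "C": 1.0}
GAMMA = 0.9   # assumed discount factor

def backup(V, s):
    """Right-hand side of the Bellman equation for state s."""
    return R[s] + GAMMA * sum(p * V[s2] for s2, p in P[s].items())

def evaluate(tol=1e-12):
    """Apply the Bellman backup to every state until no value moves by tol."""
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {s: backup(V, s) for s in STATES}
        if max(abs(new_V[s] - V[s]) for s in STATES) < tol:
            return new_V
        V = new_V

V = evaluate()
print({s: round(val, 4) for s, val in V.items()})
```

Each sweep only ever looks one step ahead, yet the values it settles on agree with the full discounted sums. This is the sense in which the Bellman Equation breaks one long-horizon problem into small subproblems.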

Next time, we will explore how reinforcement learning uses the Bellman Equation to learn strategies with which to engage its environment (the optimal policy 𝝅). See you then!