# Intro to Regularization

Part Of: Machine Learning sequence
Followup To: Bias vs Variance, Gradient Descent
Content Summary: 1100 words, 11 min read

In Intro to Gradient Descent, we discussed how loss functions allow optimization methods to locate high-performance models.

But in Bias vs Variance, we discussed how model performance isn’t the only thing that matters. Simplicity promotes generalizability.

One way to enhance simplicity is to take the model discovered by gradient descent and manually remove unnecessary parameters.

But we can do better. In order to automate parsimony, we can embed our preference for simplicity into the loss function itself.

But first, we need to quantify our intuitions about complexity.

Formalizing Complexity

Neural networks are often used as classification models over large collections of images. Their complexity tends to correlate with the number of layers. For such models, then, complexity is captured by the number of parameters.

While not used much in the industry, polynomial models are pedagogically useful examples of regression models. Here, the degree of the polynomial expresses the complexity of the model: a degree-eight polynomial has more “bumps” than a degree-two polynomial.

Consider, however, the difference between the following regression models:

$y_A = 4x^4 + 0.0001x^3 + 0.0007x^2 + 2.1x + 7$

$y_B = 4x^4 + 2.1x + 7$

Model A uses five parameters; Model B uses three. But their predictions are, for all practical purposes, identical. Thus, the size of each parameter is also relevant to the question of complexity.

The above approaches rely on the model’s parameters (its “visceral organs”) to define complexity. But it is also possible to rely on the model’s outputs (its “behaviors”) to achieve the same task. Consider again the classification decision boundaries above. We can simply measure the spatial frequency (the “squiggliness” of the boundary) as another proxy towards complexity.

Here, then, are three possible criteria for complexity:

1. Number of parameters
2. Size of parameters
3. Spatial frequency of decision manifold

Thus, operationalizing the definition of “complexity” is surprisingly challenging.

Mechanized Parsimony

Recall our original notion of the performance-complexity quadrant. By defining our loss function exclusively in terms of the residual error, gradient descent learns to prefer accurate models (to “move upward”). Is there a way to induce leftward movement as well?

To have gradient descent respond to both criteria, we can embed them into the loss function. One simple way to accomplish this: addition.

This technique is an example of regularization.

Depending on the application, sometimes the errors are much larger than the parameters, or vice versa. To ensure the right balance between these terms, we typically add a hyperparameter $\lambda$ to the regularized loss function: $J = \|e\|_2 + \lambda \|\theta\|_2$.
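To make this concrete, here is a minimal sketch of gradient descent on an L2-regularized loss. The 1-D dataset, the learning rate, and the value of $\lambda$ are all illustrative assumptions, not prescriptions:

```python
import numpy as np

# Hypothetical 1-D dataset: y = 3x + 1 plus noise (assumed for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 100)

# Regularized loss: J = (1/2) * mean(err^2) + lam * ||theta||^2
lam = 0.1            # hyperparameter balancing error vs complexity
theta = np.zeros(2)  # [intercept, slope]
lr = 0.1

for _ in range(2000):
    err = theta[0] + theta[1] * x - y
    # gradient of the residual term, plus gradient of lam * ||theta||^2
    grad = np.array([err.mean(), (err * x).mean()]) + 2 * lam * theta
    theta -= lr * grad
```

With `lam = 0` the fit lands near the unregularized solution $(1, 3)$; with the penalty active, both parameters shrink toward zero.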

A Geometric Interpretation

Recall Einstein’s insight that gravity is curvature of spacetime. You can envision such curvature as a ball pulling on a sheet. Here is the gravity well of bodies of the solar system:

Every mass pulls on every other mass! Despite the appearance of the above, Earth does “pull on” Saturn.

The unregularized cost function we saw last time creates a convex loss function, which we’ll interpret as a gravity well centered around parameters of best fit. If we replace J with a function that only penalizes complexity, a corresponding gravity well appears, centered around parameters of zero size.

If we keep both terms, we see the loss surface now has two enmeshed gravity wells. If scaled appropriately, the “zero attractor” will pull the most performant solution (here $\theta = (8,7)$) towards a not-much-worse yet simpler model $\theta = (4,5)$.

More on L1 vs L2

Previously, I introduced the L1 norm, aka mean absolute error (MAE):

$\|x\|_1 = (\sum_{i=1}^{n} \lvert x_i\rvert^1)^{1/1}$

Another loss function is the L2 norm, aka root mean squared error (RMSE):

$\|x\|_2 = (\sum_{i=1}^{n} \lvert x_i\rvert^2)^{1/2}$

The L1 and L2 norms correspond to Manhattan vs Euclidean distance, respectively (roughly, car vs plane travel):

One useful way to view norms is by their isosurface. If you can travel in any direction for a fixed amount of time, the isosurface is the frontier you might sketch.

The L2 isosurface is a circle. The L1 isosurface is a diamond.

• If you don’t change direction, you can travel the “normal” L2 distance.
• If you do change direction, your travel becomes inefficient (since “diagonal” travel along the hypotenuse is forbidden).

The Lp Norm as Superellipse

Consider again the formulae for the L1 and L2 norm. We can generalize these as special cases of the Lp norm:

$\|x\|_p = (\sum_{i=1}^{n} \lvert x_i\rvert^p)^{1/p}$

Here are isosurfaces of six exemplars of this norm family:

On inspection, the above image looks like a square that’s inflating with increasing p. In fact, the Lp norm generates a superellipse.

As an aside, note that the limiting members of the Lp norm family operationalize complexity rather “intuitively”. For the L0 norm, complexity is the number of non-zero parameters. For the Linf norm, complexity is the size of the largest parameter.
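The whole family is easy to compute directly. A small numpy sketch (the vector `v` is an arbitrary example):

```python
import numpy as np

def lp_norm(x, p):
    """Lp norm: (sum_i |x_i|^p)^(1/p)."""
    x = np.abs(np.asarray(x, dtype=float))
    return float((x ** p).sum() ** (1.0 / p))

v = [3.0, -4.0]
# p=1 gives Manhattan distance: |3| + |-4| = 7
# p=2 gives Euclidean distance: sqrt(9 + 16) = 5
# as p grows, the norm approaches the largest |x_i| = 4
```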

Lasso vs Ridge Regression

Why the detour into geometry?

Well, so far, we’ve expressed regularization as $J = \|e\|_p + \lambda \| \theta \|_p$. But most engineers choose between the L1 and L2 norms. The L1 norm is convex but not smooth (it is not differentiable at zero), which tends to make gradient descent more difficult. But the L1 norm is also more robust to outliers, and has other benefits.

Here are two options for the residual norm:

• $\|e\|_2$: sensitive to outliers, but a stable solution
• $\|e\|_1$: robust to outliers, but an unstable solution

The instability of $\|e\|_1$ tends to be particularly thorny in practice, so $\|e\|_2$ is almost always chosen.

That leaves us with two remaining choices:

• Ridge Regression: $J = \|e\|_2 + \lambda\|\theta\|_2$: computationally efficient, but non-sparse output.
• Lasso Regression: $J = \|e\|_2 + \lambda\|\theta\|_1$: computationally less efficient, but sparse output.

What does sparse output mean? For a given model type, say $y = ax^3 + bx^2 + cx + d$ with parameters $(a, b, c, d)$, Ridge regression might output parameters $(3, 0.5, 7.8, -0.4)$ whereas Lasso might give $(3, 0, 7.8, 0)$. In effect, Lasso regression is performing feature selection: locating parameters that can be safely removed. Why should this be?

Geometry to the rescue!

In ridge regression, both gravity wells have circular isosurfaces; their compromises can be reached anywhere on the loss surface. In lasso regression, the diamond-shaped complexity isosurface tends to push compromises towards its corners, which lie on the axes where $\theta_i = 0$. (In higher dimensions, the same geometry applies.)
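We can watch this sparsity emerge with a small numpy sketch. The data is synthetic; ridge is solved via its closed form, and lasso via proximal gradient descent (ISTA), one standard way to fit it. The value of `alpha` is illustrative:

```python
import numpy as np

# Synthetic data: only two of four features actually matter
rng = np.random.default_rng(1)
n, d = 200, 4
X = rng.normal(size=(n, d))
y = X @ np.array([3.0, 0.0, 7.8, 0.0]) + rng.normal(0, 0.5, n)

alpha = 50.0  # regularization strength (illustrative value)

# Ridge has a closed form: theta = (X^T X + alpha I)^(-1) X^T y
ridge = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# Lasso via proximal gradient descent (ISTA): a gradient step on the
# residual term, then soft-thresholding each coordinate toward zero
lasso = np.zeros(d)
step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
for _ in range(5000):
    z = lasso - step * (X.T @ (X @ lasso - y))
    lasso = np.sign(z) * np.maximum(np.abs(z) - step * alpha, 0.0)
```

With this setup, the ridge solution keeps all four coefficients nonzero (merely shrunk), while the lasso solution drives the two irrelevant coefficients exactly to zero.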

Both Ridge and Lasso regression are used in practice, and the details of your application should influence your choice. I’ll also note in passing that “compromise algorithms” like Elastic Net exist, which try to capture the best parts of both.

Takeaways

I hope you enjoyed this whirlwind tour of regularization. For a more detailed look at ridge vs lasso, I recommend reading this.

Until next time.

# An Introduction to Language Models

Part Of: Language sequence
Content Summary: 1500 words, 15 min read

Why Language Models?

In the English language, ‘e’ appears more frequently than ‘z’. Similarly, “the” occurs more frequently than “octopus”. By examining large volumes of text, we can learn the probability distributions of characters and words.

Roughly speaking, statistical structure is distance from maximal entropy. The fact that the above distributions are non-uniform means that English is internally recoverable: if noise corrupts part of a message, the surrounding text can be used to recover the original signal. Statistical structure can also be used to reverse engineer secret codes such as the Caesar cipher.

We can illustrate the predictability of English by generating text based on the above probability distributions. As you factor in more of the surrounding context, the utterances begin to sound less alien, and more like natural language.

A language model exploits the statistical structure of a language to express the following:

• Assign a probability to a sentence $P(w_1, w_2, w_3, \ldots w_N)$
• Assign probability of an upcoming word $P(w_4 \mid w_1, w_2, w_3)$

Language models are particularly useful in language perception, because they can help interpret ambiguous utterances. Three such applications might be,

• Machine Translation: $P(\text{high winds tonight}) > P(\text{large winds tonight})$
• Spelling correction: $P(\text{fifteen minutes from}) > P(\text{fifteen minuets from})$
• Speech Recognition: $P(\text{I saw a van}) > P(\text{eyes awe of an})$

Language models can also aid in language production. One example of this is autocomplete-based typing assistants, commonly displayed within text messaging applications.

Towards N-Grams

A sentence is a sequence of words $\textbf{w} = (w_1, w_2, \ldots, w_N)$. To model the joint probability over this sequence, we use the chain rule:

$p(\text{this is the house})$

$= p(\text{this})p(\text{is}\mid\text{this})p(\text{the}\mid\text{this is})p(\text{house}\mid\text{this is the})$

As the number of words grows, the size of our conditional probability tables (CPTs) quickly becomes intractable. What is to be done? Well, recall the Markov assumption we introduced in Markov chains.

The Markov assumption constrains the size of our CPTs. However, sometimes we want to condition on more (or less!) than just one previous word. Let $v$ denote how many previous words we admit into our context. A variable order Markov model (VOM) allows $v$ elements in its context: $p(s_{t+1} \mid s_{t-v+1}, \ldots, s_{t})$. The size of our CPT is then $n = v+1$, because we must also count the predicted word itself. Thus an N-gram model is a $v$-order Markov model, with $N = v + 1$. By far, the most common choices are unigrams, bigrams, and trigrams:

We have already discussed Markov Decision Processes, used in reinforcement learning applications.  We haven’t yet discussed MRFs and HMMs. VOMs represent a fourth extension: the formalization of N-grams. Hopefully you are starting to appreciate the  richness of this “formalism family”. 🙂

Estimation and Generation

How can we estimate these probabilities? By counting!

Let’s consider a simple bigram language model. Imagine training on this corpus:

This is the cheese.

That lay in the house that Alice built.

Suppose our trained LM encounters the new sentence “this is the house”. It estimates its probability as:

$p(\text{this is the house})$

$= p(\text{this})p(\text{is} \mid \text{this})p(\text{the} \mid \text{is})p(\text{house} \mid \text{the})$

$= \dfrac{1}{12} * 1 * 1 * \dfrac{1}{2} = \dfrac{1}{24}$
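We can verify this arithmetic with a small counting sketch. (Concatenating the two corpus sentences into one token stream, with no START/END tokens yet, is a simplifying assumption that matches the toy computation above.)

```python
from collections import Counter

# Toy corpus: both sentences concatenated, lowercased, punctuation dropped
corpus = "this is the cheese that lay in the house that alice built".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)  # 12 tokens

def p_first(w):
    return unigrams[w] / total                   # unigram estimate

def p_next(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]   # bigram estimate

sentence = "this is the house".split()
p = p_first(sentence[0])
for prev, w in zip(sentence, sentence[1:]):
    p *= p_next(w, prev)
# p = 1/12 * 1 * 1 * 1/2 = 1/24
```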

How many problems do you see with this model? Let me discuss two.

First, we have estimated that $p(\text{this}) = \dfrac{1}{12}$. And it is true that “this” accounts for just one of the twelve tokens in our toy corpus above. But out of two sentences, “this” leads half of them. We can express this fact by adding a special START token into our vocabulary.

Second, recall what happens when language models generate speech. Once they begin a sentence, they are unable to end it! Adding a new END token allows our model to terminate a sentence, and begin a new one.

With these new tokens in hand, we update our products as follows:

A couple other “bug fixes” I’ll mention in passing:

• Out-of-vocabulary words are given zero probability. It helps to add an unknown (UNK) pseudoword and assign it some probability mass.
• LMs prefer very short sentences (sequential multiplication is monotonically decreasing). We can address this by, e.g., normalizing by sentence length.

Smoothing

In the last sentence in the image above, we estimate $p(\text{END} \mid \text{house}) = 0$, because this two-word sequence never appears in our toy corpus. But this causes our language model to fail catastrophically: the sentence is deemed impossible (0% probability).

This problem of zero probability increases as we increase the complexity of our N-grams. Trigram models are more accurate than bigrams, but produce more $p=0$ events. You’ll notice echoes of the bias-variance (accuracy-generalization) tradeoff.

How can we remove zero counts? Why not add one to every count? Of course, we’d then need to increase the denominator, to ensure the probabilities still sum to one. This is Laplace smoothing.
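On our toy corpus, add-one smoothing looks like this (vocabulary size $V$ is added to the denominator so each conditional distribution still sums to one):

```python
from collections import Counter

corpus = "this is the cheese that lay in the house that alice built".split()
vocab = set(corpus)
V = len(vocab)

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_laplace(w, prev):
    # add one to every bigram count, and V to the denominator,
    # so the distribution over next words still sums to one
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

# a previously-unseen bigram now gets a small but nonzero probability
```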

In a later post, we will explore how (in a Bayesian framework) such smoothing algorithms can be interpreted as a form of regularization (MAP vs MLE).

Due to its simplicity, Laplace smoothing is well-known. But several algorithms achieve better performance. How do they approach smoothing?

Recall that a zero-count event in an $N$-gram model need not be zero-count in the $(N-1)$-gram model. For example, it is very possible that the trigram “dancing were thought” has never been seen, even though its component bigrams have.

While a trigram model may balk at the above sentence, we can fall back on the bigram and/or unigram models. This technique underlies the Stupid Backoff algorithm.

As another variant on this theme, some smoothing algorithms train multiple $N$-grams, and essentially use interpolation as an ensembling method. Such models include Good-Turing and Kneser-Ney algorithms.

Beam Search

We have so far seen examples of language perception, which assigns probabilities to text. Let us now consider language production, which generates text from the probabilistic model. Consider machine translation: for a French sentence $\textbf{x}$, we want to produce the English sentence $\textbf{y}$ such that $y^* = \text{argmax}_y \: p(y \mid x)$.

This seemingly innocent expression conceals a truly monstrous search space. Deterministic search has us examine every possible English sentence. For a vocabulary of size $V$, there are $V^2$ possible two-word sentences. For sentences of length $n$, the time complexity of this brute-force algorithm is $O(V^n)$.

Since deterministic search is so costly, we might consider greedy search instead. Consider an example French sentence $\textbf{x}$ “Jane visite l’Afrique en Septembre”. Three candidate translations might be,

• $y^A$: Jane is visiting Africa in September
• $y^B$: Jane is going to Africa in September
• $y^C$: In September, Jane went to Africa

Of these, $y^A$ is the best translation: $p(y^A \mid x)$ is the largest. We would like greedy search to recover it.

Greedy search generates the English translation, one word at a time. If “Jane” is the most probable first word $\text{argmax } p(w_1 \mid x)$, then the next word generated is $\text{argmax } p(w_2 \mid \text{Jane}, x)$. However, it is not difficult to contemplate $p(\text{going}\mid\text{Jane is}) > p(\text{visiting}\mid\text{Jane is})$, since the word “going” is used so much more frequently in everyday conversation. These problems of local optima happen surprisingly often.

The deterministic search space is too large, and greedy search is too confining. Let’s look for a common ground.

Beam search resembles greedy search in that it generates words sequentially. But whereas greedy search only drills one such path in the search tree, beam search drills a finite number of paths. Consider the following example with beamwidth $b=3$:

As you can see, beam search elects to explore $y^A$ as a “second rate” translation candidate despite $y^B$ initially receiving the most probability mass. Only later in the sentence does the language model discover the virtues of the $y^A$ translation. 🙂
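The mechanics can be sketched in a few lines. The conditional probabilities below are invented for illustration (not from any trained model), with `<s>` and `</s>` marking sentence boundaries:

```python
# Hypothetical p(next | previous) tables for a toy language model
model = {
    "<s>": {"jane": 0.8, "in": 0.2},
    "jane": {"is": 1.0},
    "is": {"going": 0.6, "visiting": 0.4},
    "going": {"to": 0.5, "</s>": 0.5},
    "to": {"africa": 1.0},
    "visiting": {"africa": 0.9, "</s>": 0.1},
    "africa": {"</s>": 1.0},
    "in": {"september": 1.0},
    "september": {"</s>": 1.0},
}

def beam_search(model, b=3, max_len=10):
    beams = [(1.0, ["<s>"])]   # the b best partial sentences so far
    done = []                  # completed sentences
    for _ in range(max_len):
        candidates = []
        for p, words in beams:
            for w, pw in model.get(words[-1], {}).items():
                cand = (p * pw, words + [w])
                (done if w == "</s>" else candidates).append(cand)
        beams = sorted(candidates, reverse=True)[:b]  # prune to beam width
        if not beams:
            break
    return max(done)

p_best, best = beam_search(model)
# best == ['<s>', 'jane', 'is', 'visiting', 'africa', '</s>']
```

Here greedy search would commit to “going” (0.6 > 0.4) and finish with probability at most 0.24, while beam search keeps “visiting” alive and later discovers the 0.288 sentence.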

Strengths and Weaknesses

Language models have three very significant weaknesses.

First, language models are blind to syntax. They don’t even have a concept of nouns vs. verbs!  You have to look elsewhere to find representations of pretty much any latent structure discovered by linguistic and psycholinguistic research.

Second, language models are blind to semantics and pragmatics. This is particularly evident in the case of language production: try having your SMS autocomplete write out an entire sentence for you. In the real world, communication is more constrained: we choose the most likely word given the semantic content we wish to express right now.

Third, the Markov assumption is problematic due to long-distance dependencies. Compare the phrase “dog runs” vs “dogs run”. Clearly, the verb suffix depends on the noun suffix (and vice versa). Trigram models are able to capture this dependency. However, if you center-embed prepositional phrases, e.g., “dog/s that live on my street and bark incessantly at night run/s”, N-grams fail to capture this dependency.

Despite these limitations, language models “just work” in a surprising diversity of applications. These models are particularly relevant today because it turns out that Deep Learning sequence models like LSTMs share much in common with VOMs. But that is a story we shall have to take up next time.

Until then.

# An Introduction to Generative Syntax

Part Of: Language sequence
Content Summary: 900 words, 9 min read

Syntax vs Semantics

In language, we distinguish between syntax (structure) and semantics (meaning).

Compare the following:

• “Colorless green ideas sleep furiously”
• “Sleep ideas colorless green furiously”

Both sentences are nonsensical (a semantic transgression). But the first is grammatically correct, whereas the second is malformed.

The brain responds differently to errors of syntax and semantics, as measured by an EEG machine. Semantic errors produce a negative voltage after 400 milliseconds (“N400”); syntactic errors produce a positive voltage after 600 milliseconds (“P600”):

Parts of Speech

To understand syntax more precisely, we must differentiate parts of speech. Consider the following categories:

• Noun (N).  cat, book, computer, peace, …
• Verb (V). jump, chase, eat, sleep, …
• Adjective (A). long, purple, young, old, …
• Determiner (D). the, this, many, all, …
• Preposition (P). in, on, to, for, with, …

Nouns and verbs correspond to perception- and action- representations, respectively. They are an expression of the perception-action cycle. But to study syntax, it helps to put aside semantic context, and explore how parts of speech relate to one another.

Phrases as Color Patterns

To understand syntax intuitively, start by adding color to sentences.  Then try to find patterns of color unique to well-formed sentences.

Let’s get started!

“Noun-like” groups of words appear on either side of the verb. Let noun phrase (NP) denote such a group. Optional parts of speech are indicated by the parentheses. Thus, our grammar contains the following rules:

1. S → NP V NP
2. NP → (D) (A) N

These rules explain why the following sentences feel malformed:

• “Chase dogs cats” (violates rule 1)
• “Old some dogs chase cats” (violates rule 2)

But these rules don’t capture regularities in how verbs are expressed. Consider the following sentences:

A verb phrase (VP) contains a verb, optionally followed by a noun phrase and/or a preposition with its noun phrase.

1. S → NP VP
2. NP → (D) (A) N
3. VP → V (NP) (P NP)

This is better. Did you notice how we improved our sentence (S) rule? 🙂 Object-less sentences (e.g. “She ran”) are now recognized as legal.

Prepositions are not limited to verb phrases, though. They also occur in noun phrases. Consider the following:

Prepositions are sometimes “attached to” a noun phrase. We express these as a prepositional phrase, which includes a preposition (e.g. “on”) and an optional noun phrase (e.g. “the table”).

1. S → NP VP
2. NP → (D) (A) N (PP)
3. VP → V (NP) (PP)
4. PP → P (NP)

Notice how we cleaned up the VP rule, and improved the NP rule.

Congratulations! You have discovered the core rules of English. Of course, a fully complete grammar must also include pronouns (e.g., “yours”), conjunctions (e.g., “and”), and interjections (e.g., “wow!”). But these are fairly straightforward extensions to the above system.
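The four rules above can even be run as a toy recursive-descent parser. This is a sketch, not a serious parser: the tiny lexicon is hypothetical, and each function simply mirrors one rule:

```python
# Hypothetical lexicon mapping words to parts of speech
LEXICON = {"dogs": "N", "cats": "N", "street": "N", "chase": "V",
           "sleep": "V", "the": "D", "old": "A", "on": "P"}

def parse_np(tokens):
    """NP -> (D) (A) N (PP); returns (tree, remaining) or (None, tokens)."""
    i, parts = 0, []
    for cat in ("D", "A"):  # optional determiner, optional adjective
        if i < len(tokens) and LEXICON.get(tokens[i]) == cat:
            parts.append((cat, tokens[i])); i += 1
    if i < len(tokens) and LEXICON.get(tokens[i]) == "N":
        parts.append(("N", tokens[i])); i += 1
        pp, rest = parse_pp(tokens[i:])  # optional PP
        if pp:
            return ("NP", *parts, pp), rest
        return ("NP", *parts), tokens[i:]
    return None, tokens

def parse_pp(tokens):
    """PP -> P (NP)"""
    if tokens and LEXICON.get(tokens[0]) == "P":
        np_, rest = parse_np(tokens[1:])
        if np_:
            return ("PP", ("P", tokens[0]), np_), rest
        return ("PP", ("P", tokens[0])), tokens[1:]
    return None, tokens

def parse_vp(tokens):
    """VP -> V (NP) (PP)"""
    if tokens and LEXICON.get(tokens[0]) == "V":
        parts, rest = [("V", tokens[0])], tokens[1:]
        np_, rest_np = parse_np(rest)
        if np_:
            parts.append(np_); rest = rest_np
        pp, rest_pp = parse_pp(rest)
        if pp:
            parts.append(pp); rest = rest_pp
        return ("VP", *parts), rest
    return None, tokens

def parse_s(sentence):
    """S -> NP VP; succeeds only if every token is consumed."""
    np_, rest = parse_np(sentence.split())
    if np_:
        vp, rest = parse_vp(rest)
        if vp and not rest:
            return ("S", np_, vp)
    return None

tree = parse_s("dogs chase cats")
# -> ('S', ('NP', ('N', 'dogs')), ('VP', ('V', 'chase'), ('NP', ('N', 'cats'))))
```

Note that “Chase dogs cats” fails to parse (no initial NP), exactly as rule 1 predicts.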

These grammatical rules need not only interest English speakers. As we will see later, a variant of these rules appear in all known human languages. This remarkable finding is known as universal grammar. Language acquisition is not about reconstructing syntax rules from scratch. Rather, it is about learning the parameters by which your particular natural language (English, Chinese, Egyptian) varies from the universal script.

From Rules to Trees

Our four rules are polymorphic: they permit more than one kind of structure. Unique rule sets are easier to analyze, so let’s translate our rules into this format:

Importantly, we can conceive of these unique rules as directions to construct a tree. We can conceive of the sentence “Dogs chase cats” as:

Sentences are trees. These trees are not merely used to verify grammatical correctness. They also play a role in speech production, which transforms the language of thought (Mentalese) into natural language (e.g., English). For more on this, see my discussion of the Tripartite Mind.

How can (massively parallel) conscious thought be made into (painfully serial) speech utterances? With syntax! Simply take the concepts you desire to communicate, and construct a tree based on (a common set of) syntactical rules.

Tree construction provides much more clarity on the phenomenon of wordplay (linguistic ambiguity). Consider the sentence “I shot a wolf in my pajamas”. Was the gun fired while you were wearing pajamas? Or was the wolf dressed in pajamas?

Both interpretations agree on parts of speech (colors). It is the higher-order structure that admits multiple choices. In practice, semantics constrain syntax: we tend to select the interpretation that feels the most intuitive.

The Sociology of Linguistics

The above presentation uses a simple grammar, for pedagogic reasons. I will at some point explain the popular X’ theory (pronounced “X bar”), which explores similarities between different phrase structures (e.g., NP vs PP). Indeed, there is a wide swathe of possible grammars that we will explore.

Generative grammar is part of the Symbolist tribe of machine learning. As such, this field has rich connections with algebra, production systems, and logic. For example, propositional logic was designed as the logic of sentences; predicate logic is the logic of phrases.

Other tribes besides the Symbolists care about language and grammar, of course. Natural Language Processing (NLP) and computational linguistics have been heavily influenced by the Bayesian tribe, and use probabilistic grammars (e.g., PCFGs).

More recently, the Connectionist tribe (and deep learning technologies) are taking a swing at producing language. In fact, I suspect neural network interpretability will only be achieved once a Connectionist account of language production has matured.

Takeaways

• Language can be understood via syntax (structure) and semantics (meaning).
• Syntax requires delineating parts of speech (e.g., nouns vs verbs).
• Parts of speech occur in patterns called phrases. We can express these patterns as the rules of syntax.
• Sentences are trees. Syntax rules are instructions for tree construction.
• Sentence-trees provide insight into problems like sentence ambiguity.

For more resources on syntax trees, I recommend this lecture, this website, and this Youtube channel.

Until next time.

# An Introduction to Probability Theory

Part Of: Statistics sequence
Related To: An Introduction to Set Theory
Content Summary: 400 words, 4 min read.

“Probability theory is nothing but common sense reduced to calculation.” – Laplace

Introducing Probability Theory

Probability theory, as formulated by Andrey Kolmogorov in 1933, has two ingredients:

1. A space which define the mathematical objects (“the nouns”)
2. Axioms which define the mathematical operations (“the verbs”)

A probability space is a 3-tuple (Ω,𝓕,P):

1. Sample Space (Ω): A set of possible outcomes, from one or more events. Outcomes in Ω must be mutually exclusive and collectively exhaustive.
2. σ-Algebra (𝓕). A collection of event groupings, or subsets. If Ω is countable, this can simply be the power set; otherwise a Borel algebra is often used.
3. Probability Measure Function (P). A function P: 𝓕 → [0,1] which maps events to real-valued probabilities.

The Kolmogorov axioms provide “rules of behavior” for the residents of probability space:

1. Non-negativity: probabilities can never be negative, P(E) ≥ 0 for every event E.
2. Unitarity: the sum of all probabilities is 1.0 (“something has to happen”)
3. Sigma Additivity: the probability of a countable union of disjoint events equals the sum of their individual probabilities.

Random Variables

A random variable is a real-valued function X: Ω → ℝ. A random variable is a function, not a probability. Rather, instantiating a random variable X = x selects the subset of outcomes ⍵ ∈ Ω such that X(⍵) = x. Thus x picks out its preimage under X. Variable instantiation thereby provides a language for selecting groups of outcomes from Ω.

Random variables with discrete outcomes (countable Ω) are known as discrete random variables. We can define probability mass functions (PMFs) such that:

$f_X(x) = P(X=x) = P(\{\omega \in \Omega : X(\omega) = x\})$

In contrast, continuous random variables have continuous outcomes (uncountable Ω). For this class of variable, the probability of any single outcome is zero. Instead, we must define probabilities against a particular interval. The probability of exactly 5.0000000… inches of snow is 0%; it is more meaningful to discuss the probability of 5 ± 0.5 inches of snowfall. Thus, we define probability density functions (PDFs) such that:

$P[a \leq X \leq b] = \int_a^b f_X(x) \, dx$
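For instance, if snowfall were (hypothetically) normally distributed with mean 5 and standard deviation 1, the interval probability follows from the normal CDF:

```python
import math

def normal_cdf(x, mu=5.0, sigma=1.0):
    """CDF of a normal distribution, computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# P(X = exactly 5.000...) is zero for a continuous variable;
# an interval probability is what is meaningful:
p_interval = normal_cdf(5.5) - normal_cdf(4.5)  # P[4.5 <= X <= 5.5], ~0.383
```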

We can summarize discrete PMFs and continuous PDFs in the following graphic:

Marginal Probabilities

Consider two events, A and B ∈ 𝓕. Several operators may act on these events, which parallel similar devices in Boolean algebra and set theory.

Suppose we want to know the probability of either A or B occurring. For this, we rely on the Set Combination Theorem: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

Union involves subtracting the intersection; else the purple region is counted twice. In our post on set theory, we saw this same idea expressed as the inclusion-exclusion principle (Definition 13).
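We can check the theorem numerically on a fair die (a small sketch using exact fractions):

```python
from fractions import Fraction

omega = set(range(1, 7))   # fair six-sided die
A = {2, 4, 6}              # "roll is even"
B = {1, 2, 3}              # "roll is at most three"

def P(event):
    return Fraction(len(event), len(omega))

lhs = P(A | B)                    # P(A or B)
rhs = P(A) + P(B) - P(A & B)      # subtract the intersection, else {2} is counted twice
# lhs == rhs == 5/6
```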

Summary

This first post in a two-part series explored the first six concepts of probability theory. Next time, we will learn about concepts 7-12.

These definitions and theorems are the cornerstone upon which much reasoning is built. It pays to learn them well.


# Codes and Communication

Part Of: Information Theory sequence
Content Summary: 1000 words, 10 min read

History of Communication Systems

Arguably, three pillars of modernity are: industrialization, democratic government, and communication technology. Today, we examine the latter.

Before 1860, long-distance communication required travel. This made communication across large nations quite challenging. Consider, for example, the continental United States. In 1841, it took four months for the news of the death of President Harrison to reach Los Angeles.

The Pony Express (a mail service built on horsepower) improved wait times to ten days. But it was the telegraph that changed the game. The key idea was to send messages not on paper, but rather through voltage spikes in electric cables. Electrical pulses travel at near the speed of light.

In 1861, the first transcontinental cable was complete, and instantaneous communication became possible. The Pony Express closed its doors two days later.

It is hard to overstate the impact of this technology. These advances greatly promoted information sharing, economic development, and improved governance.

By 1891, thousands of miles of cable had been laid underwater. These pipelines have only become more numerous and powerful over the years. Without them, the Internet would simply be impossible.

Today, we strive to understand the maths of communication.

Understanding Communication

What is communication? The transmission of linguistic information.

What is language? A shared system of reference communicated through symbols.

References (e.g., words) are functions that map symbols to aspects of the physical world. References can denote both objects and actions.

Consider the power set of symbols (all possible combinations of letters). Words represent a subset of this object (a family of sets over an alphabet).

Symbol recognition is medium independent. For example, a word can be expressed either through writing (graphemes) or spoken language (phonemes).

References are the basis of memory. They together build representations of the physical world.

All complex nervous systems construct references. Some animals can communicate (share references). Only humans do so robustly, via syntax.

Semantic interpretations are not restricted to biology. Computers can refer as well. Reference is made possible by symbol grounding.

As the substrate of reference, symbols are the basis of computation. All answerable questions can be solved by a Turing machine.

Semantic aspects of communication are irrelevant to the engineering problem. Coding theory studies symbol sets (alphabets) directly.

Comparing Alphabets

How to compare languages? Let’s find out!

There are 26 symbols in the English alphabet. How many possible three-letter words are there? The answer is 26^3 = 17,576 possible words. More generally:

Possible Messages (M) = Alphabet Size (a) ^ Number of Symbols (X)

$M = a^X$

$\log_a(M) = X$

Information is the selection of specific words (“red”) from the space of possible words.

We might be tempted to associate information with M. But we desire information to scale linearly with length: two books should contain twice as much information as one. Since $M = a^X$ grows exponentially, we say information is $\log(M)$:

$I(X, a) = \log_a(M) = X$

Alphabet size (the logarithmic base) is just a choice of units in this function. Suppose we choose some other base b instead. We can compare alphabets by converting logarithmic base.

Base Conversion: $\log_b(M) = \log_a(M) / \log_a(b)$

$I(X, a) = \log_b(M) = X \log_b(a)$

$I(X) = KX$, where $K = \log_b(a)$

I(X) is known as Shannon information.

We can compare the expressive power of different alphabets. The modern Hawaiian alphabet, for example, has 13 letters. So there are only 13^3 = 2,197 possible three-letter Hawaiian words. The information provided by these respective languages is:

$I(X_{hawaiian}) = \log_{13}(13^X) = X$

$I(X_{english}) = \log_{13}(26^X) = X \log_{13}(26)$

$I(X_{english}) / I(X_{hawaiian}) = \log_{13}(26) \approx 1.270238$

We expect English words to carry about 27% more information than Hawaiian words of equal length, on average. And indeed, this is precisely what we find:

With 3 English letters: 26^3 = 17,576 possible words
With 3.81 Hawaiian letters: 13^(3*1.270238) = 17,576 possible words
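These calculations are easy to verify (a small sketch; the `information` helper is ours, not a standard function):

```python
import math

def information(num_symbols, alphabet_size, base):
    """I(X) = X * log_base(a): information in base-`base` units."""
    return num_symbols * math.log(alphabet_size, base)

ratio = information(3, 26, 13) / information(3, 13, 13)  # = log_13(26)
words = 13 ** (3 * ratio)                                # = 26^3 = 17,576
```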

Translating Between Codes

How does one translate between languages? Consider the word “red”. In Hawaiian, this word is “ula’ula”. We might construct the following function:

• r → ula’
• e → ul
• d → a

But this fails to generalize. The Hawaiian word for rice is “laiki”, which does not begin with a ‘u’.

In general, for natural languages, any such letter-by-letter function f: AE → AH is impossible. Why? Because words (references) map to physical reality in arbitrary ways. Two natural languages are too semantically constrained to afford a simple alphabet-based translation.

Alphabet-based translations are possible, however, if you use a thin language. A thin language only refers when converted back into its host language. Binary is a classic example of a thin language. It has the smallest possible alphabet (size two).

An encoding is a function between alphabets, of type f: A1 → A2. For an example, consider ASCII. This simple encoding is at the root of most modern technologies (including UTF-8, which you are using to view this webpage):
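As a sketch of such an encoding, here is ASCII rendered as fixed-width binary codewords (the helper names are ours, for illustration only):

```python
# ASCII maps each character to a fixed-width (8-bit) binary codeword
def ascii_encode(text):
    return " ".join(format(ord(c), "08b") for c in text)

def ascii_decode(bits):
    return "".join(chr(int(b, 2)) for b in bits.split())

encoded = ascii_encode("red")
# encoded == "01110010 01100101 01100100"
```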

Noise and Discriminability

A communication system has five components: source, transmitter, channel, receiver, and destination.

Source and destination typically share a common system of reference. Imagine two people with the same interpretation of the word “red”, or two computers with the same interpretation of the instruction “lb” (load byte).

Transmitter and receiver also tend to play reciprocal roles. Information is exchanged through the channel (e.g., sound waves, cable).

Receivers reconstruct symbols from the physical medium. Noise causes decoding errors.

How can the transmitter protect the message from error? By maximizing the physical differences between symbols. This is the discriminability principle.

This principle explains why binary is employed by computers and telecommunications. A smaller alphabet improves symbol discriminability, which combats the effect of noise.

Takeaways

• Language is a shared system of reference communicated through symbols
• References are functions that map symbols to aspects of the physical world.
• Symbol recognition is medium independent
• Alphabet size determines expressive power (how many messages are possible)
• An encoding lets you alter (often reduce) a language’s alphabet.
• Such encodings are often desirable because they protect messages from noise.

# An Introduction To Energy

Part Of: Demystifying Physics sequence
Content Summary: 700 words, 7 min reading time.

Energy As Universal Currency

Why does burning gasoline allow a car to move? Chemical reactions and kinetic propulsion seem quite distinct.

How does a magnet pull a nail from the ground? What relation exists between magnetism and gravitational pull?

What must occur for a nuclear reactor to illuminate a light bulb? What connection is there between nuclear physics and light waves?

Energy is the hypothesis of a hidden commonality among the above phenomena. There are many forms of energy: kinetic, electric, chemical, gravitational, magnetic, radiant. But these forms are expressions of a single underlying phenomenon.

A single object may possess many different energy forms simultaneously:

A block of wood thrown into the air will possess kinetic energy because of its motion, gravitational potential energy because of its height above the ground, chemical energy in the wood (which can be burned), heat energy depending on its temperature, and nuclear energy in its atoms (this form is not readily available from our block of wood, but the other forms may be).

Non-physicists worry that physics involves memorizing a giant catalogue of phenomena, each discovered by some guy who got an equation named after him. Energy is the reason why physics is not “stamp collecting”. It allows us to seamlessly switch between different phenomena.

Change As Energy Transformation

The only thing that is constant is change.
-Heraclitus

Certain Greek philosophers were obsessed with change. Physics gives us a language that formalizes these intuitions. Consider the following.

• Energy is the capacity to do work; that is, initiate processes.
• Processes (i.e., work) involve continuous and controlled actions, or changes.

A confusing diversity of phenomena can produce force (the ability to accelerate things), but they all share the same blood. Change is energy transformation.

We can play the “where does the energy come from” game with literally any event in the physical universe:

A Worked Example

Let’s get concrete, and go through a simple illustration of work as energy transformation.

Our example involves a ball sliding down an incline.

To get our bearings, let’s calculate the acceleration experienced by the ball.

To demonstrate conservation of energy, we first need to solve for the ball’s final velocity (v) at the bottom of the ramp. Recall the lesson of kinematic calculus, that displacement, velocity, and acceleration are intimately related:

$x(t) = \int v(t) \, dt = \iint a(t) \, dt \, dt$

$a(t) = v'(t) = x''(t)$

We can use these formulae to calculate final velocity (for a tutorial see here).

So the ball’s final velocity will be $6.26$ meters per second ($v = \sqrt{4g}$ m/s). However, recall the classical definitions of kinetic and potential (gravitational) energy, which are $KE = \frac{1}{2}mv^2$ and $PE = mgh$.

If conservation of energy is true, then we should expect the following to hold:

$(KE + PE)_{final} - (KE + PE)_{initial} = 0$

Is this in fact the case? Yes!

$m \left[ \left( \frac{(\sqrt{4g})^2}{2} + 0 \right) - (0 + 2g) \right] = m[2g - 2g] = 0$

Total energy of the ball stays the same across these two points. Conservation of energy can also be demonstrated at any other time in the ball’s journey. In fact, we can show that potential energy is smoothly converted to kinetic energy (image credit Physics Classroom).
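We can verify this arithmetic numerically. A short sketch, assuming (as the worked numbers imply) a 2-meter drop, a frictionless ramp, and g = 9.8 m/s²:

```python
import math

g = 9.8  # m/s^2
h = 2.0  # m (assumed drop height; PE_initial = 2mg in the example)
m = 1.0  # kg (any value works -- mass cancels)

v = math.sqrt(2 * g * h)     # final speed from kinematics: v = sqrt(2gh) = sqrt(4g)
KE_final = 0.5 * m * v ** 2  # kinetic energy at the bottom
PE_initial = m * g * h       # gravitational potential energy at the top

print(round(v, 2))           # 6.26 m/s
print(KE_final - PE_initial) # ≈ 0: total energy is conserved
```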

Why is the ball transforming its energy $PE \rightarrow KE$? Because it is experiencing a force.

Cosmological Interpretation

Just as space and time are expressions of a single spacetime fabric, Einstein also demonstrated mass-energy equivalence. Mass is simply a condensed form of energy, capable of being released in e.g., nuclear explosions. Do not confuse mass and matter. Mass and energy are (interchangeable) properties of matter.

So there are two components of reality: energymass and spacetime. Spacetime bends for energymass. Energy is conserved, but has many faces. The laws of physics (quantum field theory and general relativity) describe the interactions between energymass and spacetime. And since we know that energy is conserved, all there is today was all there was at the beginning of time.

If the amount of energy contained in our universe doesn’t change, what is this quantity? How much energy is there? One strong candidate theory is zero. The flat universe hypothesis rests on the realization that gravitational energy is negative, and claims that it counterbalances all other forms of energy.

Next time, we will explore the distinction between usable vs inaccessible energy. Until then.

# An Introduction to Prospect Theory

Part Of: Neuroeconomics sequence
Content Summary: 1500 words, 15 min reading time

Preliminaries

Decisions are bridges between perception and action. Not all decisions are cognitive. Instead, they occur at all levels of the abstraction hierarchy, and include things like reflexes.

Theories of decision tend to constrain themselves to cognitive phenomena. They come in two flavors: descriptive (“how does it happen”) and normative (“how should it happen”).

Decision making often occurs in the context of imperfect knowledge. We may use probability theory as a language to reason about uncertainty.

Let risk denote variance in the probability distribution of possible outcomes. Risk can exist regardless of whether a potential loss is involved. For example, a prospect that offers a 50-50 chance of paying $100 or nothing is more risky than a prospect that offers $50 for sure – even though the risky prospect entails no possibility of losing money.

Today, we will explore the history of decision theory, and the emergence of prospect theory. As the cornerstone of behavioral economics, prospect theory provides an important theoretical surface to the emerging discipline of neuroeconomics.

Maximizing Profit with Expected Value

Decision theories date back to the 17th century, and a correspondence between Pascal and Fermat. There, consumers were expected to maximize expected value (EV), which is defined as probability p multiplied by outcome value x.

$EV = px$

To illustrate, consider the following lottery tickets:

Suppose each ticket costs 50 cents, and you have one million dollars to spend. Crucially, it doesn’t matter which ticket you buy! Each of these tickets has the same expected value: $1. Thus, it doesn’t matter if you spend the million dollars on A, B, or C – each leads to the same amount of profit.

The above tickets have equal expected value, but they do not have equal risk. We call people who prefer choice A risk averse, whereas someone who prefers C is risk seeking.

Introducing Expected Utility

Economic transactions can be difficult to evaluate. When trading an apple for an orange, which is more valuable? That depends on a person’s unique tastes. In other words, value is subjective.

Let utility represent subjective value. We can treat utility as a function u() that operates on objective outcome x. Expected utility, then, is highly analogous to expected value:

$EU = pu(x)$

Most economists treat utility functions as abstractions: people act as if motivated by a utility function. Neuroeconomic research, however, suggests that utility functions are physically constructed by the brain.

Every person’s utility function may be different. If a person’s utility curve is linear, then expected utility converges onto expected value:

$EU \rightarrow EV \mid u(x) = x$

Recall, in the above lottery, the behavioral distinction between risk aversion (preferring ticket A) and risk seeking (preferring C). In practice, most people prefer A. Why? We can explain this behavior by appealing to the shape of the utility curve! Utility concavity produces risk aversion:

In the above, we see the first $50 (first vertical line) produces more utility (first horizontal line) than the second $50. Intuitively, the first $50 is needed more than the second $50. The larger your wealth, the less your need. This phenomenon is known as diminishing marginal returns.
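The earlier 50-50 example makes this concrete. A small sketch, using a square-root utility curve as a stand-in for any concave u(x) (the specific function is my choice, purely for illustration):

```python
import math

u = math.sqrt  # a concave utility function: diminishing marginal returns

# Risky prospect: 50-50 chance of $100 or nothing. Safe prospect: $50 for sure.
EV_risky = 0.5 * 100 + 0.5 * 0  # expected value: $50
EV_safe = 1.0 * 50              # expected value: $50 -- identical

EU_risky = 0.5 * u(100) + 0.5 * u(0)  # expected utility: 5.0
EU_safe = 1.0 * u(50)                 # expected utility: ~7.07

print(EU_safe > EU_risky)  # True: equal EV, but concave utility prefers the sure thing
```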
Neoclassical Economics

In 1947, von Neumann and Morgenstern formulated a set of axioms that are both necessary and sufficient for representing a decision-maker’s choices by the maximization of expected utility. Specifically, if you assume an agent’s preference set accommodates these axioms…

1. Completeness. People have preferences over all lotteries. $\forall L_1, L_2 \in L$ either $L_1 \leq L_2$ or $L_1 \geq L_2$ or $L_1 = L_2$
2. Transitivity. Preferences are expressed consistently. $\forall L_1, L_2, L_3 \in L$ if $L_1 \leq L_2$ and $L_2 \leq L_3$ then $L_1 \leq L_3$
3. Continuity. Preferences are expressed as probabilities. $\forall L_1, L_2, L_3 \in L$, $\exists \alpha, \beta$ s.t. $L_1 \geq L_2 \geq L_3$ iff $\alpha L_1 + (1-\alpha)L_3 \geq L_2 \geq \beta L_1 + (1 - \beta)L_3$
4. Independence of Irrelevant Alternatives (IIA). Binary preferences don’t change by injecting a third lottery.

… then those preferences always maximize expected utility.

$L_1 \geq L_2$ iff $\sum p_1 u(x_1) \geq \sum p_2 u(x_2)$

The above axioms constitute expected utility theory, and form the cornerstone of neoclassical economics. Expected utility theory bills itself as both a normative and a descriptive theory: it claims that we understand human decision making, and have a language to explain why it is correct.

Challenges To The Independence Axiom

In the 1970s, expected utility theory came under heavy fire for failing to predict human behavior. The emerging school of behavioral economics gathered empirical evidence that the von Neumann-Morgenstern axioms were routinely violated in practice, especially the Independence Axiom (IIA).

For example, the Allais paradox asks our preferences for the following choices:

Most people prefer A (“certain win”) and D (“bigger number”). But these preferences are inconsistent, because C = 0.01A and D = 0.01B. The independence axiom instead predicts that A ≽ B if and only if C ≽ D.
The Decoy effect is best illustrated with popcorn:

Towards a Value Function

Concurrently with these criticisms of the independence axiom, the heuristics and biases literature (led by Kahneman and Tversky) began to discover new behaviors that demanded explanation:

• Risk Aversion. In most decisions, people tend to prefer smaller variance in outcomes.
• Loss Aversion. Everyone prefers gains over losses, of course; loss aversion reflects that losses are felt more intensely than gains of equal magnitude.
• The Endowment Effect. Things you own are intrinsically valued more highly.
• The Framing Effect. Framing decisions as gains or as losses affects choice behavior.

Each of these behavioral findings challenges expected utility theory, and cumulatively they demanded a new theory. In 1979, Kahneman and Tversky put forward prospect theory to explain all of the above effects. Their biggest innovation was to rethink the utility function.

Do you recall how neoclassical economics appealed to $u(x)$ concavity to explain risk aversion? Prospect theory takes this approach yet further, and seeks to explain all of the above behaviors using a more complex shape of the utility function.

Let the value function $\textbf{v(x)}$ represent our updated notion of utility. We can define the expected prospect $\textbf{EP}$ of a lottery as probability multiplied by the value function:

$EP = pv(x)$

Terminology aside, each theory differs only in the shape of its outcome function. Let us now look closer at the shape of $v(x)$:

This shape allows us to explain the above behaviors:

The endowment effect captures the fact that we value things we own more highly. The reference point in $v(x)$, where $x = 0$, captures the status quo. Thus, the reference point allows us to differentiate gains and losses, thereby producing the endowment effect.

Loss aversion captures the fact that losses are felt more strongly than gains. The magnitude of $v(x)$ is larger in the loss dimension. This asymmetry explains loss aversion.
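For illustration, here is one standard parameterization of $v(x)$. The exponent and loss-multiplier values (0.88 and 2.25) come from Tversky and Kahneman’s later (1992) estimates, not from this article; treat them as illustrative assumptions:

```python
def v(x, alpha=0.88, lam=2.25):
    """Prospect-theory value function: concave for gains, convex and steeper for losses."""
    if x >= 0:
        return x ** alpha
    return -lam * (-x) ** alpha

gain, loss = v(50), v(-50)
print(gain)              # ~31.3 units of value for a $50 gain
print(loss)              # negative, and larger in magnitude
print(abs(loss) > gain)  # True: losses loom larger than gains
```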
We have already explained risk aversion via concavity of the utility function $u(x)$. $v(x)$ retains this concavity for material gains. Thus, we have retained our ability to explain risk aversion in situations of possible gains. For losses, $v(x)$ convexity predicts risk seeking.

Towards a Weight Function

Another behavioral discovery, however, immediately put prospect theory in doubt:

• The Fourfold Pattern. For situations that involve very high or very low probabilities, participants often switch their approaches to risk.

To be specific, here are the four situations and their resultant behaviors:

1. Fear of Disappointment. With a 95% chance to win $100, most people are risk averse.
2. Hope To Avoid Loss. With a 95% chance to lose $100, most people are risk seeking.
3. Hope Of Large Gain. With a 5% chance to win $100, most people are risk seeking.
4. Fear of Large Loss. With a 5% chance to lose $100, most people are risk averse.

Crucially, $v(x)$ fails to predict this behavior. As we saw in the previous section, it predicts risk aversion for gains, and risk seeking for losses:

Failed predictions are not a death knell to a theory. Under certain conditions, they can inspire a theory to become stronger! Prospect theory was improved by incorporating a more flexible weight function:

$EP = pv(x) \rightarrow EP = w(p)v(x)$

where $w(p)$ has the following shape:

There are in fact two weight functions:

1. Explicit weights represent probabilities learned through language; e.g., when reading the sentence “there is a 5% chance of reward”.
2. Implicit weights represent probabilities learned through experience; e.g., when the last 5 out of 100 trials yielded a reward.

This change adds some mathematical muscle to the ancient proverb: humans don’t handle extreme probabilities well. And indeed, the explicit weight function successfully recovers the fourfold pattern:

Takeaways

Today we have reviewed theories of expected value, expected utility (neoclassical economics), and prospect theory. Each theory corresponds to a particular set of conceptual commitments, as well as a particular formula:

$EV = px$

$EU = pu(x)$

$EP = w(p)v(x)$

However, we can unify these into a single value formula V:

$V = w(p)v(x)$

In this light, EV and EU have the same structure as prospect theory. Prospect theory distinguishes itself by using empirically motivated shapes:

With these tools, prospect theory successfully recovers a wide swathe of economic behaviors. Until next time.

# Markov Decision Processes

Part Of: Reinforcement Learning sequence
Followup To: An Introduction To Markov Chains
Content Summary: 900 words, 9 min read

Motivations

Today, we turn our gaze to Markov Decision Processes (MDPs), a decision-making environment which supports our propensity to learn from good and bad outcomes. We represent outcome desirability with a single number, R.
This value is used to refine action selection: given a particular situation, what action will maximize expected reward?

In biology, we can describe the primary work performed by an organism as maintaining homeostasis: preserving metabolic energy reserves, body temperature, etc in a widely varying world.

Cybernetics provides a clear way of conceptualizing biological reward. In Neuroendocrine Integration, we discussed how brains must respond both to internal and external changes. This dichotomy expresses itself as two perception-action loops: a visceral body-oriented loop, and a cognitive world-centered one.

Rewards are computed by the visceral loop. To a first approximation, rewards encode progress towards homeostasis. Food is perceived as more rewarding when the body is hungry; this is known as alliesthesia. Reward information is delivered to the cognitive loop, which helps refine its decision making.

Extending Markov Chains

Recall that a Markov Chain contains a set of states S, and a transition model P. A Markov Decision Process (MDP) extends this device by adding three new elements. Specifically, an MDP is a 5-tuple (S, P, A, R, ɣ):

• A set of states s ∈ S
• A transition model Pa(s’ | s)
• A set of actions a ∈ A
• A reward function R(s, s’)
• A discount factor ɣ

To illustrate, consider GridWorld. In this example, every location in this two-dimensional grid is a state, for example (1,0). State (3,0) is a desirable location: R(s(3,0)) = +1.0, but state (3,1) is undesirable: R(s(3,1)) = -1.0. All other states are neutral.

Gridworld supports four actions, or movements: up, down, left, and right. However, locomotion is imperfect: if Up is selected, the agent will only move up with 80% probability; 20% of the time it will go left or right instead. Finally, attempting to move into a forbidden square will simply return the agent to its original location (“hitting the wall”).
The core problem of MDPs is to find a policy (π), a function that specifies the agent’s response to all possible states. In general, policies should strive to maximize reward, e.g., something like this:

Why is the policy at (2,2) Left instead of Up? Because (2,1) is dangerous: despite selecting Up, there is a 10% chance that the agent will accidentally move Right, and be punished.

Let’s now consider an environment with only three states: A, B, and C. First, notice how different policies change the resultant Markov Chain:

This observation is important. Policy determines the transition model.

Towards Policy Valuation V(s)

An agent seeks to maximize reward. But what does that mean, exactly? Imagine an agent selects 𝝅1. Given the resultant Markov Chain, we already know how to use matrix multiplication to predict future locations St. The predicted reward Pt is simply the dot product of expected location and the reward function:

$P_t = S_t \cdot R$

We might be tempted to define the value function V(s) as the sum of all predicted future rewards:

$V_0(S) = P_0 + P_1 + P_2 + P_3 + \dots = \sum{P_k}$

However, this approach is flawed. Animals value temporal proximity: all else equal, we prefer to obtain rewards quickly. This is temporal discounting: as rewards are further removed from the present, their value is discounted. In reinforcement learning, we implement temporal discounting with the gamma parameter: rewards that are k timesteps away are multiplied by the exponential discount factor $\gamma^k$. The value function becomes:

$V_0(S) = P_0 + \gamma P_1 + \gamma^2 P_2 + \gamma^3 P_3 + \dots = \sum{\gamma^k P_k}$

Without temporal discounting, V(s) can approach infinity. But exponential discounting ensures V(s) equals a finite value. Finite valuations promote easier computation and comparison of state evaluations. For more on temporal discounting, and an alternative to the RL approach, see An Introduction to Hyperbolic Discounting.
Intertemporal Consistency

In our example, at time zero our agent starts in state A. We have already used linear algebra to compute our Pk predictions. To calculate value, we simply compute $\sum{\gamma^k P_k}$:

$V_0(A) = 0 + 0 + 0.64 \gamma^2 + 0.896 \gamma^3 + \dots$

Agents compute V(s) at every time step. At t=1, two valuations are relevant:

$V_1(A) = 0 + 0 + 0.64 \gamma^2 + \dots$

$V_1(B) = 0 + 0.8 \gamma + 0.96 \gamma^2 + \dots$

What is the relationship between the value functions at t=0 and t=1? To answer this, we need to multiply each term by $\gamma P(X|A)$, where $X$ is the state being considered at the next time step.

$W_1(A) \triangleq \gamma P(A|A)V_1(A) = \gamma 0.2 V_1(A)$

$W_1(A) = 0 + 0 + (0.2)(0.64)\gamma^3 + \dots$

Similarly,

$W_1(B) \triangleq \gamma P(B|A)V_1(B) = \gamma 0.8 V_1(B)$

$W_1(B) = 0 + (0.8)(0.8) \gamma^2 + (0.8)(0.96) \gamma^3 + \dots$

Critically, consider the sum $X_0 = r_0(A) + W_1(A) + W_1(B)$:

$X_0 = 0 + 0 + 0.64 \gamma^2 + 0.896 \gamma^3 + \dots$

Does $X_0$ look familiar? That’s because it equals $V_0(A)$! In this way, we have a way of equating valuations at t=0 and t=1. This property is known as intertemporal consistency.
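We can check intertemporal consistency numerically. A sketch using the three-state chain from the example (A→B with 0.8, B→C with 0.8, C absorbing); the reward-1-per-timestep-in-C convention and the discount factor γ = 0.9 are assumptions for illustration:

```python
import numpy as np

T = np.array([[0.2, 0.8, 0.0],   # A -> A (0.2), A -> B (0.8)
              [0.0, 0.2, 0.8],   # B -> B (0.2), B -> C (0.8)
              [0.0, 0.0, 1.0]])  # C is absorbing
R = np.array([0.0, 0.0, 1.0])    # reward 1 per timestep spent in C
gamma = 0.9                      # assumed discount factor

def value(start, horizon=200):
    """V(s) = sum_k gamma^k P_k, where P_k = S_k . R."""
    S = np.eye(3)[start]
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        total += discount * (S @ R)
        S = S @ T
        discount *= gamma
    return total

V0_A = value(0)
# Intertemporal consistency: V_0(A) = r_0(A) + gamma * (0.2 V_1(A) + 0.8 V_1(B))
X0 = 0.0 + gamma * (0.2 * value(0) + 0.8 * value(1))
print(np.isclose(V0_A, X0))  # True
```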

Bellman Equation

We have seen that $V_0(A) = X_0$. Let’s flesh out this equation, and generalize to time t.

$V_t(s) = r_t(s) + \gamma \sum_{s'}{P(s'|s)V_{t+1}(s')}$

This is the Bellman Equation, and it is a central fixture in control systems. At its heart, we define value in terms of both immediate reward and future predicted value. We thereby break up a complex problem into small subproblems, a key optimization technique that can be approached with dynamic programming.
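The Bellman Equation also suggests an iterative algorithm: start from V = 0 and repeatedly apply the backup until it converges to a fixed point. A sketch for the same three-state chain (γ = 0.9 and the reward-1-in-C convention are my assumptions, not fixed by the text):

```python
import numpy as np

T = np.array([[0.2, 0.8, 0.0],
              [0.0, 0.2, 0.8],
              [0.0, 0.0, 1.0]])  # policy-induced transition model
r = np.array([0.0, 0.0, 1.0])   # immediate reward per state
gamma = 0.9                     # assumed discount factor

V = np.zeros(3)
for _ in range(500):
    # Bellman backup: V(s) = r(s) + gamma * sum_s' P(s'|s) V(s')
    V = r + gamma * (T @ V)

print(V)  # V(C) = 1 / (1 - gamma) = 10; V(A) and V(B) discount that payoff
```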

Next time, we will explore how reinforcement learning uses the Bellman Equation to learn strategies with which to engage its environment (the optimal policy 𝝅). See you then!

# An Introduction To Markov Chains

Part of: Reinforcement Learning sequence
Related to: An Introduction to Linear Algebra
Content Summary: 700 words, 7 min read

Motivation

We begin with an example.

Suppose a credit union classifies automobile loans into four categories: Paid in Full (F), Good Standing (G), Behind Schedule (B), Collections (C).

Past records indicate that each month, accounts in good standing change as follows: 10% pay the loan in full, 10% fall behind on payments, 80% remain in good standing.

Similarly, bad loans historically change every month as follows: 10% are paid in full, 40% return to good standing, 40% remain behind schedule, 10% are sent to collection.

A Markov Chain allows us to express such situations graphically:

Loan statuses are nodes, transition probabilities are arrows.

Markov Property

Formally, a Markov Chain is a tuple (S, P)

• A set of states s ∈ S
• A transition model P(s’ | s).

At the core of Markov Chains is the Markov Property, which states (for time t = n):

$P(s_n | s_{n-1}, s_{n-2}, \dots, s_0) = P(s_n | s_{n-1})$

This is a statement of conditional independence. If I tell you the history of all prior states, and ask you to predict the next time step, you can forget everything except the present state. Informally, a complete description of the present screens off any influence of the past. Thus, the Markov Property ensures a kind of “forgetful” system.

Any model which relies on the Markov Property is a Markov Model. Markov models represent an important pillar in the field of artificial intelligence. Three extensions of Markov Chains are particularly important:

State Expectations by Unweaving

Let’s imagine a Markov Chain with three states: A, B, and C.  If you begin at A, where should you expect to reside in the future?

An intuitive way to approach this question is to “unweave” the Markov Chain, as follows:

Each branch in this tree represents a possible world. For example, at t1 there is a 20% chance the state will be A, and an 80% chance the state will be B. Computing expected locations for subsequent timesteps becomes straightforward enough. At t2, we see that:

• There is an (0.2)(0.2) = 4% chance of residing in A.
• There is an (0.8)(0.2) + (0.2)(0.8) = 32% chance of residing in B.
• There is an (0.8)(0.8) = 64% chance of residing in C.

The above computations can be expressed with a simple formula:

$S_t(s) = \sum_{paths}\prod_{edges} P(s|s')$

However, these computations become tedious rather quickly. Consider, for example, S3(C):

State Expectations By Linear Algebra

Is there a way to simplify the maths of expectation?

Yes, by approaching  Markov Chains through the lens of linear algebra. Conditional probabilities are encoded as transition matrices, as follows:

This representation enables computation of expected location by matrix multiplication:

$S_{t+1} = S_t T$

We compute expectation timesteps sequentially.  By defining a base case and an inductive step, this process qualifies as mathematical induction.

As you can see, these maths are equivalent: S3(C) = 0.896 in both cases.
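Here is the matrix version worked out in NumPy for the three-state chain above:

```python
import numpy as np

# Transition matrix for states (A, B, C): A -> B with 0.8, B -> C with 0.8,
# each state retains itself with the remaining probability; C is absorbing.
T = np.array([[0.2, 0.8, 0.0],
              [0.0, 0.2, 0.8],
              [0.0, 0.0, 1.0]])

S = np.array([1.0, 0.0, 0.0])  # start in state A
for t in range(3):
    S = S @ T                  # S_{t+1} = S_t T
    print(t + 1, S)

# After three steps: S_3 = [0.008, 0.096, 0.896], so S_3(C) = 0.896
```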

In the above example, C is called an absorbing state. As time goes to infinity, the agent becomes increasingly likely to reside in state C.  That is, Sn = [0 0 1] as n→∞. This finding generalizes. Every Markov Chain that contains a (reachable) absorbing state converges on a distribution in the limit, or limiting distribution.

Can we discover the limiting distribution?

Yes, with the following recipe. First, convert the transition matrix into standard form. Second, apply matrix multiplication and inversion to derive the fundamental and limiting matrix. Last, use these matrices to answer real-world questions about our data:

Let me illustrate with our automotive loan example. First, we prepare our data.

With T in standard form, we compute F = (I – Q)^{-1} and T’.
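As a sketch of this recipe applied to the loan example (transient states G and B; absorbing states F and C; all probabilities taken from the monthly figures given above):

```python
import numpy as np

# Transient-to-transient block Q (rows/cols: G, B) and
# transient-to-absorbing block R (cols: F, C), from the monthly percentages.
Q = np.array([[0.8, 0.1],
              [0.4, 0.4]])
Rm = np.array([[0.1, 0.0],
               [0.1, 0.1]])

F = np.linalg.inv(np.eye(2) - Q)  # fundamental matrix F = (I - Q)^-1
absorb = F @ Rm                   # long-run absorption probabilities

print(F)       # expected number of months spent in each transient state
print(absorb)  # a good-standing loan is eventually paid in full 87.5% of the time
```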

Now that we know F and T’, we are in a position to answer questions with our data.

Thus, we are able to predict how Markov Chains will behave over the long run.

Takeaways

• Markov Chains are convenient ways of expressing conditional probabilities graphically
• But they require the Markov Property: that knowledge of the present screens off any influence of the past.
• We can compute expected locations by reasoning graphically.
• However, it is simpler to compute expected locations by linear algebra techniques.
• Linear algebra also enables us to discover what (some) Markov chains will approach, their limiting distribution.

Further Resources

• To gain more intuition with linear algebra, see here.
• To see Markov Chains applied to real-world situations, see here.
• To see steady-state computations worked out in more detail, see here.

# An Introduction To Primate Societies

Part Of: Anthropogeny sequence
Content Summary: 900 words, 9min read

Introduction

Primates are a relatively young branch of the mammalian clade. Their anatomical characteristics are as follows:

There are three kinds of primate: prosimians (e.g., lemurs), monkeys (e.g., macaques), and apes (e.g., humans).

Primates are known for their large brains and a social lifestyle. Today, we will explore the dynamics of primate societies (defined as frequently interacting members of the same species).

There are three components of any society: the mating system (including sexual dynamics), the social organization (spatiotemporal organization of interaction), and the social structure (relational dynamics & social roles).

Sexual Dynamics

Because DNA is creepy, it programs bodies to make more copies of itself. Men and women are programmed with equally strong imperatives for gene replication (reproductive success). But female pregnancy powerfully breaks the symmetry:

• Women spend more metabolic & temporal resources rearing children.
• Women are certain that their offspring are their own; men can experience ambiguity.
• A single woman can only produce one child at a time; a single man can impregnate many women concurrently.

It is because of pregnancy that males court females, and females choose males.

For females, paternal care is of paramount importance: finding a mate willing to share the burden of raising a child. For males, fecundity is key.

We can see echoes of this asymmetry today. In all human cultures observed,

• Women tend to be more jealous of emotional infidelity. Men have more violent reactions to sexual infidelity.
• Women are statistically more interested in male social status and resources. Men pay comparatively more attention to physical beauty.

These gender differences arise as a response to the biological mechanism of pregnancy.  These are contingent facts, nothing more. Species with male gestation, such as the seahorse, witness the reversal of such “gender roles”.

Four Mating Systems

From a logical perspective, there are exactly four possible mating systems.

Which mating system is biologically preferable? That depends on your gender:

• Females benefit from polyandry, with multiple males available to raise offspring.
• Males maximize their genetic impact with polygyny.

Most primates are polygynous. Why?

The answer is geographic. To survive, an animal must travel to surrounding land, locating flora or fauna to satisfy its metabolic budget. The amount of land it covers is known as its territory. The more fertile the land, the smaller the territory (less need to travel).

To mate with a female, a male will – of course – enter into that female’s territory. Thus, we can visualize each mating system from the lens of territory:

Mating systems are determined by female territory size.

• If males can encompass the territories of multiple females, males will select polygyny (or, more rarely, promiscuity).
• Otherwise, if females do not live in defensible groups, males will typically revert to monogamy (or, if females are sparse, polyandry).

In turn, female territory size is determined by environmental conditions. If the terrain is sparse, a female must travel further to sustain itself, and vice versa.

Our causal chain goes: plentiful land → smaller female territory size → polygyny. This is the Environmental Potential for Polygyny.

Three Social Organizations

The vast majority of primates are group living: they forage & sleep with bisexual groups of at least three adults. They spend most of their waking lives in the presence of one another. In other mammals, such group living is much less common.

Primates (e.g., humans) did not originally choose to live in groups because of their sociality. Predation risk induced group living. Only afterwards did primate brains adapt to this new lifestyle.

Some primates are exceptions to this rule. Two other, rarer, varieties of primate social organizations exist:

Some primates are solitary, foraging on their own. These species tend to be nocturnal. With less predation risk, individuals need not share territory.

Other primates live in pair bonds, a male-female pair. The attachment system is employed by infants to attach to their mothers: monogamous primates redeploy this system to support adult commitment. That said, primate monogamy only occurs when females live in an area that is difficult to defend.

We have seen 4 mating systems, and 3 social organizations. These are not independent:

• Pair living and monogamy correlate. However, few primates live in such systems (thin line)
• Group living and polygyny correlate: both are promoted by overlapping female territories. Most primates occupy this arrangement (thick line).

Structure: Dominance Hierarchy

When animals’ territory overlaps, they often compete (fight) for access to resources (food and reproductive access).

Fighting is accompanied with risk: the stronger animal could be unlucky, the weaker animal could lose their life. Similar to human warfare, both sides suffer less when the weaker side pre-emptively surrenders. The ability to objectively predict the outcome of a fight is therefore advantageous.

Suppose the need for fight-predictions is frequent, and their outcomes do not often change (physical strength changes only slowly over an animal’s life). Instead of constantly assessing the physical characteristics of your opponent, it is simpler to just remember who you thought was stronger last time.

This is the origin of the dominance hierarchy. The bread and butter of dominance hierarchies is status signaling. Dominant behaviors (e.g., snarling) evoke submissive behaviors (e.g., looking away).

Takeaways

We have explored three aspects of primate societies: mating system, social organization and social structure. Each of these is driven by external, ecological factors.

Primate niches typically feature high predation risk and fertile terrain. These promote female grouping, which in turn attracts males to live with them in groups, under a polygynous mating system.

Primates are unique for successfully living in groups throughout their long lifespan. To support this ability, primate brain volume increased, and came to provide increasingly sophisticated cognitive mechanisms & social structures.

We will explore the evolution of social structure next time. See you then!

References

• Kappeler & Schaik, 2001: Evolution of Primate Social Systems.