Intro to Regularization

Part Of: Machine Learning sequence
Followup To: Bias vs Variance, Gradient Descent
Content Summary: 1100 words, 11 min read

In Intro to Gradient Descent, we discussed how loss functions allow optimization methods to locate high-performance models.

But in Bias vs Variance, we discussed how model performance isn’t the only thing that matters. Simplicity promotes generalizability.

One way to enhance simplicity is to receive the model discovered by gradient descent, and manually remove unnecessary parameters.

But we can do better. In order to automate parsimony, we can embed our preference for simplicity into the loss function itself.

But first, we need to quantify our intuitions about complexity.

Formalizing Complexity

Neural networks are often used as classification models over large collections of images. The complexity of these models tends to correlate with the number of layers. For some models, then, complexity is captured by the number of parameters.

While not used much in the industry, polynomial models are pedagogically useful examples of regression models. Here, the degree of the polynomial expresses the complexity of the model: a degree-eight polynomial has more “bumps” than a degree-two polynomial.

Consider, however, the difference between the following regression models:

y_A = 4x^4 + 0.0001x^3 + 0.0007x^2 + 2.1x + 7

y_B = 4x^4 + 2.1x + 7

Model A uses five parameters; Model B uses three. But their predictions are, for all practical purposes, identical. Thus, the size of each parameter is also relevant to the question of complexity.
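
To make this concrete, here is a minimal sketch (in Python with NumPy, an illustrative choice) that evaluates both models at a few points and confirms their predictions barely differ:

  import numpy as np

  x = np.linspace(-10, 10, 5)
  y_A = 4*x**4 + 0.0001*x**3 + 0.0007*x**2 + 2.1*x + 7   # five parameters
  y_B = 4*x**4 + 2.1*x + 7                                # three parameters

  print(np.max(np.abs(y_A - y_B)))   # worst-case gap on this grid: about 0.17
  print(np.max(np.abs(y_A)))         # the predictions themselves: about 40,000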

The above approaches rely on the model’s parameters (its “visceral organs”) to define complexity. But it is also possible to rely on the model’s outputs (its “behaviors”) to achieve the same task. Consider again the classification decision boundaries above. We can simply measure the spatial frequency (the “squiggliness” of the boundary) as another proxy towards complexity.

Here, then, are three possible criteria for complexity:

  1. Number of parameters
  2. Size of parameters
  3. Spatial frequency of decision manifold

Thus, operationalizing the definition of “complexity” is surprisingly challenging.

Mechanized Parsimony

Recall our original notion of the performance-complexity quadrant. By defining our loss function exclusively in terms of the residual error, gradient descent learns to prefer accurate models (to “move upward”). Is there a way to induce leftward movement as well?

To have gradient descent respond to both criteria, we can embed them into the loss function. One simple way to accomplish this: addition.

This technique is an example of regularization.

Depending on the application, sometimes the errors are much larger than the parameters, or vice versa. To ensure the right balance between these terms, people usually add a hyperparameter \lambda to the regularized loss function: J = \|e\|_2 + \lambda \|\theta\|_2
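
As a minimal sketch (assuming a linear model and NumPy, both illustrative choices), the regularized loss might look like this:

  import numpy as np

  def regularized_loss(theta, X, y, lam):
      e = X @ theta - y                                    # residual vector
      # J = ||e||_2 + lambda * ||theta||_2
      return np.linalg.norm(e, 2) + lam * np.linalg.norm(theta, 2)

The hyperparameter \lambda sets the exchange rate between accuracy and simplicity; \lambda = 0 recovers the unregularized loss.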

A Geometric Interpretation

Recall Einstein’s insight that gravity is curvature of spacetime. You can envision such curvature as a ball pulling on a sheet. Here is the gravity well of bodies of the solar system:

Every mass pulls on every other mass! Despite the appearance of the above, Earth does “pull on” Saturn.

The unregularized cost function we saw last time creates a convex loss surface, which we’ll interpret as a gravity well centered around the parameters of best fit. If we replace J with a function that only penalizes complexity, a corresponding gravity well appears, centered around parameters of zero size.

If we keep both terms, we see the loss surface now has two enmeshed gravity wells. If scaled appropriately, the “zero attractor” will pull the most performant solution (here \theta = (8,7)) towards a not-much-worse yet simpler model \theta = (4,5).

More on L1 vs L2

Previously, I introduced the L1 norm, aka mean absolute error (MAE):

\|x\|_1 = (\sum_{i=1}^{n} \lvert x_i\rvert^1)^1

Another loss function is the L2 norm, aka root mean squared error (RMSE):

\|x\|_2 = (\sum_{i=1}^{n} \lvert x_i\rvert^2)^{1/2}

The L1 and L2 norms respectively correspond to Manhattan and Euclidean distance (roughly, car vs plane travel):

One useful way to view norms is by their isosurface. If you can travel in any direction for a fixed amount of time, the isosurface is the frontier you might sketch.

The L2 isosurface is a circle. The L1 isosurface is a diamond.

  • If you don’t change direction, you can travel the “normal” L2 distance.
  • If you do change direction, your travel becomes inefficient (since “diagonal” travel along the hypotenuse is forbidden).

The Lp Norm as Superellipse

Consider again the formulae for the L1 and L2 norm. We can generalize these as special cases of the Lp norm:

\|x\|_p = (\sum_{i=1}^{n} \lvert x_i\rvert^p)^{1/p}

Here are isosurfaces of six exemplars of this norm family:

On inspection, the isosurface inflates towards a square as p increases. In fact, the Lp norm isosurface is a superellipse.

As an aside, note that the boundaries of the Lp norm family operationalize complexity rather “intuitively”. For the L0 norm, complexity is the number of non-zero parameters. For the Linf norm, complexity is the size of the largest parameter.
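
For concreteness, here is a small NumPy sketch of the Lp family, along with the L0 and Linf limiting cases, applied to Model A’s parameters from earlier (an illustrative example, not tied to any particular dataset):

  import numpy as np

  def lp_norm(x, p):
      # (sum_i |x_i|^p)^(1/p)
      return np.sum(np.abs(x) ** p) ** (1.0 / p)

  theta = np.array([4.0, 0.0001, 0.0007, 2.1, 7.0])   # Model A's parameters

  print(lp_norm(theta, 1))            # L1: sum of magnitudes
  print(lp_norm(theta, 2))            # L2: Euclidean length
  print(np.count_nonzero(theta))      # "L0": number of non-zero parameters
  print(np.max(np.abs(theta)))        # Linf: size of the largest parameter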

Lasso vs Ridge Regression

Why the detour into geometry?

Well, so far, we’ve expressed regularization as J = \|e\|_p + \lambda \|\theta\|_p. But most engineers choose between the L1 and L2 norms. The L1 norm is convex but not smooth (it has a sharp corner at zero), which tends to make gradient descent more difficult. But the L1 norm is also more robust to outliers, and has other benefits.

Here are two options for the residual norm:

  • \|e\|_2: sensitive to outliers, but a stable solution
  • \|e\|_1: robust to outliers, but an unstable solution

The instability of \|e\|_1 tends to be particularly thorny in practice, so \|e\|_2 is almost always chosen.

That leaves us with two remaining choices:

  • Ridge Regression: J = \|e\|_2 + \lambda\|\theta\|_2: computationally efficient, but non-sparse output
  • Lasso Regression: J = \|e\|_2 + \lambda\|\theta\|_1: computationally less efficient, but sparse output

What does sparse output mean? For a given model type, say y = ax^3 + bx^2 + cx + d with parameters (a, b, c, d), Ridge regression might output parameters (3, 0.5, 7.8, -0.4) whereas Lasso might give me (3, 0, 7.8, 0). In effect, Lasso regression is performing feature selection: locating parameters that can safely be set to zero, and hence features that can be removed. Why should this be?

Geometry to the rescue!

In ridge regression, the complexity gravity well has a smooth, circular isosurface, so the compromise between the two wells can land anywhere in parameter space. In lasso regression, the diamond-shaped complexity isosurface has corners on the axes, which tends to push the compromise towards points where \theta_i = 0. (In higher dimensions, the same geometry applies.)
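
You can see this behavior empirically. Here is a rough sketch using scikit-learn on synthetic data generated from a cubic whose x^2 and constant terms are zero (the data and the regularization strengths are illustrative assumptions):

  import numpy as np
  from sklearn.linear_model import Ridge, Lasso

  rng = np.random.default_rng(0)
  x = rng.uniform(-2, 2, size=200)
  X = np.column_stack([x**3, x**2, x])                    # features for a*x^3 + b*x^2 + c*x + d
  y = 3*x**3 + 7.8*x + rng.normal(scale=0.5, size=200)    # true b (and intercept d) are zero

  ridge = Ridge(alpha=1.0).fit(X, y)
  lasso = Lasso(alpha=0.1).fit(X, y)

  print(ridge.coef_)   # typically shrinks the x^2 coefficient toward zero, but not exactly to zero
  print(lasso.coef_)   # typically sets the x^2 coefficient exactly to zero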

Both Ridge and Lasso regression are used in practice. The details of your application should influence your choice. I’ll also note in passing that “compromise algorithms” like Elastic Net exist, which try to capture the best parts of both.

Takeaways

I hope you enjoyed this whirlwind tour of regularization. For a more detailed look at ridge vs lasso, I recommend reading this.

Until next time.


Deep Homology: our shared genetic toolkit

Part Of: Biology sequence
Content Summary: 1400 words, 14 min read.

Ernst Mayr once wrote that “the search for homologous genes is quite futile except in very close relatives”. But evolutionary developmental biology (aka evo devo) has turned this common knowledge on its head. All complex animals – flies and flycatchers, dinosaurs and trilobites, flatworms and humans – share a common genetic toolkit of genes that govern the formation and patterning of their bodies and body parts.

Let’s dive in.

Principles of Development

Bodies are not built at random. They are constrained by two important organizing principles.

  1. Bilateral symmetry. The left and right sides of our bodies tend to mirror one another.
  2. Modularity. Our genomes tend to build recurring segments, and then proceed to customize each segment.

Modularity is one of the most important principles in anatomy. In protostomes like worms and centipedes, modules are expressed in repeating segments. In deuterostomes like penguins and humans, modules are expressed in repeated somites (e.g. vertebrae).

Consider the following facts:

  • Trilobite anatomy features many identical legs. In contrast, its descendants (e.g., crayfish) have fewer, and highly specialized appendages.
  • Early teeth in e.g., sharks were numerous and undifferentiated. Contrast this with the horse, which has incisors, canines, premolars, and molars.

Williston’s Law generalizes such observations. Earlier species have many, unspecialized modular repetitions. Over time, there is a trend towards fewer, increasingly specialized parts.

Hox Genes and Localization

The genome contains coding genes, which directly encode proteins, and regulatory genes, which modify the activation profile of those coding genes. Regulatory genes form a regulatory hierarchy, whereby gene activation is controlled by increasingly specific activation profiles. The result is an abstraction hierarchy, akin to the feature hierarchies found in convolutional neural networks. This regulatory system learns how to, e.g., deploy calcium proteins in areas where bone formation is prescribed.

Recall that every cell in an organism contains the exact same DNA. How then does one cell know to become an eye tissue, and another cell knows to become liver tissue? How is cellular differentiation possible?

In order to learn what kind of cell it is, a cell must learn where it is: differentiation requires localization. Roughly speaking, a cell will manufacture eye-specific proteins once it knows that it is located above the nose, and between the ears.

How do cells learn their position? One bit at a time. Per the intension-extension tradeoff, as cells get more location information, their localization window shrinks.

All this is nice in theory, but how does it work in practice?

Just as brain lesions shed light on neuroscience, birth defects shed light on developmental biology. Biologists have been particularly interested in homeotic mutations: mutations that cause body structure to grow in “the wrong place”. Examples include extra fingers in humans, only one central eye in sheep, and legs in the place of eyes in the fruit fly.

A closer look has revealed that homeotic mutations are caused by damage to a specific set of genes: homeobox (Hox) genes. These genes (near the top of the regulatory hierarchy) encode location information, and are conserved across species – the same genes exist in a mouse and a fruit fly:

Hox genes help explain the phenomenon of Williston’s Law (module customization). In arthropod segments, boundaries in Hox gene expression promote customization across different segments:

Bodybuilding & the Genetic Toolkit

In both flies and humans, the very same gene (Pax-6) orchestrates eye development, despite enormous differences in eye phenotypes. Even if you activate this gene in the wing of a fly, that wing will grow eye tissue. When the gene is deactivated, eye formation fails. And if you transplant the fly’s Pax-6 gene into an eyeless mouse, that mouse will regain the ability to grow its eyes.

The Pax-6 gene is an example of a master bodybuilder gene. Here are two other examples from this category:

  1. The DLL “Distal-Less” gene builds appendages. In chickens it builds legs; in fish, fins; in sea squirts, siphons; and in sea urchins, tube feet.
  2. The NK2 “Tinman” gene contributes to the circulatory system. It orchestrates heart development across many different phyla.

There is more to the story than just Hox and bodybuilder genes. Other “master” genes are shared across all animal phyla. These include genes for hormones, those that regulate cell type, those involved in signaling pathways, coloration, receptor mechanisms, and other DNA binding use cases. Together, these genes comprise the genetic toolkit: a set of genes responsible for the development of multicellular organisms.

Explaining the Cambrian Explosion

Often two different species will have a feature in common.  Such facts can be explained in two different ways.

  1. Homology: the feature is shared because it was invented in a common ancestor of both species.
  2. Homoplasy (aka analogy, or convergent evolution): the feature was not derived from a common ancestor; it was invented separately and independently

One example of homology is having four limbs: our tetrapod ancestors were the first to try this new body plan. In contrast, the evolution of wings in birds and bats is an example of homoplasy – the common ancestor of these species was terrestrial. This example is nicely illustrated in a phylogeny:

Protostomes and deuterostomes use the very same Hox and bodybuilder genes. It is very unlikely that the exact same genetic toolkit was constructed twice. The most parsimonious explanation is homology. Their common ancestor, a bilaterally symmetric population called Urbilateria, also possessed this genetic toolkit. Specifically, we can safely conclude that Urbilateria had a toolkit of at least six or seven Hox genes, Pax-6, Distal-Less, Tinman, and a few hundred more bodybuilding genes.

Urbilaterians have not yet been found in the fossil record. However, we can do better than envisioning some featureless worm. We can use our knowledge of their genome to infer their body plans.

Because Pax-6 resides in both branches of bilaterians, Urbilateria probably had some kind of light-sensing organ. Similar inferences from homologous genes add more detail to this portrait. The first bilateral population probably had some form of appendage, a primitive heart, a through-gut with mouth and anus, and a diverse set of cell types (including photoreceptive, nerve, muscle, digestive, secretory, phagocytic, and contractile).

One of the great mysteries of evolutionary biology is the Cambrian Explosion, an adaptive radiation in which dozens of new phyla appear in the fossil record in the span of about 40 million years. The genetic toolkit was fully in place by the time of the Cambrian Explosion. It seems likely that the compilation of the toolkit was an important prerequisite for such a radiation (although ecological factors surely also played a role).

Deep Homology

Consider the evolution of the eye. Evolutionary biologists once thought eyes had evolved independently dozens or even hundreds of times. This remarkable feat of evolution was attributed to strong selective pressure: having a light-sensitive organ just pays off, and the selective pressure is overwhelming enough to induce many species towards the same end product.

But modern genomics has revealed that these “independent inventions” actually derive from the redeployment of the conserved Pax-6 gene.  The diversity of modern eyes is the result of specializations built on top of this basic genetic framework.

More generally, the deep homology hypothesis suggests that the body organization of all bilaterians derives from a substantial swathe of genes that comprise our genetic toolkit. Bilaterians do not invent novel developmental regimes whole-cloth. Rather, once the full toolkit was assembled, changes in phyla occurred via alterations of regulatory circuits.

This principle sharpens how scientists explore new hypotheses. For example, humans are unique among primates for our penchant for vocal mimicry (we can learn how to produce novel sounds). But our more distant relatives (parrots, even seals) practice vocal mimicry. Because so much of our genetic material is conserved, we cannot afford to ignore similarities with even our distant relatives.

Until next time.

Intro to Gradient Descent

Part Of: Machine Learning sequence
Content Summary: 800 words, 8 min read

Parameter Space vs Feature Space

Let’s recall the equation of a line.

y = mx + b where m is the slope, and b is the y-intercept.

The equation of a line is a function that maps from inputs (x) to outputs (y). Internal to that model (the knobs inside the box) reside parameters like m and b that mold how the function works.

Any model k can be uniquely described by its parameters \{ m_k, b_k \}. Just as we can plot data in a data space using Cartesian coordinates (x, y), we can plot models in a parameter space using coordinates (m,b).

As we traverse parameter space, we can view the corresponding models in data space.
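
A tiny Python sketch (hypothetical numbers, chosen only for illustration) may help make the duality concrete:

  import numpy as np

  def line_model(m, b):
      # A single point (m, b) in parameter space defines one full model in data space
      return lambda x: m * x + b

  model_k = line_model(2.0, 1.0)      # the point (2, 1) in parameter space
  x = np.array([0.0, 1.0, 2.0, 3.0])
  print(model_k(x))                   # its predictions in data space: [1. 3. 5. 7.]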


As we proceed, it is very important to hold these concepts in mind.

Loss Functions

Consider the following two regression models. Which one is better?


The answer comes easily: Model A. But as the data volume increases, choosing between models can become quite difficult. Is there a way to automate such comparisons?

Put another way, your judgment about model goodness is an intuition manufactured in your brain. Algorithms don’t have access to your intuitions. We need a loss function that translates intuitions into numbers.

Regression models are functions of the form \hat{y} = f(\textbf{x}) where x is the vector of features (predictors) used to generate the label (prediction). We can define error as \hat{y} - y. In fact, we typically reserve the word error for test data, and residual for train data. Here are the residuals for our two regression models:


The larger the residuals, the worse the model. Let’s use the residual vector to define a loss function. To do this, in the language of database theory, we need to aggregate the column down to a scalar. In the language of linear algebra, we need to compute the length of the vector.

Everyone agrees that residuals matter when deriving the loss function. Not everyone agrees on how to translate the residual vector into a single number. Let me walk through a couple of candidates to motivate the choice:

Sum together all prediction errors.

But then a deeply flawed model with residual vector [ -30, 30, -30, 30] earns the same score as a “perfect model” [0, 0, 0, 0]. The moral: positive and negative errors should not cancel each other out.

Sum together the magnitude of the prediction errors.

But then a larger dataset costs more than a smaller one. A good model against a large dataset with residual vector [ 1, -1, 1, -1, 1, -1 ] earns the same score as a poor model against small data [ 3, -3 ]. The moral: cost functions should be invariant to data volume.

Find the average magnitude of the prediction errors.

This loss function suffers from fewer bugs. It even has a name: Mean Absolute Error (MAE), closely related to the L1-norm.
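
Here is a short sketch of the three candidates, applied to the residual vectors from the examples above (NumPy is an illustrative choice):

  import numpy as np

  def sum_of_errors(residuals):
      return np.sum(residuals)              # candidate 1: signed errors cancel out

  def sum_of_magnitudes(residuals):
      return np.sum(np.abs(residuals))      # candidate 2: grows with data volume

  def mean_absolute_error(residuals):
      return np.mean(np.abs(residuals))     # candidate 3: MAE

  print(sum_of_errors(np.array([-30, 30, -30, 30])))          # 0, same as a perfect model
  print(sum_of_magnitudes(np.array([1, -1, 1, -1, 1, -1])),
        sum_of_magnitudes(np.array([3, -3])))                 # 6 and 6: ties a good model with a poor one
  print(mean_absolute_error(np.array([1, -1, 1, -1, 1, -1])),
        mean_absolute_error(np.array([3, -3])))               # 1.0 and 3.0: the poor model now scores worse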

There are many other valid ways of defining a loss function that we will explore later. I  just used the L1-norm to motivate the topic.

Grading Parameter Space

Let’s return to the question of evaluating model performance. For the following five models, we intuitively judge their performance as steadily worsening:


With loss functions, we convert these intuitive judgments into numbers. Let’s include these loss numbers in our dual-space picture, and encode them as color.


Still with me? Something important is going on here.

We have examined the loss of five models. What happens if we evaluate two hundred different models? One thousand? A million? With enough samples, we can gain a high-resolution view of the loss surface. This loss surface can be expressed with loss as color, or loss as height along the z-axis.


In this case, the loss surface is convex: it is in the shape of a bowl.

Navigating the Loss Surface

The notion of a loss surface takes a while to digest. But it is worth the effort. The loss surface is the reason machine learning is possible.

By looking at a loss surface, you can visually identify the global minimum: the model instantiation with the least loss. In our example above, that is the point (7, -14), which encodes the model y = 7x - 14 with the smallest loss L(\theta) = 2.

Unfortunately, computing the loss surface is computationally intractable. It takes too long to calculate the loss of every possible model. How can we do better?

  1. Start with an arbitrary model.
  2. Figure out how to improve it.
  3. Repeat.

One useful metaphor for this kind of algorithm is a flashlight in the dark. We can’t see the entire landscape, but our flashlight provides information about our immediate surroundings.


But what local information can we use to decide where to move in parameter space? Simple: the gradient (i.e., the slope)! If we move downhill in this bowl-like surface, we will come to rest at the set of best parameters.

A ball rolling down a hill.

This is how gradient descent works, in both spaces:

This is how prediction machines learn from data.
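
Here is a minimal gradient descent sketch for the running example, fit to synthetic data generated near the best model y = 7x - 14 (the data, learning rate, and iteration count are illustrative assumptions). It uses mean squared error so the gradient is smooth; the L1 loss from earlier works similarly via subgradients.

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.uniform(-5, 5, size=100)
  y = 7 * x - 14 + rng.normal(scale=1.0, size=100)   # data whose best-fit model is near y = 7x - 14

  m, b = 0.0, 0.0                 # 1. start with an arbitrary model
  learning_rate = 0.01
  for _ in range(2000):           # 3. repeat
      residuals = (m * x + b) - y
      grad_m = 2 * np.mean(residuals * x)   # 2. improve it: step downhill along the gradient
      grad_b = 2 * np.mean(residuals)
      m -= learning_rate * grad_m
      b -= learning_rate * grad_b

  print(m, b)                     # lands near (7, -14), the bottom of the bowl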

Until next time.

Bias vs Variance

Part Of: Machine Learning sequence
Followup To: Regression vs Classification
Content Summary: 800 words, 8 min read

A Taxonomy of Models

Last time, we discussed how to create prediction machines. For example, given an animal’s height and weight, we might build a prediction machine to guess what kind of animal it is. While this sounds complicated, in this case prediction machines are simple region-color maps, like these:


These two classification models are both fairly accurate, but differ in their complexity. 

But it’s important to acknowledge the possibility of erroneous-simple and erroneous-complex models. I like to think of models in terms of an accuracy-complexity quadrant.


This quadrant is not limited to classification. Regression models can also vary in their accuracy, and their complexity.


A couple brief caveats before we proceed.

  • This quadrant concept is best understood as a two-dimensional continuum, rather than a four-category space. More on this later.
  • Here “accuracy” tries to capture lay intuitions about prediction quality & performance. I’m not using it in the metric sense of “alternative to F1 score”.

Formalizing Complexity

Neural networks are often used as classification models over large collections of images. The complexity of these models tends to correlate with the number of layers. For some models, then, complexity is captured by the number of parameters.

While not used much in the industry, polynomial models are pedagogically useful examples of regression models. Here, the degree of the polynomial expresses the complexity of the model: a degree-eight polynomial has more “bumps” than a degree-two polynomial.

Consider, however, the difference between the following regression models:

y_A = 4x^4 + 0.0001x^3 + 0.0007x^2 + 2.1x + 7
y_B = 4x^4 + 2.1x + 7

Model A uses five parameters; Model B uses three. But their predictions are, for all practical purposes, identical. Thus, the size of each parameter is also relevant to the question of complexity.

The above approaches rely on the model’s parameters (its “visceral organs”) to define complexity. But it is also possible to rely on the model’s outputs (its “behaviors”) to achieve the same task. Consider again the classification decision boundaries above. We can simply measure the spatial frequency (the “squiggliness” of the boundary) as another proxy towards complexity.

Here, then, are three possible criteria for complexity:

  1. Number of parameters
  2. Size of parameters
  3. Spatial frequency of decision manifold

Thus, operationalizing the definition of “complexity” is surprisingly challenging. But there is another way to detect whether a model is too complex…

Simplicity as Generalizability

Recall our distinction between training and prediction:


We compute model performance on historical data. We can contrast this with model performance against future data.


Take a moment to digest this image. What is it telling you?

Model complexity is not merely aesthetically ugly. Rather, complexity is the enemy of generalization. Want to future-proof your model? Simplicity might help!

Underfitting vs Overfitting

There is another way of interpreting this tradeoff, that emphasizes the continuity of model complexity. Starting from a very simple model, increases in model complexity will improve both historical and future error. The best response to underfitting is increasing the expressivity of your model.

But at a certain point, your model will become too complex, and begin to overfit the data. At that point, your historical error will continue to decrease, but your future error will increase.
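
A quick sketch with synthetic data (all numbers here are illustrative assumptions) should show the pattern numerically: training error typically keeps shrinking as the polynomial degree grows, while test error eventually rises.

  import numpy as np

  rng = np.random.default_rng(0)
  x_train = rng.uniform(-1, 1, size=20)
  y_train = np.sin(3 * x_train) + rng.normal(scale=0.2, size=20)   # "historical" data
  x_test = rng.uniform(-1, 1, size=200)
  y_test = np.sin(3 * x_test) + rng.normal(scale=0.2, size=200)    # "future" data

  for degree in (1, 3, 9, 15):
      coeffs = np.polyfit(x_train, y_train, degree)
      train_err = np.mean(np.abs(np.polyval(coeffs, x_train) - y_train))
      test_err = np.mean(np.abs(np.polyval(coeffs, x_test) - y_test))
      print(degree, round(train_err, 3), round(test_err, 3))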


Data Partitioning: creating a Holdout Set

So you now appreciate the importance of striking a balance between accuracy and simplicity. That’s all very nice conceptually, but how might you go about building a well-balanced prediction machine?

The bias-variance trade-off only becomes apparent when the machine is given new data! “If only I had practiced against unseen test data earlier”, the statistician might say, “then I could have discovered how complex to make my model before it was too late”.

Read the above regret again. It is the germinating seed of a truly enormous idea.

Many decades ago, some creative mind took the above regret and sought to reform it: “What stops me from treating some of my old, pre-processed data as if it were new? Can I not hide data from myself?”


This approach, known as data partitioning, is now ubiquitous in the machine learning community.  Historical-Known data is the training set, Historical-Novel data is the test set, aka the holdout set.

How much data should we put in the holdout set? While the correct answer ultimately derives from the particular application domain, a typical rule of thumb:

  • On small data (~100 thousand records), data are typically split to 80% train, 20% test
  • On large data (~10 billion records), data are typically split to 95% train, 5% test
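
In code, the partition is typically one line. Here is a sketch using scikit-learn’s train_test_split (the data is a stand-in):

  import numpy as np
  from sklearn.model_selection import train_test_split

  X = np.arange(1000).reshape(-1, 1)   # stand-in feature matrix
  y = np.arange(1000)                  # stand-in labels

  # Hide 20% of the historical data from ourselves: that 20% becomes the holdout (test) set
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  print(len(X_train), len(X_test))     # 800 train, 200 test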

Next time, we will explore cross-validation (CV). Cross-validation is sometimes used instead of, and other times in addition to, data partitioning.

See you then!

Why are humans ecologically dominant?

Part Of: Demystifying Culture sequence
Content Summary: 1100 words, 11 min read

Ecological Dominance

Compared to earlier Homo species such as Homo erectus, sapiens are uniquely ecologically dominant. The emergence of hunter-gatherers out of Africa 70,000 years ago caused:

  • The extermination of hundreds of megafauna species (more than 90%)
  • Dwarfing of the surviving species.
  • A huge increase in the frequency and impact of fire (we used fire to reshape ecosystems to our liking)

12,000 years ago, we began domesticating animals and plants. The subsequent agricultural revolution unlocked powerful new ways to acquire energy, which in turn increased our species’ population density.

  • 9000 BCE: 5 million people
  • 1 CE: 300 million people
  • 2100 CE: 11,000 million people

200 years ago, the industrial revolution was heralded by the discovery of energy transduction: that electricity can be used to run a vacuum, or freeze meat products.

This population explosion correlates with a hefty ecological footprint:

  • We have altered more than one-third of the earth’s land surface.
  • We have changed the flow of two-thirds of the earth’s rivers.
  • We use 100 times more biomass than any large species that has ever lived.
  • If you include our vast herds of domesticated animals, we account for more than 98% of terrestrial vertebrate biomass.


Three Kinds of Theories

As with any other species, the scientist must explain how ours has affected the ecosystem. We can do this by examining how our anatomies and psychologies differ from other animals, and then consider which of these human universals explain our ecological dominance.

Pound for pound, other primates are approximately twice as strong. We also lack the anatomical weaponry of our cousins; for example, our canines are much less dangerous.

So, strength cannot explain our dominance. Three other candidate theories tend to recur:

  1. We are more intelligent and creative. Theories of this sort focus on e.g., the invention of Mode 3 stone tools.
  2. We are more cooperative and prosocial. Theories of this sort focus on e.g., massively cooperative hunting expeditions.
  3. We accumulate powerful cultural adaptations. Theories of this sort focus on e.g., how Inuit technology became uniquely adaptive for their environment.

Let’s take a closer look!

Intelligence-Based Theories

Is intellect the secret for our success? Consider the following theories:

First, generative linguists like Noam Chomsky argue that language is not about communication: recursion is an entirely different means of cognition; the root of our species’ creativity. To him, the language instinct (as a genetic package) appeared abruptly at 70 kya, and transformed the mind from a kluge of instincts to a mathematical, general-purpose processor. Language evolution is said to coincide with the explosion of technology called behavioral modernity.

Second, evolutionary psychologists like Leda Cosmides & John Tooby advocate the massive modularity hypothesis: the mind isn’t a general-purpose processor; it is instead more like a Swiss army knife. We are not more intelligent because we have fewer instincts, but because we have more. Specifically, we accrued hundreds of hunter-gatherer instincts in the intervening millennia, and these instincts give us our characteristically human flexibility.

Third, social anthropologists like David Lewis-Williams argue that a change in consciousness made us more intelligent. We are the only species with animistic spirituality, which he attributes to numinous experiences. These altered states of consciousness were byproducts of our consciousness machinery rearranging itself. Specifically, he invokes Dehaene’s theory that while all mammals experience primary consciousness, only sapiens have second-order consciousness (awareness of their own awareness). This was allegedly the event that produced fully modern language.

Sociality-Based Theories

Is sociality the secret for our success? Consider the following theories:

First, sociobiologists like Edward O. Wilson think that the secret of our success is group selection: vigorous between-group warfare created selective pressure for within-group cooperation. As our ethnic psychology (and specifically, ethnocentrism) became more pronounced, sapien tribes began behaving much like superorganisms. A useful analogy is eusocial insects like ants, which are arguably even more ecologically dominant than humans.

Second, historians like Yuval Harari think that mythology (fictional orders) is the key ingredient enabling humans to act cooperatively. Political and economic phenomena don’t happen in a vacuum: they are caused by certain ideological commitments, e.g., nationalism and the value of a currency. To change our myths is to refactor the social structure of our society.

Culture-Based Theories

Is culture the secret for our success? Consider the following theory:

Anthropologists like Richerson, Boyd, and Henrich argue that cumulative cultural knowledge comprises a dual-inheritance system, and propose a theory of gene-culture coevolution. They argue that an expanding collective mind gave individuals access to unparalleled know-how. This in turn emboldened our niche-stealing proclivities: “like the spiders, hominins could trap, snare, or net their prey; but the latter could also ambush, excavate, expose, entice, corral, hook, spear, preserve, or contain a steadily enlarging range of food types.” Socially-learned norms induce our cooperation, and socially-learned thinking tools explain our intelligence.

My Take

Contra Chomsky,

Contra Cosmides & Tooby:

  • I agree wholeheartedly with the massive modularity hypothesis. It accords well with modern cognitive neuroscience.
  • While selection endowed us with hunter-gatherer instincts (e.g., folk biology), I don’t think such instincts provide sufficient explanatory power.

Contra David Lewis-Williams:

  • I need hard evidence showing that animals never hallucinate, before appropriating numinous experiences as a human universal.
  • Global Workspace Theory (GWT) enjoys better empirical support than integrated information theory.
  • I don’t understand the selective pressure or mechanistic implications for changes to our conscious machinery.

Contra sociality-first theories:

  • Group selection is still immersed in controversy, especially regarding the free-rider problem.
  • Why must myths be the causal first movers? Surely other factors matter more.

My own thinking most closely aligns with culture-based explanations of our ecological dominance. This sequence will try to explicate this culture-first view.

But at present, culture-first theories leave several questions unanswered:

  • What, specifically, is the behavioral and biological signature of a social norm? For now, appeals to norm psychology risk explaining too much.
  • How did our species (and our species alone) become psychologically equipped to generate cumulative culture?
  • If erectus was a cultural creature, why did the rate of technological innovation so dramatically change between erectus and sapiens?

Someday I hope to explore these questions too. Until then.

References

  1. Tim Flannery. The Future Eaters
  2. David Lewis-Williams. The Mind in The Cave.
  3. Yuval Harari. Sapiens.
  4. Henrich, The Secret of Our Success

The Evolution of Faith

Part Of: Demystifying Culture sequence
Content Summary: 1200 words, 12 min read

Context

Recall that human beings have two different vehicles for learning:

  • Individual Learning: using personal experiences to refine behavioral techniques, and build causal models of how the world works.
  • Social Learning: using social interactions to learn what other people have learned.

Today, we will try to explain the following observations:

  • Most cultural traditions have adaptive value.
  • This value typically cannot be articulated by practitioners.

Why should this be the case?

Example 1: Manioc Detoxification

Consider an example of food preparation, provided by Joseph Henrich:

In the Colombian Amazon, a starchy tuber called manioc has lots of nutritional value, but also releases hydrogen cyanide when consumed. If eaten unprocessed, manioc can cause chronic cyanide poisoning. Because it emerges only gradually after years of consuming manioc that tastes fine, chronic poisoning is particularly insidious, and has been linked to neurological problems, paralysis of the legs, thyroid problems, and immune suppression.

Indigenous Tukanoans use a multistep, multiday processing technique that involves scraping, grating, and finally washing the roots in order to separate the fiber, starch, and liquid. Once separated, the liquid is boiled into a beverage, but the fiber and starch must then sit for two more days, when they can then be baked and eaten. Chemical analyses confirm that each major step in the processing is necessary to remove cyanogenic content from the root. [5]

Yet consider the point of view of a woman learning such techniques. She may never have seen anyone get cyanide poisoning, because the techniques work. And she would be required to spend about four hours per day detoxifying manioc. [4]

Consider what might result if a self-reliant Tukanoan mother decided to drop seemingly unnecessary steps from the processing of her bitter manioc. She might critically examine the procedure handed down to her from earlier generations and conclude that the goal of the procedure is to remove the bitter taste. She would quickly find that with the much less labor-intensive process of boiling, she could remove the bitter taste. Only decades later her family would begin to develop the symptoms of chronic cyanide poisoning.

Here, the willingness of the mother to take on faith received cultural practices is the only thing preventing the early death of her family. Individual learning does not pay here; after all, it can take decades for the effects of the poison to manifest. Manioc processing is causally opaque.

The detoxification of dozens of other food products (corn, nardoo, etc) are similarly inscrutable. In fact, history is littered with examples of European explorers imperfectly copying indigenous food processing techniques, and meeting gruesome ends.

Example 2: Pregnancy Taboos

Another example, again from Henrich:

During pregnancy and breastfeeding, women on Fiji adhere to a series of food taboos that selectively excise the most toxic marine species from their diet. These large marine species, which include moray eels, barracuda, sharks, rock cod, and several large species of grouper, contribute substantially to the diet in these communities; but all are known in the medical literature to be associated with ciguatera poisoning.

This set of taboos represents a cultural adaptation that selectively targets the most toxic species in women’s usual diets, just when mothers and their offspring are most susceptible. [2] To explore how this cultural adaptation emerged, we studied both how women acquire these taboos and what kind of causal understandings they possess. Fijian women use cues of age, knowledge, and prestige to figure out from whom to learn their taboos. [3] Such selectivity alone is capable of generating an adaptive repertoire over generations, without anyone understanding anything.

We also looked for a shared underlying mental model of why one would not eat these marine species during pregnancy or breastfeeding: a causal model or set of reasoned principles. Unlike the highly consistent answers on what not to eat and when, women’s responses to our why questions were all over the map. Many women simply said they did not know and clearly thought it was an odd question. Others said it was “custom.” Some did suggest that the consumption of some of the species might result in harmful effects to the fetus, but what precisely would happen to the fetus varied greatly: many women explained that babies would be born with rough skin if sharks were eaten and smelly joints if morays were eaten.

These answers are blatant rationalizations: “since I’m being asked for a reason, let me try to think one up now”.  The rationale for a taboo is not perceived by its adherents. This is yet another example of competence without comprehension.

A Theory of Overimitation

Human beings exhibit overimitation: a willingness to adopt complex practices even if many individual steps are inscrutable. Overimitation requires faith, defined here as a willingness to accept information in the absence of (or even contrasting with) your personal causal model.

We have replicated this phenomenon in the laboratory. First, present a puzzle box to a child, equipped with several switches, levers, and pulleys. Then have an adult teach the child how to open the box and get the treat inside. If the “solution” involves several useless procedures e.g., tapping the box with a stick three times, humans will imitate the entire procedure. In contrast, chimpanzees ignore the noise, and zoom in on the causally efficacious steps.

Why should chimpanzees outperform humans in this experiment? Chimpanzees don’t share our penchant for mimicry. Chimpanzees are not gullible by default. They must try to parse the relevant factors using the gray matter between their ears.

Humans fare poorly in such tests, because these particular opaque practices are in fact useless. But more often in our prehistory, inscrutable practices were nevertheless valuable. We are born to go with the flow.

In a species with cumulative culture, and only in such a species, faith in one’s cultural inheritance often yields greater survival and reproduction.

Is Culture Adaptive? Mostly.

We humans do not spend much time inspecting the content of our cultural inheritance. We blindly copy it. How then can cultural practices be adaptive?

For the same reason that natural selection produces increasingly sophisticated body plans. Communities with effective cultural practices outcompete their neighbors.

Overimitation serves to bind cultural practices together into holistic traditions. This makes another analogy to natural selection apt:

  • Genes don’t die, genomes die. Natural selection transmits an error signal for an entire genetic package.
  • Memes don’t die, traditions die. Cultural selection transmits an error signal for an entire cultural package.

Just as genomes can host individual parasitic elements (e.g., transposons), so too cultural traditions can contain maladaptive practices (e.g., dangerous bodily modifications). As long as the entire cultural tradition is adaptive, dangerous ideas can persist undetected in a particular culture.

Does Reason Matter? Yes.

So far, this post has been descriptive. It tries to explain why sapiens are prone to overimitation, and why faith is an adaptation.

Yet individual learning matters. Without it, culture would replicate but not improve. Reason is the fuel of innovation. We pay attention to intelligent, innovative people because of another cultural adaptation: prestige.

Perhaps the powers of the lone intellect are less stupendous than you were brought up to believe.

But we need not be slaves to either our cultural or our genetic inheritance. We can do better.

Related Resources

  1. Henrich (2016). The Secret Of Our Success.
  2. Henrich & Henrich (2010). The evolution of cultural adaptations: Fijian food taboos protect against dangerous marine toxins
  3. Henrich & Broesch (2011). On the nature of cultural transmission networks: evidence from Fijian villages for adaptive learning biases
  4. Dufour (1984). The time and energy expenditure of indigenous women horticulturalists in the northwest Amazon. 
  5. Dufour (1994). Cassava in Amazonia: Lessons in utilization and safety from native peoples.

[Excerpt] Jesus as Apocalyptic Prophet

Excerpt From: Blog Post
Content Summary: 1800 words, 9 min read

I agree with mainstream scholarship on the historical Jesus (e.g., E.P. Sanders, Geza Vermes, Bart Ehrman, Dale Allison, Paula Fredriksen, et al.) that Jesus was a failed apocalyptic prophet. Such a hypothesis, if true, would be a simple one that would make sense of a wide range of data, including the following twenty:

  1. John the Baptist preached a message of repentance to escape the imminent judgment of the eschaton. Jesus was his baptized disciple, and thus accepted his message — and in fact preached basically the same message.
  2. Jesus’ Son of Man passages are allusions to the son of man figure in Daniel 7:13-14 and Enoch ch 37-71 (both texts were widely discussed in first century Palestine). This figure was an end of the world arbiter of God’s justice, and Jesus kept preaching that he was on his way (e.g., “From now on, you will see the Son of Man sitting at the right hand of Power, and coming on the clouds of heaven.” Matt. 26:64).
  3. The earliest canonical writings are Paul’s letters: Paul taught of an imminent eschaton, and his wording mirrors the end-time passages in the synoptics (especially the so-called “Little Apocalypse” in Mark, and the subsequently-written parallels in Matthew and Luke).
  4. Many passages depict Jesus predicting the end within his generation.
    • “The time is fulfilled, and the kingdom of heaven is at hand. Repent and believe the good news” (Mark 1:15)
    • “This generation will not pass away until all these things take place” (Mark 13:30)
    • “You will not finish going through the cities of Israel until the Son of Man comes” (Matthew 10:23)
    • “There are some of those who are standing here who will not taste death until they see the kingdom of God after it has come with power.” (Mark 9:1)
    • “From now on, you shall see the Son of Man coming in the clouds” (Matt 26:64)
  5. A sense of urgency permeates the gospels and the other NT writings. For example:
    • The disciples must hurry to send the message to the cities of Israel before Daniel’s “Son of Man” comes
    • Jesus’ statement that even burying one’s parents has a lower priority
    • Paul telling the Corinthians not to change their current state, since it’s all about to end (e.g., don’t seek marriage, or to leave one’s slave condition, etc., since the end of all things is at hand)
  6. Relatedly, Jesus and Paul taught a radical “interim ethic” (e.g., don’t divorce, radical forgiveness, don’t judge others, love one’s enemies, etc.). This makes sense if they believed that the eschaton would occur within their generation, and that all needed to repent and prepare for its arrival.
  7. Jesus had his disciples leave everything and follow him around. This makes sense if Jesus believed that he and they were to be God’s final messengers before the eschaton.
  8. Jesus gathered twelve disciples, which is the number of the twelve tribes of Israel. He also said they were to sit on twelve thrones and serve as judges of the twelve tribes of Israel. This reflects the common expectation that at the end of days, all twelve tribes would return to the land. The twelve are a symbolic representation of restored Israel.
  9. There is a clear pattern of a successive watering down of Jesus’ prediction of the eschaton within the generation of his disciples, starting with Mark (widely believed among NT scholars to be the first gospel written), and continuing through the rest of the synoptic gospels. By the time we get to John, the last gospel written, the eschatological “kingdom of God” talk is dropped (except for one passage, and it no longer has clear eschatological connotations), along with the end-time predictions, and is replaced with “eternal life” talk. Further, the epistles presuppose that the early church thought Jesus really predicted the end within their lifetimes. Finally, this successive backpedaling continues beyond the NT writings and into those of the apocrypha and the early church leaders, even to the point where some writings attribute an anti-apocalyptic message to Jesus. All of these things make perfect sense if Jesus really did make such a prediction, and the church needed to reinterpret his message in light of the fact that his generation passed away, yet the eschaton never came.
  10. Jesus’ base followers were all considered to represent the “bottom” of society in his day: the poor, sinners, prostitutes, outcasts, tax collectors, lepers, and the demon-possessed. This is perfectly in line with the standard apocalyptic doctrine of the reversal of fortunes when the kingdom of God comes: “the first shall be last, and the last shall be first”.
  11. Jesus performed many exorcisms, which he claimed marked the inbreaking of the kingdom of God on Earth. They were thus signs of the imminent apocalypse. Satan and his minions were being cast out of power, and God’s power was taking its place.
  12. Jesus’ trip to Jerusalem for the Passover Celebration, and his subsequent activities there, are best explained in terms of his apocalyptic message and his perceived role in proclaiming it. Jesus went to the temple during the Passover Festival, and spent many days teaching about his apocalyptic message of the imminent coming kingdom of God. The apocalyptic message included the idea that the temple in Jerusalem would also be destroyed.
  13. Jesus caused a disturbance in the temple itself, which appears to have been a symbolic enactment of his apocalyptic teaching about the temple’s destruction.
  14. Jesus’ betrayal by Judas Iscariot, and Jesus’ subsequent arrest, is best explained in terms of Judas’ betraying to the religious authorities (the Sadducees and the chief priests) Jesus’ teaching (to his inner circle of disciples) that he would be the King of the Jews in the coming Kingdom of God.
  15. Jesus was executed on the charge of political sedition, due to his claim that he was the King of the Jews. His execution was therefore directly related to his apocalyptic message of the imminent coming of the kingdom of God.
  16. The fact that not just all New Testament authors, but the early church as a whole, believed the end would occur in their generation makes perfect sense if Jesus really did make such claims.
  17. The passages that attribute these predictions to Jesus and Paul satisfy the historical criteria of multiple attestation (and forms), embarrassment, earliest strata (Mark, Q, M, L, Paul’s earliest letters, the ancient “Maranatha” creed/hymn) etc., thus strongly indicating that these words go back to the lips of Jesus.
  18. Jesus’ parables: virtually all explicitly or implicitly teach a message about an imminent eschaton.
  19. Jesus’ “inversion” teachings (e.g., “The first shall be last, and the last shall be first”): a common theme among Jewish apocalypticists generally. The general message of apocalypticists is that those who are evil and defy God will not get away with it forever. The just are trampled, and the unjust prosper; thus, this situation needs to be inverted – as it will be when the “Son of Man” from the book of Daniel comes to exact God’s judgment at any moment.
  20. The earliest Christians believed that Jesus’ putative resurrection was (to use Paul’s terminology) the “first fruits” of the general resurrection of the dead at the end of time. This is an agricultural metaphor. When farmers reaped and ate the first fruits of the harvest, they would then reap the full harvest the very next day — the “general” harvest was “imminent”, as it was “inaugurated” with the reaping of the first-fruits. Similarly, the earliest Christians believed that the final judgement and the general resurrection were imminent, given their belief that Jesus’ resurrection was itself the inaugurating event of the general resurrection and the end of all things. Thus, there is a continuity between the beliefs of the early Christians and the beliefs of many Jews of his time: Jesus’ resurrection was fundamentally construed in these eschatological terms

And so, no matter which way you slice it, the “statute of limitations” has run out on Jesus and his apostles’ claim of an imminent end, within a single generation.

It needs to be emphasized that this line of reasoning isn’t controversial among mainstream, middle-of-the-road NT critics. I’m not talking about a view held by the Jesus Seminar, or earlier “radical” form and redaction critics like Norman Perrin. Rather, I’m talking about the kinds of considerations that are largely accepted by moderates who are also committed Christians, such as Dale Allison and John P. Meier. Indeed, conservative scholars of the likes of none other than Ben Witherington and N.T. Wright largely admit this line of reasoning. Why are they still Christians, you ask? I’ll tell you: by giving unnatural, ad hoc explanations of the data. For example,

  1. Meier gets around the problem by arguing that the false prediction passages are inauthentic (i.e., Jesus never said those things; the early church just put those words on the lips of Jesus, and they ended up in the gospels).
  2. Witherington gets around the problem by saying that what Jesus really meant was that the imminent arrival of the eschatological kingdom might be at hand(!)
  3. Wright gets around the problem by adopting the partial preterist line that the imminent end that Jesus predicted really did occur — it’s just that it was all fulfilled with the destruction of Jerusalem.
    1. Oh, really? So are we also to think that since he’s already come again, he’s not coming back? Or perhaps there will be a third coming?
    2. And why does Paul tell various communities very far outside of Israel about the same sorts of predictions of an imminent end that would affect them — one that, like the one Jesus talked about, involved judgement, destruction, and the gathering of all the elect?

Are you convinced by these responses? Me neither. And now you know why nobody outside of orthodox circles buys them, either.

To all of this, I say what should be obvious: you know, deep in your gut (don’t you?) that such responses are unnatural, ad hoc dodges of what we know to be the truth here: Jesus really did predict the end within the lifetime of his disciples, but he was simply wrong.

This isn’t about some remark Jesus said in passing.  It was his central message: “Repent, for the kingdom of heaven is at hand!”

Putting it all together, we get the following argument for Jesus as a failed apocalyptic prophet:

  • Let H1 be the hypothesis that Jesus was a failed apocalyptic prophet of an imminent eschaton.
  • Let H2 be the hypothesis that Jesus is the Son of God of orthodox Christianity.
  • Let D1-20 be the data sketched above.

Then the argument can be expressed as follows:

  1. H1 is a better explanation of D1-20 than H2.
  2. If H1 is a better explanation of D1-20 than H2, then H1 is more probable than H2.
  3. Therefore, H1 is more probable than H2.