Strangers To Ourselves

Part Of: Sociality sequence
Followup To: Intro to Confabulation
Content Summary: 2000 words, 20 min read

We do not have direct access to our mental lives. Rather, self-perception is performed by other-directed faculties (i.e., mindreading) being “turned inwards”. We guess at our own intentions in exactly the same way we guess at the intentions of others.

Self-Knowledge vs Other-Knowledge

The brain is organized into perception-action cycles, with decisions mediating these streams.  We can represent this thesis as a simple cartoon, which also captures the abstraction hierarchy (concrete vs abstract decisions) and the two loop hypothesis (world vs body).

Agent files are the mental records we maintain about our relationships with people. Mindreading denotes the coalition of processes that attempt to reverse engineer the mental state of other people: their goals, their idiosyncratic mental states, and even their personality. Folk psychology contrasts this interpretive method of understanding other people with our ability to understand ourselves. 

We have powerful intuitions that self-understanding is fundamentally different than other-understanding. The Cartesian doctrine of introspection holds that our mental states and mechanisms are transparent; that is, directly accessible to us. It doesn’t matter which mental system generates the attitude, or why it does so – we can directly perceive all of this. 

Our Unconscious Selves

Cartesian thinking has fallen out of favor. Why? Because we discovered that most mental activity happens outside of conscious awareness.

A simple example should illustrate. When we speak, the musculature in our vocal tract contorts in highly specific ways. Do you have any idea which muscles move, and in which direction, when you speak? No: you are merely conscious of the high-level desire. The way those instructions are cashed out into more detailed motor commands is opaque to you.

The first movement against transparency was Freud, who championed a repression hypothesis: that unconscious beliefs are too depraved to be admitted to consciousness. But, after a brief detour through radical behaviorism, modern cognitive psychology tends to avow a plumbing hypothesis: that unconscious states are too complex (or not sufficiently useful) to merit admission to consciousness.

The distinction between unconscious and conscious processes can feel abstract, until you grapple with the limited capacity of consciousness. Why is it possible to read one book, but not two, simultaneously? Why can most of us remember a new phone number, but not the first twenty digits of pi, fifteen minutes after exposure?

The ISA Theory

The Interpretive Sensory-Access (ISA) theory holds that our conscious selves are completely ignorant of our own mental lives save for the mindreading faculty. That is, the very same faculty used in our social interactions also constructs models of ourselves. 

It is important to realize that the range of perceptual data available for self-interpretation is larger than the range available for interpreting other people. For both types of mindreading, we have perceptual data on overt behavior. In the case of self-mindreading, we also have access to our subvocalizations (inner speech) and, more generally, to the low-capacity contents of the global broadcast.

Perhaps our mindreading faculties are more accurate about ourselves, given that they have more data on which to construct a self-narrative.

The ISA theory explains the behavior-identity bootstrap; i.e., why the “fake it until you make it” proverb is apt. By acting in accordance with a novel role (e.g., helping the homeless more often), we gradually begin to become that person (e.g., resonating to the needs of others more powerfully in general). 

Theses, Predictions, Evidence

The ISA theory can be distilled into four theses:

  1. There is a single mental faculty underlying our attributions of propositional attitudes, whether to ourselves or to others
  2. That faculty has only sensory access to its domain
  3. Its access to our attitudes is interpretive rather than transparent
  4. The mental faculty in question evolved to sustain and facilitate other-directed forms of social cognition. 

The ISA theory is testable. It generates the following predictions:

  1. No non-sensory awareness of our inner lives
  2. There should be no substantive differences in the development of a child’s capacities for first-person and third-person understanding. 
  3. There should be no dissociation between a person’s ability to attribute mental states to themselves and to others. 
  4. Humans should lack any form of deep and sophisticated metacognitive competence. 
  5. People should confabulate promiscuously. 
  6. Any non-human animal capable of mindreading should be capable of turning its mindreading abilities on itself. 

These predictions are largely borne out in experimental data:

  1. Introspection-sampling studies suggest that some people believe themselves to experience non-sensory attitudes. These data are hard for the ISA theory to accommodate. But they are also hard for introspection-based theories to reconcile: if we had transparent access to our attitudes, why do some people only experience them with a sensory overlay?
  2. Wellman et al (2001) conducted a meta-analysis of well over 100 pairs of experiments in which children had been asked both to ascribe a false belief to another person and to attribute a previous false belief to themselves. It found no significant difference in performance, even at the youngest ages tested. 
  3. Other theorists (e.g., Nichols & Stich 2003) claim that autism exemplifies deficits in other-knowledge but not self-knowledge, and that schizophrenia impairs self-knowledge but not other-knowledge. But on inspection, these claims have weak, if not nonexistent, empirical support. These syndromes injure both forms of knowledge.
  4. Transparent self-knowledge should entail robust metacognitive competencies. But we lack them. For example, the correlation between people’s judgments of learning and their later recall is not very strong (Dunlosky & Metcalfe 2009). 
  5. The philosophical doctrine of first-person authority holds that we cannot hold false beliefs about our mental lives. The robust phenomenon of confabulation discredits this hypothesis (Nisbett & Wilson 1977). We are allergic to admitting “I don’t know why I did that”; rather, we invent stories about ourselves without realizing their contrived nature. I discuss this form of “sincere dishonesty” at length here.
  6. Primates are capable of desire mindreading, and their behavior is consistent with their possessing some rudimentary forms of self-knowledge.

The ISA theory thus receives ample empirical confirmation.

Competitors to ISA Theory

There are many competitors to the ISA account. For the below, we will use attitude to denote non-perceptual mental representations such as desires, goals, reasons and decisions. 

  1. Source tagging theories (e.g., Rey 2013) hold that, whenever the brain generates a new attitude, the generating system(s) add a tag indicating their source. Whenever that representation is globally broadcast, our conscious selves can inspect the tag to view its origin. 
  2. Attitudinal working memory theories (e.g., Fodor 1983, Evans 1982) hold that, in addition to a perception-based working memory system, there is a separate faculty to broadcast conscious attitudes and decisions. 
  3. Constitutive authority theories (e.g., Wilson 2002, Wegner 2002, Frankish 2009) admit that conscious events (e.g., saying “I want to go to the store”) do not directly cause action. However, we do attribute these utterances to ourselves, and the subconscious metanorm I DESIRE TO REALIZE MY COMMITMENTS works to translate these conscious self-attributions into unconscious action programs. 
  4. Inner sense theories hold that, as animal brains increased in complexity, there was increasing need for cognitive monitoring and control. To perform that adaptive function, the faculty of inner sense evolved to generate metarepresentations: representations of object-level computational states. There are three important flavors of this theory.

But there are data speaking against these theories:

  1. Contra source tagging, the source-monitoring literature shows that people simply don’t have transparent access to the sources of their memory images. For example, Henkel et al (2000) required subjects to either see, hear, imagine as seen, or imagine as heard a number of familiar events, such as a basketball bouncing. When asked later, people frequently misremembered which of these four media had produced their memory. 
  2. The capacity limits of sensory-based working memory explain nearly the entire phenomenon of fluid g, also known as IQ (Colom et al 2004). If attitudinal working memory evolved alongside this system, it is hard to explain why it doesn’t contribute to fluid intelligence scores. 

More tellingly, however, each of the above theories fails to explain confabulation data. Most inner sense theories today (e.g., Goldman 2006) adopt a dual-method stance: when confabulating, people are using mindreading; otherwise, they are using transparent inner sense. But as an auxiliary hypothesis, dual-method theories fail to explain the patterning of when a person will make correct versus incorrect self-attributions. 

Biased ISA Theory

The ISA theory holds self-knowledge to be grounded in sparse but unbiased perceptual knowledge. But this does not seem to be the whole story. For we know that we are prone to overestimate the good qualities of the Self and Us, and to overestimate the bad qualities of the Other and Them. 

For example, the fundamental attribution error describes the tendency to explain our own failings as contingent on the situation, but to attribute the failings of others to immutable character flaws. More generally, the argumentative theory of reasoning posits a justification faculty that subconsciously makes our reasons rosier, while our folk sociology faculty demonizes members of the outgroup. 

In social psychology, there is a distinction between dispositional beliefs (avowals that are generated live) and standing beliefs (those actively represented in long-term memory). The relationship between the content of what one says and the content of the underlying attitude may be quite complex. It is unclear whether these parochial biases act upon standing or dispositional beliefs. 

Explaining Transparency

The following section is borrowed from Carruthers (2020). 

In general, our judgments of others’ opinions come in two phases:

  1. First-pass interpretation: a representation of the attitude expressed, relying on syntax, prosody, and the salient features of conversational context.
  2. Lie detection: whenever the degree of support for the initial interpretation is lower than normal, or there is a competing interpretation in play with at least some degree of support, or the potential costs of a misunderstanding are higher than normal, a signal is sent to executive systems to slow down and issue inquiries more widely before a conclusion is reached. 

Why do our self-attributions feel transparent? Plausibly because the attribution of self-attitudes undergoes only the first stage (it is not subject to the disambiguation and lie-detection systems). This architecture would likely generate the following inference rules:

  1. One believes one is in mental state M → one is in mental state M.
  2. One believes one isn’t in mental state M → one isn’t in mental state M.

The first will issue in intuitions of infallible knowledge, and the second in the intuition that mental states are always self-presenting to their possessors.

For example, consider the following two sentences:

  1. John thinks he has just decided to go to the party, but really he hasn’t. 
  2. John thinks he doesn’t intend to go to the party, but really he does.

These sentences are hard to parse, precisely because the mindreading inference rules render them strikingly counterintuitive.
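To make the two-stage architecture concrete, here is a minimal sketch in Python. The function names, the support threshold, and the context flags are my own illustrative inventions (Carruthers specifies no such values); the point is only that other-attributions pass through a second, skeptical stage that self-attributions skip, yielding the inference rules above.

```python
from dataclasses import dataclass

@dataclass
class Interpretation:
    attitude: str   # e.g., 'WANTS(go_to_party)'
    support: float  # evidential support, 0..1

def first_pass(utterance: str, context: dict) -> Interpretation:
    """Stage 1: map syntax, prosody, and context onto a candidate attitude.
    (Stubbed out here; a real model would do semantic parsing.)"""
    return Interpretation(f"EXPRESSES({utterance!r})", context.get("support", 0.9))

def attribute(utterance: str, context: dict, *, about_self: bool) -> str:
    interp = first_pass(utterance, context)
    if about_self:
        # Self-attributions skip stage 2 and are accepted as-is, which
        # implements rule 1: 'one believes one is in M' -> 'one is in M'.
        # Hence the felt transparency.
        return interp.attitude
    # Stage 2 (others only): disambiguation and lie detection.
    if (interp.support < 0.7
            or context.get("competing_interpretation")
            or context.get("high_stakes")):
        return f"DEFER: gather more evidence before accepting {interp.attitude}"
    return interp.attitude

print(attribute("I want to go to the party", {"support": 0.5}, about_self=True))
print(attribute("I want to go to the party", {"support": 0.5}, about_self=False))
```

The same weak evidence that triggers deferral for another person is accepted without question when tagged as one’s own.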

These intuitions may be merely tacit initially, but will rapidly transition into explicit transparency beliefs in cultures that articulate them. Such beliefs might be expected to exert a deep “attractor effect” on cultural evolution, being sustained and transmitted because of their apparent naturalness. And indeed, transparency doctrines have been found in traditions from Aristotle, to the Mayans, to pre-Buddhist China.

Until next time. 

Inspiring Materials

These views are more completely articulated in Carruthers (2011). For a lecture on this topic, please see:

Works Cited

  • Carruthers (2011). The Opacity of Mind
  • Carruthers (2020). How mindreading might mislead cognitive science
  • Colom et al (2004). Working memory is (almost) perfectly predicted by g
  • Dunlosky & Metcalfe (2009). Metacognition
  • Evans (1982). The Varieties of Reference
  • Fodor (1983). The Modularity of Mind
  • Frankish (2009). How we know our conscious minds
  • Goldman (2006). Simulating Minds
  • Henkel et al (2000). Cross-modal source monitoring confusions between perceived and imagined events
  • Nichols & Stich (2003). Mindreading: An Integrated Account of Pretence, Self-Awareness, and Understanding Other Minds
  • Nisbett & Wilson (1977). Telling more than we can know: verbal reports on mental processes.
  • Rey (2013). We aren’t all self-blind: A defense of modest introspectionism
  • Wegner (2002). The illusion of conscious will
  • Wellman et al (2001). Meta-analysis of theory-of-mind development: the truth about false belief
  • Wilson (2002). Strangers to ourselves

Two Mindreading Systems

Part Of: Sociality sequence
Followup To: Counterfactual Simulation
Content Summary: 1200 words, 12 min read

A Brief Review

Mindreading (also known as mentalizing, the intentional stance, or theory of mind) is the penchant of animals to represent the mental lives of one another. What are the beliefs and desires of those around us? A classic demonstration of mindreading comes from Heider & Simmel (1944):

While the mindreading faculty was designed to understand the minds of other animals, it had no trouble ascribing beliefs and goals to two-dimensional shapes. This is roughly analogous to your email provider accepting a tennis ball as a login password.

Another classic demonstration of mindreading is the Sally-Anne test, from Baron-Cohen et al (1985):

Super-processes: Two Stages of Mindreading

In fact, mindreading can be conceptualized as two interlocking systems.

  • Stage 1: Goal Mindreading is capable of reasoning about the goals and perceptual access of other beings. It generates expectations about how people are likely to behave, given their goals and what they can see and hear.
  • Stage 2: Representation Mindreading is capable of reasoning about the concepts of other beings. It is the engine associated with pretense, lie detection, and noticing errors in others. 

From a developmental perspective, these two systems emerge at different times. Evidence adduced in Gergely et al (1994) reveals that goal mindreading emerges at twelve months. The ability to pass the Sally-Anne test arrives around 44 months, except in autistic children, whose ability to pass is severely delayed (Baron-Cohen et al 1985). However, recent evidence suggests that representation mindreading arrives much earlier than four years of age: looking-time studies demonstrate that infants are surprised by violations in false-belief scenarios, which indicates their brains are generating the underlying expectancies. 
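The computational demands of the two stages can be made concrete with a toy model of the Sally-Anne test. This is my own minimal sketch, not anyone’s published model: a goal mindreader predicts search behavior from the actual world state, while a representation mindreader maintains a separate (possibly false) belief store per agent.

```python
# Toy Sally-Anne scenario: Sally puts her marble in the basket and leaves;
# Anne moves it to the box while Sally is away.
world = {"marble": "basket"}
beliefs = {"sally": {"marble": "basket"}}  # Sally's belief store

def move(obj, place, witnesses):
    world[obj] = place
    for agent in witnesses:        # only witnesses update their beliefs
        beliefs[agent][obj] = place

move("marble", "box", witnesses=[])  # Sally is out of the room

def predict_goal_only(obj):
    """Stage 1: predict search from the actual world state (no belief store)."""
    return world[obj]

def predict_with_beliefs(agent, obj):
    """Stage 2: predict search from the agent's own, possibly false, belief."""
    return beliefs[agent][obj]

print(predict_goal_only("marble"))              # 'box': fails the test
print(predict_with_beliefs("sally", "marble"))  # 'basket': passes the test
```

Passing the test requires the extra data structure: a belief store that can diverge from the world.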

From a comparative biology perspective, there is extensive evidence of goal mindreading in non-human animals, including primates, corvids and canids. Call & Tomasello (2008) reviews mindreading studies conducted on chimpanzees, demonstrating in great detail that chimpanzees generate behavior responsive to the goals they perceive in other living things. 

However, at the time of this writing, most researchers agree that the evidence for representation mindreading in non-human animals is negative.

Taking the evidence from ontogeny and phylogeny together, we can see the gradual accretion of mental faculties over time:

Sub-processes: Components of Mindreading

The distinction between goal mindreading and representation mindreading is undeniably helpful. But it is too coarse to shed light on mechanism. More work needs to be done to identify the basis functions underlying mindreading. In the language of the theoretician’s quadrant: to move forward, we must engage in a Q3 exercise. 

We have begun to sketch an outline of these basis functions during previous discussions of social phenomena. 

Other mindreading-impacting phenomena we have not yet discussed include:

  • Agency Detection. When the natural world violates our expectations (a leaf moves against gravity), these events are often caused by a (presently unseen) agent. Mismatches between agent detection and agency detection are thought to generate the intuitions that underlie our species’ folk animism. 
  • Emotion Contagion
  • Friendship behaviors.
  • Shared Attention mechanisms. Before we can reason about the beliefs of another agent, we must first learn to attend to what that agent is attending to.
  • Cultural Psychological mechanisms. We have yet to discuss prestige biases, and our compulsive need to share information.
  • Six Pillars of Selfhood. Kahneman distinguishes the Remembering vs Experiencing self.

The following graphic attempts to bring together these subcomponents into a 10,000 foot view of the system. For more on this train of thought, I recommend Schaafsma et al (2015).

Clearly, representation mindreading involves not a single faculty, but rather a broad coalition of social faculties. It is likely that the distinct “mindreading tasks” employed in experiments typically recruit coalitions with subtly different profiles. 

Relationship to ICNs

The cognitive neuroscience community has converged on a set of neural mechanisms underlying goal and representation mindreading. These five regions are:

  1. Medial Prefrontal Cortex (MPFC)
  2. Posterior Cingulate Cortex (PCC)
  3. Temporo-Parietal Junction (TPJ)
  4. Superior Temporal Sulcus (STS).
  5. Temporal Pole (TP). 

We have begun localizing specific functions to these five regions-of-interest. The TPJ seems to be the key site for representation mindreading, whereas goal mindreading is produced by the other sites; with the temporal pole appearing to underlie desire attribution specifically.

Scientific consensus is hard to achieve without a deluge of data; this network is here to stay. But there are two reasons to hesitate before drawing further conclusions. First, “mindreading” is probably not a natural kind; neural mechanisms probably map to more granular functions that join together to produce both macrosystems.

Second, these five regions must be structurally understood in terms of intrinsic connectivity networks (ICNs), and this work has not yet been undertaken. In my writeup of ICNs, we described evidence for five such networks:

  1. Default mode network (and its three subcomponents)
  2. Salience Network and the closely related Ventral Attention Network (VAN)
  3. Dorsal Attention Network (DAN)
  4. Fronto-Parietal Control Network (FPCN) implicated in volitional control and willpower
  5. Cingulo-Opercular Control Network (COCN), implicated in working memory rehearsal and fluid intelligence.  

The five regions of interest above are a subset of what social cognition theorists describe as the sociality network. In turn, the sociality network seems to comprise a subset of the default mode network. An increasing number of theorists are gesturing towards three subnetworks within the DM network, with mindreading modules mostly but not entirely residing within one of those subnetworks. Further, we have evidence that the default mode network is the basis of interoception and allostasis (that is, the brain’s unconscious representation of the body aka the hot loop). 

These hints are suggestive. But precious little of our knowledge is detailed enough to be formalized and modeled. Someday I will be able to say more about the relationship between sociality, mindreading, interoception, and the default mode network. But that is not yet possible in 2020… at least, as far as I know.

Until next time. 

References

  • Baron-Cohen et al (1985). Does the autistic child have a “theory of mind”?
  • Call & Tomasello (2008). Does the chimpanzee have a theory of mind? 30 years later
  • Gergely et al (1994). Taking the intentional stance at 12 months of age
  • Heider & Simmel (1944) An experimental study of apparent behavior
  • Schaafsma et al (2015). Deconstructing and reconstructing theory of mind

Intrinsic Connectivity Networks

Part Of: Neuroanatomy sequence
Content Summary: 2200 words, 22 min read

Four Cortical Networks

Cognitive neuroscience typically employs fMRI scans under a carefully crafted task structure. Such research localized various task functions to different neural structures (cortical areas). For example, these studies produced evidence suggesting that the hippocampus is the seat of autobiographical memory. 

It was in the early 2000s that researchers stumbled upon a different question: what brain regions are active when the brain is at rest? Here is Raichle (2015), describing his discovery of the default mode network:

One of the guiding principles of cognitive psychology at that time was that a control state must explicitly contain all the elements of the associated task other than the one element of interest (e.g., seeing a word versus reading the same word). Using a control state of rest would clearly seem to violate that principle. Despite our commitment to the strategies of cognitive psychology in our experiments, we routinely obtained resting-state scans in all our experiments, a habit largely carried over from experiments involving simple sensory stimuli, in which the control state was simply the absence of the stimulus. At some point in our work, and I do not recall the motivation, I began to look at the resting-state scans minus the task scans. What immediately caught my attention was the fact that regardless of the task under investigation, the activity decreases almost always included the posterior cingulate and the adjacent precuneus. 

Well before the discovery of the default mode network, Posner and Petersen (1990) had put forward three networks underlying attention. The dorsal attention network generates salience maps across the perceptual field, and uses these maps to orient to interesting stimuli. The ventral attention network is involved in attention switching to novel stimuli. The executive network produces top-down control of attention, for example translating the instruction “pay attention to the green triangle” into sustained attention on an otherwise-uninteresting object. 

Fox et al (2005) brought these two worlds together in their seminal paper, which identified a brain-wide task-positive network that anticorrelated with a task-negative network. Their use of resting-state functional connectivity MRI (rs-fcMRI) provided independent evidence for the existence of these networks.  

Their task-negative network was the default mode network. And the task-positive network seemed to contain two networks previously identified: the executive network, and the dorsal top-down attention network. The ventral attention network, however, was not identified in their analysis.
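For intuition about what rs-fcMRI measures, here is a toy sketch in numpy (synthetic signals, not real BOLD data; the numbers are arbitrary). Functional connectivity is, at its simplest, the correlation matrix of regional time series, and an anticorrelated network shows up as a block of negative entries.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(600)                   # 600 synthetic 'volumes'
signal = np.sin(2 * np.pi * t / 50)  # a shared slow fluctuation

# Two task-positive regions track the fluctuation; two task-negative
# (default-mode-like) regions track its inverse. Each gets its own noise.
regions = np.stack([
    +signal + 0.5 * rng.standard_normal(t.size),
    +signal + 0.5 * rng.standard_normal(t.size),
    -signal + 0.5 * rng.standard_normal(t.size),
    -signal + 0.5 * rng.standard_normal(t.size),
])

fc = np.corrcoef(regions)  # 4x4 functional-connectivity matrix
print(np.round(fc, 2))
# Within-network entries are strongly positive; between-network entries
# are strongly negative, which is the signature Fox et al (2005) reported.
```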

And that was the state of the world in 2006. Neuroscientists had identified four networks, which we will henceforth call intrinsic connectivity networks (ICNs). They are:

  1. Executive Control
  2. Dorsal Attention
  3. Ventral Attention
  4. Default Mode Network

Towards Eight Networks

While the data supporting the legitimacy of these networks were strong, these anatomical structures pose a fairly routine challenge in neuroscience: they correlate with “too many functions”. Take the default mode network. It is associated with mind-wandering, social cognition, self-reference, semantic concepts, and autobiographical memory. How could one structure produce such widely divergent behaviors?

When you have too many functions, you have two options: look for more specific mechanisms (Q3), or group similar concepts (Q4). In many neuroscience applications, the former is more productive: reality has a surprising amount of detail.

Researchers began to find subnetworks within the executive control network. 

Dosenbach et al (2007) found two networks within the “executive network”. They found a fronto-parietal control network (FPCN), involved in error correction and control over task execution. They also found a cingulo-opercular control network (COCN), involved in task-set maintenance. The FPCN was most active at task onset and after errors, whereas the COCN expressed activity consistently throughout the task.

These subgraphs pick out useful psychological concepts. We have long known that rehearsal increases working memory capacity from 3 to 7 chunks. It seems the COCN produces this miracle (but recall that the contents of working memory, the stuff it rehearses, live in perceptual cortex; Postle 2006). Likewise, psychologists have long studied the phenomenon of willpower, or volition. The FPCN might be the neural substrate of this ability. 

Seeley et al (2007) also found substructures within the original executive network. But they didn’t see a rehearsal system in the cingulo-opercular regions. Instead, they found a salience network, which binds affective and emotional information into perceptual objects, and links to the basal ganglia reward system. 

Since publication, each of these networks has been replicated dozens of times, using widely diverging paradigms (ROI vs voxel granularity, fMRI vs rs-fcMRI) and statistical techniques (graph theory, dynamic causal modeling, hierarchical clustering, and independent component analysis). 

Unfortunately, these subnetworks looked and behaved radically differently. For years, neuroscientists collected data under these diverging theories. Petersen & Posner’s (2012) updated theory of attention relies on Dosenbach’s rehearsal network, whereas many other articles took inspiration from Seeley’s salience network. 

And then, a miracle. Power et al (2011), using graph-theoretic tools and more granular data, identified both the salience and rehearsal networks hidden within the cingulo-opercular graph. Despite the close proximity of these two networks, they perform dramatically diverging functions.

They also discussed the spatial distribution of these networks across cortex. Essentially, the attention networks are sandwiched between sensorimotor networks and prefrontal control networks. This configuration might play an important role in reducing wiring cost for between-network communication. 

Default Mode Network and Interoception

Power et al (2011) also compared network properties of their ICNs and discovered two categories of ICN:

  • processing networks that are directly involved in perceptual-action loops. These networks tend to be very modular in their organization.
  • control networks that modulate cybernetic loops. These networks tend to have more extra-subgraph relationships.

The above illustrates an intriguing finding: the default mode network is a processing network, rather than a control network. But which sense modality does it serve? 

The answer is straightforward to an affective neuroscientist. The default mode network and the salience network together comprise the seat of the hot loop, which performs:

  • interoception (viscerosensory body perception); and
  • allostasis (visceromotor body regulation)

It is a cornerstone of dual cybernetic loops. Indeed, comparative studies with macaque monkeys put empirical meat on this assertion:

  • { anterior cingulate cortex, dorsal amygdala, ventral anterior insula } perform visceromotor functions (allostasis)
  • { dorsal anterior insula } perform viscerosensory functions (interoception). 

As Kleckner et al (2017) show, these assertions are borne out by myriad human rs-fcMRI studies, and further bolstered by tract-tracing studies in non-human animals.

I’ll note in passing that most experts now detect three subgraphs within the default mode network (cf. Andrews-Hanna et al 2014). The functional signature of these subgraphs, however, has not yet been worked out.

Network Neuroscience

We have so far discussed function-derived structures, with techniques such as rs-fcMRI computing ICNs from the dynamics of neural activity. A complementary research tradition produces anatomy-derived structures, with a more anatomical emphasis on connectome studies. These two network types have important differences, including time scales (anatomy-derived structures tend to persist longer than task-dependent structures) and levels of detail (neuron versus region-of-interest). Nevertheless, these data can be made to usefully constrain one another (functional networks are beginning to look more like structural networks, and vice versa). 

These approaches have recently coalesced (Bassett & Sporns 2017) into the new discipline of network neuroscience, which shares many of its techniques with network science and social network analysis. 

If a neuron is a node in a graph, and a synapse is an edge, what properties does the graph of a human brain enjoy? Several kinds of network are possible. Regular networks enjoy rich local connections, but few cross-graph connections. Random networks enjoy more long-range connections, but are less structured. Small-world networks represent a kind of middle ground: lots of local structure, plus the ability to make long-range connections.

With graph-theoretic measures, we can quantitatively partition networks into sets of modules. A hub is a node with high centrality (e.g., node degree: how many edges attach to that node). A connector hub facilitates between-module communication; a provincial hub promotes communication within a module. 

Connectome studies (anatomy-derived structural networks) have shown that brain hub regions are more densely interconnected than predicted on the basis of their degree alone. This set of unusually central connector hubs is called the rich club. The rich club comprises the most metabolically expensive areas of cortex: they are “high cost, high value”. They are loosely analogous to the thirteen DNS root servers that anchor the internet.

Human neural architecture is thus a specific kind of small-world network, one equipped with a “rich club”. These topologies have been shown to exist in other species, such as macaque monkeys and cats. Interestingly, some hubs (posterior cingulate, precuneus, and medial frontal cortex) act as sinks (more afferent than efferent connections), whereas hubs within attentional networks (incl. dorsal prefrontal, posterior parietal, visual, and insular cortex) act as sources (more efferent than afferent connections). 
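These graph measures are easy to experiment with. Below is a minimal sketch using the networkx library on a synthetic Watts-Strogatz small-world graph; the parameters are illustrative, not derived from any actual connectome.

```python
import networkx as nx

# A small-world graph: a ring lattice (rich local structure) with a
# fraction p of edges rewired into long-range shortcuts.
G = nx.connected_watts_strogatz_graph(n=200, k=8, p=0.1, seed=42)

print("clustering:", round(nx.average_clustering(G), 3))             # high, lattice-like
print("path length:", round(nx.average_shortest_path_length(G), 2))  # short, random-like

# Hubs: nodes of unusually high degree.
degrees = dict(G.degree())
hubs = sorted(degrees, key=degrees.get, reverse=True)[:5]
print("top hubs:", hubs)

# Rich-club coefficient phi(k): edge density among nodes of degree > k.
# A rich club shows up as phi rising toward 1 at high k.
phi = nx.rich_club_coefficient(G, normalized=False)
print("phi at highest degree:", round(phi[max(phi)], 3))
```

A synthetic lattice-plus-shortcuts graph won’t show a pronounced rich club; the point of the connectome studies is that real brains do, beyond what their degree distribution alone predicts.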

What does this have to do with ICNs? As shown by van den Heuvel & Sporns (2013b), the rich club seems to be the substrate of inter-ICN communication. 

Networks vs Consciousness

According to global workspace theory, conscious contents are generated via a publicity organ, which selects perceptual information worthy of further processing by downstream modules. There is, however, much disagreement about the mechanism of conscious contents. Theories include:

  1. Dehaene and Changeux have focused on frontal cortex 
  2. Edelman and Tononi on complexity in re-entrant thalamocortical dynamics 
  3. Singer and colleagues on gamma synchrony
  4. Flohr on NMDA synapses
  5. Llinas on a thalamic hub
  6. Newman and Baars on thalamocortical distribution from sensory cortex

Shanahan (2012) offered a new hypothesis: that the rich club is the basis of consciousness. Its central location and its role in synchronizing large-scale brain networks make it a plausible suspect. However, it is unclear whether the rich club is primarily facilitated by corticocortical white matter, or by corticothalamic reentrant loops. If the latter, the hypothesis would converge with existing theories that emphasize the role of the thalamus.

There is some evidence that the thalamus facilitates ICNs. Habas et al (2009) found strong links between cerebellar substructures and various ICNs. This finding is suggestive because cerebellar error signals are passed to cortex through the thalamus. 

Networks vs Modules

ICNs comprise a central organizing principle of the nervous system. But they are not the only such principle; we have identified some fifteen others!

It is difficult to reconcile intrinsic connectivity networks (ICNs) with massive modularity, so that will be the topic of this section.

ICNs have been seized upon by some theorists in the Bayesian predictive coding tradition (e.g., Barrett & Simmons 2015) as evidence of the illegitimacy of modules. But most ICN theorists still admit the centrality of modules (e.g., Sporns & Betzel 2015). Here, for example, is van den Heuvel & Sporns (2013a):

Since the beginning of modern neuroscience, the brain has generally been viewed as an anatomically differentiated organ whose many parts and regions are associated with the expression of specific mental faculties, behavioral traits, or cognitive operations. The idea that individual brain regions are functionally specialized and make specific contributions to mind is supported by a wealth of evidence from both anatomical and physiological studies. These studies have documented highly specific cellular and circuit properties, finely tuned neural responses, and highly differentiated regional activation profiles across the human brain. Functional specialization has become one of the enduring theoretical foundations of cognitive neuroscience. 

Most researchers now admit the interaction of both principles (specialization and integration). It is unclear how it could be otherwise. I have personally read far too many papers that describe activity in the dorsolateral prefrontal cortex as task-specific, without considering that it may be a simple expression of the volitional control or working memory rehearsal networks. Similarly, I have read dozens of reviews of the anterior insula that would have profited from the realization that it participates in at least three different ICNs. 

The three streams hypothesis integrates notions of massive modularity, cortical streams, the abstraction hierarchy, and the cybernetic loop hypothesis. It is less clear how ICNs might integrate with these organizing principles. 

Does the ventral temporoparietal junction (vTPJ) only perform integrative functions in service of the ventral attention network (VAN)? Or is the real estate claimed by these ICNs also used to perform specialized computations, such as mindreading? The latter proposition strikes me as more likely. But I’d like to see more data on this. To be continued…

Wrapping Up

The human cortex has intrinsic connectivity networks (ICNs) that coordinate to provide integrative services on behalf of our central nervous system. Researchers have so far identified the following networks:

  • Default mode network (and its three subcomponents)
  • Salience Network and the closely related Ventral Attention Network (VAN)
  • Dorsal Attention Network (DAN)
  • Fronto-Parietal Control Network (FPCN) implicated in volitional control and willpower
  • Cingulo-Opercular Control Network (COCN), implicated in working memory rehearsal and fluid intelligence.  

Until next time.

Works Cited


  1. Andrews-Hanna et al (2014). The default network and self-generated thought: component processes, dynamic control, and clinical relevance. 
  2. Bassett & Sporns (2017). Network Neuroscience
  3. Barrett & Simmons (2015). Interoceptive predictions in the brain. 
  4. Christoff et al (2016). Mind-wandering as spontaneous thought: a dynamic framework
  5. Dosenbach et al (2007). Distinct brain networks for adaptive and stable task control in humans
  6. Fox et al (2005). The human brain is intrinsically organized into dynamic, anticorrelated functional networks
  7. Kleckner et al (2017). Evidence for a large-scale brain system supporting allostasis and interoception in humans. 
  8. Laird et al (2011). Behavioral Interpretations of Intrinsic Connectivity Networks
  9. Habas et al (2009). Distinct Cerebellar Contributions to Intrinsic Connectivity Networks
  10. Posner & Petersen (1990). The attention system of the human brain
  11. Petersen & Posner (2012). The attention system of the human brain: 20 years after
  12. Postle (2006). Working memory as an emergent property of the mind and brain.
  13. Power et al (2011). Functional Network Organization of the Human Brain
  14. Raichle (2015). The Brain’s Default Mode Network
  15. Seeley et al (2007). Dissociable Intrinsic Connectivity Networks for Salience Processing and Executive Control
  16. Shanahan (2012). The Brain’s Connective Core and its Role in Animal Cognition
  17. Sporns & Betzel (2015). Modular Brain Networks
  18. van den Heuvel & Sporns (2013a). Network hubs in the human brain 
  19. van den Heuvel & Sporns (2013b). An anatomical substrate for integration among functional networks in human cortex

Consciousness as a Learning Device

Part Of: Consciousness sequence
Content Summary: 1600 words, 16 min read
Inspiration: Baars (1998) A Cognitive Theory of Consciousness.

Automatization in Tasks

Almost everything we do, we do better unconsciously than consciously. In first learning a new skill we fumble, feel uncertain, and are conscious of many details of action. Once the task is learned, we lose consciousness of the details, forget the painful encounter with uncertainty, and sincerely wonder why beginners seem so slow and awkward. 

In dual-task paradigms, subjects are asked to perform two tasks simultaneously. Performance is often poor, because of the limited capacity of consciousness. But when a subject extensively practices one of the tasks, that task stops interfering with the other, and performance improves.

Consider reading, the act of translating visual letters into conceptual meaning. Reading proceeds automatically. If you see the word “pink”, it is nearly impossible to avoid subvocalizing and imagining the color (inner speech and semantic recall). You are not aware of identifying individual letters, or searching your memory for the requisite sounds and meanings – they just occur.

Driving a car is yet another example of a skill that becomes automatic:

When we first learn to drive a car, we are very conscious of the steering wheel, the transmission lever, the foot pedals, and so on. But once having learned to drive, we minimize consciousness of these things and become mainly concerned with the road, with turns in the road, traffic to cope with, and pedestrians to evade. The mechanics of driving become part of the unconscious frames within which we experience the road. 

But even the road can be learned to the point of minimal conscious involvement if it is predictable enough: then we devote most of our consciousness to thinking of different destinations, of long-term goals, and so forth. The road has itself now become “framed”. The whole process is much like Alice moving through the Looking Glass, entering a new reality, and forgetting for the time being that it is not the only reality. Things that were previously conscious become presupposed in the new reality. In fact, tools and subgoals in general become framed as they become predictable and automatic.

Why, when the act of driving becomes automatic, do we become conscious of the road? Presumably the road is much more informative, for our purposes, than driving has become. Dodging another car, turning a blind corner, braking for a pedestrian: these are much less predictable than the handling of the steering wheel. 

The process of automatizing a skill is called habituation. Habituation involves an increase in performance and a decrease in demand for cognitive resources. But it also involves:

  • loss of self-monitoring: an unpracticed beginner is aware of their own performance, but an expert practitioner can be deceived into believing her performance was much worse than it actually was.
  • loss of long-term working memory. Consider, in typing: which finger is used to type the letter c? Most people have to consult their fingers to find the answer.

Suppose someone is given a shape from among the following set, and asked to memorize it. They then receive pairs of other images, and must select which image is more similar to the memorized original. 

Pani (1982) found that, as subjects practiced the task, the original image faded from consciousness even as the responses became faster and more accurate. 

Automatization in Perception

The Pani experiment suggests that it is not merely actions that move to autopilot. Perception can fade from consciousness as well.

Consider the pressure of the chair you are sitting in. Before I mentioned it, that tactile sensation had likely faded into the background. In contrast, the visual experience of reading these words was very much at the center of your conscious experience. 

What is the difference between the tactile quality of the chair and the visual experience of these words? Redundancy! The chair feels very similar from one moment to the next, whereas each new word delivers a subtly different experience. 

These redundancy effects are pervasive. Consider the experience of moving to an area with a distinctive smell. For the first few days, the smell is at the forefront of your conscious experience; but over time, this redundant sensation fades to the background.

We have seen redundant touch and smell fade from consciousness. Why don’t we become blind to redundant visual information?

Unlike touch and smell, our foveae constantly move across the visual field, in involuntary movements called saccades. This might be one way the visual system combats redundancy.

If you mount a tiny projector on a contact lens firmly attached to the eye, you can ensure that the visual image is invariant to eye movements. Pritchard et al (1960) found that in such conditions, the visual image fades in a few seconds. Similarly, when people look at a bright but featureless field (the Ganzfeld), they experience “blank outs”: periods when visual perception seems to fade altogether (Natsoulas 1982). When vision is not protected by saccades, it behaves just like the other senses.

Becoming blind to redundant information is not limited to perception. Semantic satiation occurs when a person repeats the same word over and over again, until the word starts to feel foreign and arbitrary. Try it for yourself: say “gum” to yourself fifty times and see what happens. 

There is a school of thought that interprets these redundancy effects as anatomical fatigue (perhaps processing the same image dozens of times exhausts neurotransmitters in the relevant microcircuits). But these interpretations are confounded by our ability to be surprised by the lack of a stimulus, which implies that the redundancy is encoded in terms of information rather than energy.
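One way to make “information rather than energy” concrete: under a simple predictive model, the surprisal of an event is -log p(event). A repeated stimulus carries almost no surprisal, while the omission of an expected stimulus carries a great deal, despite delivering no stimulus energy at all. A toy sketch (my own, with made-up probabilities):

```python
import math

def surprisal(p: float) -> float:
    """Shannon surprisal in bits: -log2(p)."""
    return -math.log2(p)

# After long habituation, the model predicts the stimulus on each tick
# with p = 0.99, and predicts its absence with p = 0.01.
print(round(surprisal(0.99), 3))  # ~0.014 bits: the repeated stimulus is uninformative
print(round(surprisal(0.01), 3))  # ~6.644 bits: the *missing* stimulus is highly informative
```

A fatigue account predicts no response to silence (there is no input to fatigue); an information account predicts a large one. Omission responses therefore favor the information account.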

It is also worth noting that redundant perceptions do not fade into the background if they are highly relevant to the organism’s health and goals. Chronic pain and hunger fall under this rubric. These are, however, exceptions to the rule. 

Errors and Curiosity

When we experience difficulty performing automatized tasks, conscious access returns.

  • In reading, lexical access becomes automatic. But simply turning a book upside down will interfere with our reading proficiency, and the perceptual details of “stitching letters into words” come back to us.
  • In visual matching, our ability to describe the original target image disappears as we become proficient. But by simply increasing task complexity, our ability to describe the target image returns.
  • In driving, if we move to a new city, our routing autopilot procedures evaporate, and we are more conscious of navigational decisions. If we buy a new car with different operating characteristics (a more sensitive brake pedal, and less sensitive steering control), the mechanical details of driving flood back into our consciousness. 

It seems that consciousness is used to debug automatic processes that run into difficulties.

We often tire of practicing tasks that we have mastered. We often tire of receiving sense data we can fully anticipate. In the case where our brain has fully habituated to some phenomena (and indeed, often before that point is reached), curiosity moves our attention towards other domains. This impulse towards novelty is one way our brain builds a diverse coalition of mental modules capable of responding to an intrinsically complicated world.

Towards A Theory of Conscious Learning

From the global workspace perspective, we expect consciousness to be involved in learning novel events. Such learning requires unpredictable communication patterns between modules; a feat only possible by way of widespread broadcasting. 

Consider the radical simplicity of the act of learning itself. To learn anything new, we merely pay attention to it. By merely allowing ourselves to interact consciously with a new language (even without a learning plan or knowledge of its syntactic structure), we nevertheless “magically” acquire the ability to comprehend and speak.

Today we explored the relationship between learning, and the habituation of awareness. Baars says it best,

Habituation is not an accidental by-product of learning. Rather, it is something essential, connected at the very core to the acquisition of new information. And since learning and adaptation are perhaps the most basic functions of the nervous system, the connection between consciousness, habituation, and learning is fundamental indeed.

Factoring in our observations about error and curiosity, it seems that learning can be modeled as a push-pull system. Learning promotes habituation, error promotes deautomatization, and curiosity redirects the brain to different activities once the current one has been mastered.

This learning-and-habituation versus curiosity dynamic bears a striking resemblance to the reinforcement learning dichotomy of exploitation versus exploration. 
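The analogy can be made concrete with a toy multi-armed bandit; this is my own sketch, not a model of any brain circuit. Mastery drives a task’s learning signal toward zero (habituation), and a curiosity bonus redirects choice toward less-practiced alternatives (exploration).

```python
tasks = ["reading", "driving", "juggling"]
skill = {t: 0.0 for t in tasks}   # current mastery, 0..1
visits = {t: 1 for t in tasks}

def learning_signal(task):
    return 1.0 - skill[task]      # mastered tasks yield ~0 new information

def curiosity_bonus(task):
    return 1.0 / visits[task]     # rarely-practiced tasks look attractive

for step in range(300):
    # Choose the task maximizing learning + curiosity
    # (exploitation + exploration).
    task = max(tasks, key=lambda t: learning_signal(t) + curiosity_bonus(t))
    skill[task] += 0.1 * learning_signal(task)   # diminishing returns
    visits[task] += 1

print({t: round(s, 2) for t, s in skill.items()})
# All three skills approach mastery: attention migrates away from each
# task as its learning signal habituates toward zero.
```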

Towards The Future

I noted in Function of the Basal Ganglia that habituation has been associated with control shifting from the associative to the sensorimotor loop in the basal ganglia. This is hard to reconcile with the neurological basis of consciousness in the corticothalamic system. A more systematic account of these biological interactions is required. 

Consciousness has been linked to many other functions besides learning and habituation. It is most natural to interpret polyfunctional biological systems like this as having accreted functions across evolutionary time. Untangling the phylogenetic ordering of these subfunctions (peeling the onion) is an important task that will require input from comparative anatomy.

The consciousness organ is not the only system to exhibit redundancy effects. Habituation to repeated input is a universal property of neural tissue. Even a single neuron will respond to electrical stimulation at a given frequency only for a while; after that, it will cease responding to the original frequency, but continue to respond to other frequencies (Kaidel et al 1960). The relationship between the specific corticothalamic system and these microproperties of neurons is also an open research area.

Until next time. 

References

  • Baars (1998), A Cognitive Theory of Consciousness, especially sections 1.2.4, 1.3.3, 1.4.1, 1.4.4, and 3
  • Pani (1982). A functionalist approach to mental imagery.
  • Pritchard et al (1960). Visual perception approached by the method of stabilized images.
  • Kaidel et al (1960). Sensory Communication (pp 319-338).
  • Natsoulas (1982). Dimensions of perceptual awareness.

[Excerpt] Language vs Communication

Part Of: Language sequence.
Excerpt From: Tecumseh Fitch, The Evolution of Language
Content Summary: 800 words, 4 min read

What kind of sound does a dog make? That depends on which language you speak. Dogs are said to go ouah ouah in French, but ruff or woof in English. 

Crucially, however, the sounds that the dogs themselves make do not vary in this way. Dogs growl, whine, bark, howl and pant in the same way all over the world. This is because such sounds are part of the innate behavioral repertoire that every dog is born with. This basic vocal repertoire will be present even in a deaf and blind dog. This is not, of course, to say that dog sounds do not vary: they do. You may be able to recognize the bark of your own dog, as an individual, and different dog breeds produce recognizably different vocalizations. But such differences are not learned; they are the inevitable byproducts of the fact that individuals vary, and differences at the morphological, neural or “personality” level will have an influence on the sounds an individual makes. Dogs do not learn how to bark or growl, cats do not learn how to meow, and cows do not learn their individual “moos”. Such calls constitute an innate call system. By “innate” in this context, I simply mean “reliably developing without acoustic input from others” or canalized. For example, in experiments where young squirrel monkeys were raised by muted mothers, and never heard conspecific vocalizations, they nevertheless produced the full range of calls. 

The same regularity applies to important aspects of human communication. A smile is a smile all over the world, and a frown or grimace of disgust indicates displeasure everywhere. Not only are many facial expressions equivalent in all humans, but their interpretation is as well. Many vocal expressions are equally universal. Such vocalizations as laughter, sobbing, screaming, and groans of pain or pleasure are just as innately determined as the facial expressions that normally accompany them. Babies born both deaf and blind, unable to perceive either facial or vocal signals in their environment, nonetheless smile, laugh, frown, and cry normally. Again, just as for dog barking, individuals vary, and you may well recognize the laugh of a particular friend echoing above the noise in a crowded room. And we have some volitional control over our laughter: we can (usually) inhibit socially inappropriate laughter. These vocalizations form an innate human call system. Just like other animals, we have a species-specific, innate set of vocalizations, biologically associated with particular emotional and referential states. In contrast, we must learn the words or signs of language. 

This difference between human innate calls, like laughter and crying, and learned vocalizations, like speech and song, is fundamental (even down to the level of neural circuitry). An anencephalic human baby (entirely lacking a forebrain) still produces normal crying behavior but will never learn to speak or sing. In aphasia, speech is often lost while laughter and crying remain normal. Innate human calls provide an intuitive framework for understanding a core distinction between language and most animal signals, which are more like the laughs and cries of our own species than like speech. Laughs and cries are unlearned signals with meanings tied to important biological functions. To accept this fact is not to deny their communicative power. Innate calls can be very expressive and rich: indeed, their affective power may be directly correlated with their unlearned nature. The “meaning” of a laugh can range from good-natured conviviality to scornful, derisive exclusion, just as a cat’s meow might “mean” she wants to go out, she wants food, or she wants to be petted. Insightful observers of animals and man have recognized these fundamental facts for many years. 

Obviously, signals of emotion and signals of linguistic meaning are not always neatly separable. In vocal prosodic cues, facial expressions, and gestures, our linguistic utterances are typically accompanied by “non-verbal” cues to how we feel about what we are saying. One signal typically carries both semantic information intelligible only to those who know the language, and a more basic set of information that can be understood by any human being or even other animals. Non-verbal expressive cues are invaluable to the child learning language, helping to coordinate joint attention and disambiguate the message and context. They also make spoken utterances more expressive than a written transcription alone. Other than the exclamation mark or emoticons, our tools to transcribe the expressive component are limited, but the ease and eagerness with which humans read illustrates that we can nonetheless understand language without this expressive component. This too, reinforces the value of a distinction between two parallel, complementary systems. 

As we discuss other animals’ communication systems, I invite the reader to compare these systems not only to language exchanges, but also to the last time you had a good laugh with a group of friends, and the warm feeling that goes along with it, or the sympathetic emotions summoned by seeing someone else cry, scream, or groan in pain. The question we must ask is: “is this call type more like human laughter and crying, or more like speech or song?” I will shortly argue that all non-human communication systems fall in the former category. 

Intro to Continental Drift

Part Of: Biology sequence
Content Summary: 1500 words, 15 min read

Continental Drift

Every schoolchild recognizes that the shapes of Africa and South America “match” one another, like puzzle pieces. The meteorologist Alfred Wegener went further, and showed that not only do the shapes of the continents match: a beach in South America was often more similar to its “counterpart” in Africa than to adjacent beaches along its own coastline. On the basis of such data, he proposed continental drift. Africa and South America had once been neighbors, but spread apart over the course of Earth’s history. 

The theory of continental drift was initially controversial. The evidence for continental drift was there, yet geologists were unconvinced, because they could not conceive of a mechanism: a physical process that might cause entire continents to move. The Expanding Earth hypothesis held that continental drift was an artifact of an expanding Earth, with oceans “filling in the gaps” between the continents; but these conjectures were never formalized, nor did they receive experimental support. Eventually, however, powerful evidence led tectonic theory to emerge as the mechanism powering continental drift. 

Let’s turn our attention to tectonic theory.

Tectonic Theory

The first evidence for tectonic theory came from expeditions to map the ocean floor. These revealed enormous mountain ranges running down precisely the middle of the Atlantic Ocean (among other places). Those mountain ranges were later discovered to be volcanically active. Deep trenches were also discovered around this time.

It was then realized that the volcanoes and trenches comprise a kind of conveyor-belt system, with two complementary mechanisms for creating and retiring crust. Just as wooden blocks are pulled apart when placed in boiling water, continents are pulled apart by convection currents powered by heat from the Earth’s core.

Our ability to use seismographs to record earthquakes was also maturing. You may have heard of the Ring of Fire, a shape along the edge of the Pacific Ocean with higher susceptibility to earthquakes (I’m looking at you, San Francisco). If you look at a more complete distribution of earthquakes, you can begin to see shapes emerging. 

These tiles are tectonic plates. Here is a higher resolution image of plate boundaries. 

Tectonic plates are not hypotheses; they are physical objects with a history. A good way to appreciate this is by understanding that we can use earthquake measurement instruments to see into the Earth’s interior, in a process not unlike echolocation. Such techniques have been used to figure out the diameter of Earth’s core. They have also revealed fully submerged tectonic plates. These include the Farallon Plate underneath North America, which has not yet been fully reabsorbed by the surrounding mantle.

When rocks are created, they are hot enough to receive an imprint of the Earth’s magnetic field. During WW2, submarines noticed that the seafloor was striped: at one location, the magnetic field pointed North; move a few kilometers west, and the field pointed South. Why the stripes? Separately, evidence emerged for geomagnetic reversals: during the last 83 million years, the Earth’s magnetic field has reversed 183 times. Together, continental drift and geomagnetic reversals explain the magnetic stripes: each stripe is crust that formed at the ridge while the field held a single polarity, then was carried away as the field reversed.
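These numbers make the stripes quantitative. With 183 reversals in 83 million years, a reversal occurs roughly every 450,000 years; at an assumed half-spreading rate of 2.5 cm per year (an illustrative figure, since real rates vary by ridge), each stripe should be on the order of ten kilometers wide:

```python
reversals = 183
span_years = 83e6
half_rate_cm_per_yr = 2.5  # assumed half-spreading rate (illustrative)

interval_years = span_years / reversals                       # years per polarity epoch
stripe_width_km = interval_years * half_rate_cm_per_yr / 1e5  # cm -> km

print(f"average polarity interval: {interval_years:,.0f} years")  # ~454,000
print(f"expected stripe width: ~{stripe_width_km:.0f} km")        # ~11
```

Observed stripe widths are broadly of this order, which is why the striping counted as such strong evidence for seafloor spreading.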

Volcanoes, Mountains, and Cratons

Volcanoes promote seafloor spreading. But not all volcanoes exist at plate boundaries. Consider Hawaii. The fifteen volcanoes that make up the eight islands of Hawaii are the youngest in the Hawaiian-Emperor seamount chain, which comprises more than 129 volcanoes. Note the “V”-shaped pattern.

Why is there such a long chain of dormant volcanoes connected to the active volcanoes in Hawaii? The most common explanation is mantle plumes, caused by processes of Rayleigh–Taylor instability. Tectonic plates drag oceanic crust over these plume-based hotspots, which “poke holes” in the lithosphere – like fabric sliding past the needle of a sewing machine.

Why does this seamount chain “change direction”? Magnetic evidence shows that the Pacific plate simply changed direction, some 40 million years ago. 

Volcanoes create new crust, which is carried along the conveyor belt for consumption in the trenches. In this sense, oceanic crust is perpetually being recycled, with the creation of new crust and the destruction of old crust occurring simultaneously. This explains two salient facts: oceanic crust is much younger and thinner than continental crust.

As a result of this conveyor-like motion, the thin and mobile oceanic crust slowly “squeezes” the continental landmass. This is the basis of orogeny, the process of mountain formation. It’s fun to think about, especially when you’re traveling across these tremendous landmarks…

Most mountain ranges hug the coastline. Why, then, are the Himalayas so far inland? Rather than an ocean exerting compressive force on Asia, the Himalayas were formed by the entire subcontinent of India slowly moving north, ultimately engaging in a slow-motion collision with the Eurasian landmass.

Oceanic crust doesn’t last long before being recycled. Continental crust, by contrast, has a talent for persisting and growing increasingly thick. Here, the concept of a geological province may help. Some continental crust is truly ancient, massive, and deep: these regions are called cratons. It is primarily in these 2-billion-year-old rocks that we find kimberlite, the stuff that contains diamonds. 

The Supercontinent Cycle

As you can see, there is no room for doubt that continents move. Indeed, GPS is able to detect continental drift in real time (arrows represent the direction and magnitude of drift). 

Let’s regroup. We know how the continents are arranged today, and are able to infer some information about other time periods. How much information, exactly? 

Looking forward, we have a fairly good idea of what will happen during the next 50 million years. It doesn’t take a rocket scientist to look at the GPS data above and conclude that Africa will collide with Europe, for example. But even though continents move very slowly and predictably on human time scales, after a certain amount of time our models begin to outstrip our data.
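For a sense of scale, here is a rough back-of-envelope calculation (assuming a representative plate speed of ~2 cm/year, on the low end of rates observed by GPS): 2 \text{ cm/year} \times 50 \text{ million years} = 1000 \text{ km}. Fifty million years is enough time to open or close a sizeable fraction of an ocean.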

Looking backward, we have much more data to use in reverse-engineering previous geological periods. For example, we’ve already seen coastline-matching and mountain ridge evidence suggesting that Africa and South America were adjacent during the reign of supercontinent Pangaea, some 300 million years ago. 

Here is a reasonably high-quality animation of continental drift.

So that’s Pangaea. We also have solid evidence for another supercontinent called Rodinia some 700 mya. 

Our understanding of these two supercontinents is surprisingly complete. However, our ability to reverse-engineer the past becomes less precise (and correspondingly, more controversy-ridden) as we go beyond about 1000 mya. Nevertheless, most geologists think they can make out the existence of four more supercontinents in their data. 

Mineralogy provides some evidence for the existence of supercontinents: certain minerals only form during these periods. Below is the formation history of molybdenite; similar graphs exist for other minerals. Further, magnetic imprints and radiometry allow us to glimpse the spatial organization of extinct supercontinents.

Why so many supercontinents? Consider again the story of Pangaea. It used to be happily unified. But then a transcontinental rift occurred, separating what are now Africa and South America. Rifts occur where convection currents pull apart a single plate and the in-between land sinks; modern-day rifts exist under the Red Sea and along the East African Rift Valley. 

While a supercontinent exists, there is a single continuous body of water. After a transcontinental rift, a new ocean forms inside the rift (in our case, the Atlantic). The Atlantic Ocean is growing; the Pacific Ocean is shrinking. This process will continue until the Pacific Ocean is no more: a single ocean, and a single supercontinent. This is the supercontinent cycle.

Biogeography

Animals have been around for some 700 million years. Dead animals sometimes leave behind fossils. 

Species are composed of populations: members of a species that live adjacent to one another. These populations tend to occupy a continuous stretch of land. Why? Regardless of where speciation occurs, populations have only one way to spread out: walking.

The distribution of fossils starts to make more sense when you recall where the continents existed when these organisms lived.

Continental drift doesn’t just shed light on extinct species, but also living ones. An example to whet your appetite:

Why are marsupial mammals largely confined to Australia? By the time placental mammals were invented, Australia had already separated from the rest of Gondwana.

Finally, biogeography is a predictive science: we can use it to predict where to find fossils. Paleontologists can and do consult with geologists to figure out where in today’s rocks to look for yesterday’s animals. Did you think the transitional form Tiktaalik was found by accident? No: Neil Shubin looked at species on either side of the “gap”, and reviewed when & where they lived. He then interpolated the spacetime location of the transitional form, and “replayed the clock” to figure out where on contemporary Earth to look: his team searched specific Devonian strata on Ellesmere Island in the Canadian Arctic, and the rest is history.

In closing, a few interesting parallels between geology and biology are worth noting:

  • Tectonic theory explains continental drift, just as natural selection explains common descent. 
  • More recently, we can directly observe continental drift with GPS data, just as we can now directly observe speciation. 

Takeaways

  • Continents move. South America used to be connected to Africa. We can even see continents drift using GPS.
  • Continental drift occurs because the Earth’s core is hot, creating convection currents in the mantle that push on the tectonic plates above.
  • Mountains are created by plates squishing into one another. 
  • Some volcanoes are formed by mantle plumes “poking holes” into otherwise solid plates.
  • Ocean crust is “recycled” relatively rapidly. Continental crust persists as cratons.
  • Continents regularly coalesce into supercontinents, then disperse via transcontinental rifts. This is the supercontinent cycle.
  • The history of life is intertwined with continental drift. We can use the Earth sciences to constrain our knowledge of common descent.

Intro to Regularization

Part Of: Machine Learning sequence
Followup To: Bias vs Variance, Gradient Descent
Content Summary: 1100 words, 11 min read

In Intro to Gradient Descent, we discussed how loss functions allow optimization methods to locate high-performance models.

But in Bias vs Variance, we discussed how model performance isn’t the only thing that matters. Simplicity promotes generalizability.

One way to enhance simplicity is to take the model discovered by gradient descent, and manually remove unnecessary parameters.

But we can do better. In order to automate parsimony, we can embed our preference for simplicity into the loss function itself.

But first, we need to quantify our intuitions about complexity.

Formalizing Complexity

Neural networks are often used to classify large collections of images. The complexity of such models tends to correlate with the number of layers. For these models, then, complexity is captured by the number of parameters.

While not used much in industry, polynomial models are pedagogically useful examples of regression models. Here, the degree of the polynomial expresses the complexity of the model: a degree-eight polynomial has more “bumps” than a degree-two polynomial.

Consider, however, the difference between the following regression models:

y_A = 4x^4 + 0.0001x^3 + 0.0007x^2 + 2.1x + 7

y_B = 4x^4 + 2.1x + 7

Model A uses five parameters; Model B uses three. But their predictions are, for all practical purposes, identical. Thus, the size of each parameter is also relevant to the question of complexity.
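We can check this numerically. Below is a minimal sketch (the two polynomials are exactly Models A and B above; the evaluation interval [-3, 3] is an arbitrary choice of mine):

```python
import numpy as np

# Model A: five parameters; Model B: three.
def y_A(x):
    return 4*x**4 + 0.0001*x**3 + 0.0007*x**2 + 2.1*x + 7

def y_B(x):
    return 4*x**4 + 2.1*x + 7

x = np.linspace(-3, 3, 100)
# Maximum disagreement is ~0.009, against outputs of order 300.
print(np.max(np.abs(y_A(x) - y_B(x))))
```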

The above approaches rely on the model’s parameters (its “visceral organs”) to define complexity. But it is also possible to rely on the model’s outputs (its “behaviors”) to achieve the same task. Consider again the classification decision boundaries above. We can simply measure the spatial frequency (the “squiggliness”) of the boundary as another proxy for complexity.

Here, then, are three possible criteria for complexity:

  1. Number of parameters
  2. Size of parameters
  3. Spatial frequency of decision manifold

Thus, operationalizing the definition of “complexity” is surprisingly challenging.

Mechanized Parsimony

Recall our original notion of the performance-complexity quadrant. By defining our loss function exclusively in terms of residual error, gradient descent learns to prefer accurate models (to “move upward”). Is there a way to induce leftward movement (towards simplicity) as well?

To have gradient descent respond to both criteria, we can embed them into the loss function. One simple way to accomplish this: addition.

This technique is an example of regularization.

Depending on the application, sometimes the errors are much larger than the parameters, or vice versa. To ensure the right balance between the two terms, practitioners typically add a hyperparameter \lambda to the regularized loss function: J = \|e\|_2 + \lambda \|\theta\|_2
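To make this concrete, here is a minimal sketch of gradient descent on an L2-regularized linear regression. (I use squared norms for differentiability, a common convention; the synthetic data, learning rate lr, and lam value are illustrative assumptions, not canonical settings.)

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 features
theta_true = np.array([2.0, 0.0, -1.0])
y = X @ theta_true + 0.1 * rng.normal(size=100)

lam = 0.1       # regularization strength (lambda); illustrative value
lr = 0.1        # learning rate
theta = np.zeros(3)

for _ in range(2000):
    e = X @ theta - y                    # residual vector
    # Gradient of J = (1/n) * ||e||_2^2 + lam * ||theta||_2^2
    grad = (2 / len(y)) * (X.T @ e) + 2 * lam * theta
    theta -= lr * grad

print(theta)    # each parameter is shrunk towards zero, relative to theta_true
```

Increasing lam strengthens the “pull” towards zero; setting lam = 0 recovers ordinary unregularized regression.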

A Geometric Interpretation

Recall Einstein’s insight that gravity is curvature of spacetime. You can envision such curvature as a ball pulling on a sheet. Here is the gravity well of bodies of the solar system:

Every mass pulls on every other mass! Despite the appearance of the above, Earth does “pull on” Saturn.

The unregularized cost function we saw last time is convex, which we’ll interpret as a gravity well centered on the parameters of best fit. If we replace J with a function that only penalizes complexity, a corresponding gravity well appears, centered on parameters of zero size.

If we keep both terms, the loss surface now has two enmeshed gravity wells. If scaled appropriately, the “zero attractor” will pull the most performant solution (here \theta = (8,7)) towards a not-much-worse yet simpler model (here \theta = (4,5)).

More on L1 vs L2

Previously, I introduced the L1 norm, which underlies the mean absolute error (MAE):

\|x\|_1 = (\sum_{i=1}^{n} \lvert x_i\rvert^1)^1

Another loss function is the L2 norm, which underlies the root mean squared error (RMSE):

\|x\|_2 = (\sum_{i=1}^{n} \lvert x_i\rvert^2)^{1/2}

The L1 and L2 norms correspond to Manhattan and Euclidean distance, respectively (roughly, car vs plane travel):
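A quick numerical illustration of the two distances (a minimal sketch; the vector is arbitrary):

```python
import numpy as np

x = np.array([3.0, 4.0])
print(np.linalg.norm(x, ord=1))  # L1 / Manhattan: |3| + |4| = 7.0
print(np.linalg.norm(x, ord=2))  # L2 / Euclidean: sqrt(3^2 + 4^2) = 5.0
```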

One useful way to view norms is by their isosurface: if you can spend a fixed budget of norm traveling in any direction, the isosurface is the frontier you can reach.

The L2 isosurface is a circle. The L1 isosurface is a diamond.

  • If you travel along a single axis, the L1 and L2 norms agree: you can go the “normal” distance.
  • If you travel diagonally, the L1 norm penalizes you (travel along the hypotenuse is forbidden; you must pay for each axis separately).

The Lp Norm as Superellipse

Consider again the formulae for the L1 and L2 norms. We can generalize these as special cases of the Lp norm:

\|x\|_p = (\sum_{i=1}^{n} \lvert x_i\rvert^p)^{1/p}

Here are isosurfaces of six exemplars of this norm family:

On inspection, the above image looks like a square that’s inflating with increasing p. In fact, the Lp norm generates a superellipse.

As an aside, note that the boundaries of the Lp norm family operationalize complexity rather “intuitively”. For the L0 norm, complexity is the number of non-zero parameters. For the Linf norm, complexity is the size of the largest parameter.
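The whole family is easy to compute directly. A small sketch (the helper lp_norm and the sample values are my own illustrative choices):

```python
import numpy as np

def lp_norm(x, p):
    """Lp norm: (sum_i |x_i|^p)^(1/p)."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, 4.0])
for p in [1, 2, 4, 16]:
    print(p, lp_norm(x, p))   # 7.0, 5.0, then values approaching 4.0

# As p grows, the norm approaches max|x_i| = 4.0, i.e. the Linf norm:
print(np.linalg.norm(x, ord=np.inf))
```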

Lasso vs Ridge Regression

Why the detour into geometry?

Well, so far, we’ve expressed regularization as J = \|e\|_p + \lambda \| \theta \|_p. But most engineers choose between the L1 and L2 norms. The L1 norm is convex but not smooth (rather than bowl-shaped, it has corners), which tends to make gradient descent more difficult. But the L1 norm is also more robust to outliers, and has other benefits.

Here are two options for the residual norm:

  • \|e\|_2: sensitive to outliers, but a stable solution
  • \|e\|_1: robust to outliers, but an unstable solution

The instability of \|e\|_1 tends to be particularly thorny in practice, so \|e\|_2 is almost always chosen.

That leaves us with two remaining choices:

  • Ridge Regression: J = \|e\|_2 + \lambda\|\theta\|_2: computationally convenient, but non-sparse output.
  • Lasso Regression: J = \|e\|_2 + \lambda\|\theta\|_1: computationally less convenient, but sparse output.

What does sparse output mean? For a given model type, say y = ax^3 + bx^2 + cx + d with parameters (a, b, c, d), Ridge regression might output parameters (3, 0.5, 7.8, -0.4), whereas Lasso might output (3, 0, 7.8, 0). In effect, Lasso regression performs feature selection: locating parameters that can safely be removed. Why should this be?

Geometry to the rescue!

In ridge regression, both gravity wells have smooth, rounded isosurfaces, so their compromise can be reached anywhere on the loss surface. In lasso regression, the diamond-shaped complexity isosurface has its corners on the axes, which tends to push compromises towards points where \theta_i = 0. (The same geometry applies in higher dimensions.)
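We can watch this happen with a small sklearn sketch (the synthetic data and the alpha values are illustrative assumptions, not canonical settings):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Only features 0 and 2 matter, mirroring the (3, 0, 7.8, 0) example above.
y = 3.0 * X[:, 0] + 7.8 * X[:, 2] + 0.1 * rng.normal(size=200)

print(Ridge(alpha=1.0).fit(X, y).coef_)  # all four coefficients non-zero (irrelevant ones merely small)
print(Lasso(alpha=0.1).fit(X, y).coef_)  # irrelevant coefficients driven exactly to zero
```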

Both Ridge and Lasso regression are used in practice; the details of your application should drive your choice. I’ll also note in passing that “compromise algorithms” like Elastic Net exist, which try to capture the best parts of both.

Takeaways

I hope you enjoyed this whirlwind tour of regularization. For a more detailed look at ridge vs lasso, I recommend reading this.

Until next time.