Consciousness as a Learning Device

Part Of: Consciousness sequence
Content Summary: 1600 words, 16 min read
Inspiration: Baars (1998) A Cognitive Theory of Consciousness.

Automatization in Tasks

Almost everything we do, we do better unconsciously than consciously. In first learning a new skill we fumble, feel uncertain, and are conscious of many details of action. Once the task is learned, we lose consciousness of the details, forget the painful encounter with uncertainty, and sincerely wonder why beginners seem so slow and awkward. 

In dual task paradigms, subjects are asked to perform two tasks simultaneously. Performance is often poor, because of the limited capacity of consciousness. But when a subject extensively practices one of these tasks, the task will stop interfering with others, and performance improves.

Consider reading, the act of translating visual letters into conceptual meaning. Reading proceeds automatically. If you see the word “pink”, it is nearly impossible to avoid subvocalizing and imagining the color (inner speech and semantic recall). You are not aware of identifying individual letters, or searching your memory for the requisite sounds and meanings – they just occur.

Driving a car is yet another example of a skill that becomes automatic:

When we first learn to drive a car, we are very conscious of the steering wheel, the transmission lever, the foot pedals, and so on. But once having learned to drive, we minimize consciousness of these things and become mainly concerned with the road, with turns in the road, traffic to cope with, and pedestrians to evade. The mechanics of driving become part of the unconscious frames within which we experience the road. 

But even the road can be learned to the point of minimal conscious involvement if it is predictable enough: then we devote most of our consciousness to thinking of different destinations, of long-term goals, and so forth. The road has itself now become “framed”. The whole process is much like Alice moving through the Looking Glass, entering a new reality, and forgetting for the time being that it is not the only reality. Things that were previously conscious become presupposed in the new reality. In fact, tools and subgoals in general become framed as they become predictable and automatic.

Why, when the act of driving becomes automatic, do we become conscious of the road? Presumably the road is much more informative, relative to our purposes, than driving has become. Dodging another car, turning a blind corner, braking for a pedestrian – these are much less predictable than the handling of the steering wheel. 

The process of automatizing a skill is called habituation. Habituation involves an increase in performance and a decrease in demand for cognitive resources. But it also involves:

  • loss of self-monitoring: an unpracticed beginner is aware of their own performance, but an expert practitioner can be deceived into believing her performance was much worse than it actually was.
  • loss of long-term working memory. Consider typing: which finger is used to type the letter c? Most people have to consult their fingers to find the answer.

Suppose someone is given a shape from among the following set, and asked to memorize it. They then receive pairs of other images, and select which one is more similar. 

Pani (1982) found that, as subjects practiced the task, the original image faded from consciousness even as the responses became faster and more accurate. 

Automatization in Perception

The Pani experiment suggests that it is not merely actions that move to autopilot. Perception can fade from consciousness as well.

Consider the pressure of the chair you are sitting in. Before I mentioned it, that tactile sensation had likely faded into the background. In contrast, the visual experience of reading these words was very much at the center of your conscious experience. 

What is the difference between the tactile quality of the chair and the visual experience of these words? Redundancy! The chair feels very similar from one moment to the next, whereas each new word delivers a subtly different experience. 

These redundancy effects are pervasive. Consider the experience of moving to an area with a distinctive smell. For the first few days, the smell is at the forefront of your conscious experience; but over time, this redundant sensation fades to the background.

We have seen redundant touch and smell fade from consciousness. Why don’t we become blind to redundant visual information?

Unlike touch and smell, our fovea constantly moves across the visual field in involuntary movements called saccades. This might be one way that the visual system combats redundancy.

If you mount a tiny projector on a contact lens firmly attached to the eye, you can ensure that the visual image is invariant to eye movements. Pritchard et al (1960) found that in such conditions, the visual image fades in a few seconds. Similarly, when people look at a bright but featureless field (the Ganzfeld), they experience “blank outs” – periods when visual perception seems to fade altogether. (Natsoulas, 1982). When vision is not protected by saccades, it behaves just like the other senses.

Becoming blind to redundant information is not limited to perception. Semantic satiation occurs when a person repeats the same word over and over again, until the word starts to feel foreign and arbitrary. Try this for yourself: say “gum” to yourself 50 times and see what happens. 

There is a school of thought that interprets these redundancy effects as anatomical fatigue (perhaps processing the same image dozens of times exhausts neurotransmitters in the relevant microcircuits). But these interpretations are confounded by our ability to be surprised by the lack of a stimulus, which implies that the redundancy is encoded in terms of information rather than energy.

It is also worth noting that redundant perceptions do not fade into the background if they are highly relevant to the organism’s health and goals. Chronic pain and hunger fall under this rubric. These are, however, exceptions to the rule. 

Errors and Curiosity

When we experience difficulty performing automatized tasks, conscious access returns.

  • In reading, lexical access becomes automatic. But simply turning a book upside down will interfere with our reading proficiency, and the perceptual details of “stitching letters to form words” come back to us.
  • In visual matching, our ability to describe the original target image disappears as we become proficient. But by simply increasing task complexity, our ability to describe the target image returns.
  • In driving, if we move to a new city, our routing autopilot procedures evaporate, and we are more conscious of navigational decisions. If we buy a new car with different operating characteristics (a more sensitive brake pedal, and less sensitive steering control), the mechanical details of driving flood back into our consciousness. 

It seems that consciousness is used to debug automatic processes that run into difficulties.

We often tire of practicing tasks that we have mastered. We often tire of receiving sense data we can fully anticipate. In the case where our brain has fully habituated to some phenomena (and indeed, often before that point is reached), curiosity moves our attention towards other domains. This impulse towards novelty is one way our brain builds a diverse coalition of mental modules capable of responding to an intrinsically complicated world.

Towards A Theory of Conscious Learning

From the global workspace perspective, we expect consciousness to be involved in learning novel events. Such learning requires unpredictable communication patterns between modules; a feat only possible by way of widespread broadcasting. 

Consider the radical simplicity of the act of learning itself. To learn anything new, we merely pay attention to it. By merely allowing ourselves to interact consciously with a new language – even without a learning plan or knowledge of its syntactic structure – we nevertheless “magically” acquire the ability to comprehend and speak.

Today we explored the relationship between learning and the habituation of awareness. Baars says it best:

Habituation is not an accidental by-product of learning. Rather, it is something essential, connected at the very core to the acquisition of new information. And since learning and adaptation are perhaps the most basic functions of the nervous system, the connection between consciousness, habituation, and learning is fundamental indeed.

Factoring in our observations about error and curiosity, it seems as though learning can be modeled as a push-pull system. Learning promotes habituation, error promotes deautomatization, and curiosity redirects the brain to different activities once the current one has been mastered.

This learning-surprise versus curiosity system bears a striking resemblance to the reinforcement learning dichotomy of exploitation versus exploration. 

Towards The Future

I noted in Function of the Basal Ganglia that habituation has been associated with control shifting from the associative to the sensorimotor loop in the basal ganglia. This is hard to reconcile with the neurological basis of consciousness in the corticothalamic system. A more systematic account of these biological interactions is required. 

Consciousness has been linked to many other functions besides learning and habituation. It is most natural to interpret polyfunctional biological systems like this as having accreted function across evolutionary time. Untangling the phylogenetic ordering of these subfunctions (peeling the onion) is an important task that will require input from comparative anatomy.

The consciousness organ is not the only system to exhibit redundancy effects. Habituation to repeated input is a universal property of neural tissue. Even a single neuron will respond to electrical stimulation at a given frequency only for a while; after that, it will cease responding to the original frequency, but continue to respond to other frequencies (Kaidel et al, 1960). The relationship between the specific corticothalamic system and these microproperties of neurons is also an open research area.

Until next time. 

References

  • Baars (1998), A Cognitive Theory of Consciousness, especially sections 1.2.4, 1.3.3, 1.4.1, 1.4.4, and 3
  • Pani (1982). A functionalist approach to mental imagery.
  • Pritchard et al (1960). Visual perception approached by the method of stabilized images.
  • Kaidel et al (1960). Sensory Communication (pp 319-338).
  • Natsoulas (1982). Dimensions of perceptual awareness.

[Excerpt] Language vs Communication

Part Of: Language sequence.
Excerpt From: Tecumseh Fitch, The Evolution of Language
Content Summary: 800 words, 4 min read

What kind of sound does a dog make? That depends on which language you speak. Dogs are said to go ouah ouah in French, but ruff or woof in English. 

Crucially, however, the sounds that the dogs themselves make do not vary in this way. Dogs growl, whine, bark, howl and pant in the same way all over the world. This is because such sounds are part of the innate behavioral repertoire that every dog is born with. This basic vocal repertoire will be present even in a deaf and blind dog. This is not, of course, to say that dog sounds do not vary: they do. You may be able to recognize the bark of your own dog, as an individual, and different dog breeds produce recognizably different vocalizations. But such differences are not learned; they are the inevitable byproducts of the fact that individuals vary, and differences at the morphological, neural or “personality” level will have an influence on the sounds an individual makes. Dogs do not learn how to bark or growl, cats do not learn how to meow, and cows do not learn their individual “moos”. Such calls constitute an innate call system. By “innate” in this context, I simply mean “reliably developing without acoustic input from others”, or canalized. For example, in experiments where young squirrel monkeys were raised by muted mothers, and never heard conspecific vocalizations, they nevertheless produced the full range of calls. 

The same regularity applies to important aspects of human communication. A smile is a smile all over the world, and a frown or grimace of disgust indicates displeasure everywhere. Not only are many facial expressions equivalent in all humans, but their interpretation is as well. Many vocal expressions are equally universal. Such vocalizations as laughter, sobbing, screaming, and groans of pain or pleasure are just as innately determined as the facial expressions that normally accompany them. Babies born both deaf and blind, unable to perceive either facial or vocal signals in their environment, nonetheless smile, laugh, frown, and cry normally. Again, just as for dog barking, individuals vary, and you may well recognize the laugh of a particular friend echoing above the noise in a crowded room. And we have some volitional control over our laughter: we can (usually) inhibit socially inappropriate laughter. These vocalizations form an innate human call system. Just like other animals, we have a species-specific, innate set of vocalizations, biologically associated with particular emotional and referential states. In contrast, we must learn the words or signs of language. 

This difference between human innate calls, like laughter and crying, and learned vocalizations, like speech and song, is fundamental (even down to the level of neural circuitry). An anencephalic human baby (entirely lacking a forebrain) still produces normal crying behavior but will never learn to speak or sing. In aphasia, speech is often lost while laughter and crying remain normal. Innate human calls provide an intuitive framework for understanding a core distinction between language and most animal signals, which are more like the laughs and cries of our own species than like speech. Laughs and cries are unlearned signals with meanings tied to important biological functions. To accept this fact is not to deny their communicative power. Innate calls can be very expressive and rich – indeed their affective power may be directly correlated with their unlearned nature. The “meaning” of a laugh can range from good-natured conviviality to scornful, derisive exclusion, just as a cat’s meow might “mean” she wants to go out, she wants food, or she wants to be petted. Insightful observers of animals and man have recognized these fundamental facts for many years. 

Obviously, signals of emotion and signals of linguistic meaning are not always neatly separable. Our linguistic utterances are typically accompanied by “non-verbal” cues (vocal prosody, facial expressions, and gestures) to how we feel about what we are saying. One signal typically carries both semantic information intelligible only to those who know the language, and a more basic set of information that can be understood by any human being or even other animals. Non-verbal expressive cues are invaluable to the child learning language, helping to coordinate joint attention and disambiguate the message and context. They also make spoken utterances more expressive than a written transcription alone. Other than the exclamation mark or emoticons, our tools to transcribe the expressive component are limited, but the ease and eagerness with which humans read illustrates that we can nonetheless understand language without this expressive component. This too reinforces the value of a distinction between two parallel, complementary systems. 

As we discuss other animals’ communication systems, I invite the reader to compare these systems not only to language exchanges, but also to the last time you had a good laugh with a group of friends, and the warm feeling that goes along with it, or the sympathetic emotions summoned by seeing someone else cry, scream, or groan in pain. The question we must ask is: “is this call type more like human laughter and crying, or more like speech or song?” I will shortly argue that all non-human communication systems fall in the former category. 

Intro to Continental Drift

Part Of: Biology sequence
Content Summary: 1500 words, 15 min read

Continental Drift

Every school child recognizes that the shapes of Africa and South America “match” one another, like puzzle pieces. The meteorologist Alfred Wegener went further, and showed that not only do the shapes of the continents match, but a beach in South America was often more similar to its “counterpart” in Africa than to adjacent beaches along its own coastline. On the basis of such data, he proposed continental drift. Africa and South America had once been neighbors, but spread apart over the course of Earth’s history. 

The theory of continental drift was initially controversial. The evidence of continental drift was there, yet geologists were unconvinced because they could not conceive of a mechanism: a physical process that might cause entire continents to move. The Expanding Earth hypothesis held that continental drift was an artifact of an expanding earth, with oceans “filling in the gaps” between the continents – but these conjectures were never formalized, nor did they receive experimental support. Eventually, however, powerful evidence led tectonic theory to emerge as the mechanism powering continental drift. 

Let’s turn our attention to tectonic theory.

Tectonic Theory

The very first evidence of tectonic theory came from expeditions to map the ocean floor. These revealed enormous mountain ranges that ran down precisely the middle of the Atlantic Ocean (among other places). Those mountain ranges were later discovered to be volcanically active. Deep trenches were also discovered around this time.

It was realized at this time that the volcanoes and trenches comprise a kind of conveyor belt system, with two complementary mechanisms for creating and retiring crust. Just as wooden blocks are pulled apart when placed in boiling water, continents are pulled apart by convection currents generated by heat from the Earth’s interior.

Our ability to use seismographs to record earthquakes was also maturing. You may have heard of the Ring of Fire: a band along the edge of the Pacific Ocean with higher susceptibility to earthquakes (I’m looking at you, San Francisco). If you look at a more complete distribution of earthquakes, you can begin to see shapes emerging. 

These tiles are tectonic plates. Here is a higher resolution image of plate boundaries. 

Tectonic plates are not hypotheses; they are physical objects with a history. A good way to appreciate this is by understanding that we can use earthquake measurement instruments to see into the Earth’s interior, in a process not unlike echolocation. Such techniques have been used to figure out the diameter of Earth’s core. They have also revealed fully submerged tectonic plates. These include the Farallon Plate underneath North America, which has not yet been fully reabsorbed by the surrounding mantle.

When rocks are created, they are hot enough to receive an imprint of the Earth’s magnetic field. During WW2, submarines noticed the seafloor was striped: at one location, the magnetic field was pointed North; move a few kilometers west, and the field pointed South. Why the stripes? Separately, evidence emerged for geomagnetic reversals: during the last 83 million years, the Earth’s magnetic field has reversed 183 times. Continental drift and geomagnetic reversals explain the magnetic stripes.

Volcanoes, Mountains, and Cratons

Volcanoes promote seafloor spreading. But not all volcanoes exist at crust boundaries. Consider Hawaii. The fifteen volcanoes that make up the eight islands of Hawaii are the youngest in a chain of more than 129 volcanoes in the Hawaiian-Emperor seamount chain. Note the “V” shaped pattern.

Why is there such a long chain of dormant volcanoes connected to the active volcanoes in Hawaii? The most common explanation is mantle plumes, caused by processes of Rayleigh–Taylor instability. Tectonic plates drag oceanic crust over these plume-based hotspots, which “poke holes” into the lithosphere. Like fabric in a sewing machine…

Why does this seamount chain “change direction”? Magnetic evidence shows the plates simply changed direction, some 40 million years ago. 

Volcanoes create new crust, which is carried along on a conveyor belt, for consumption in the trenches. In this sense, oceanic crust is perpetually being recycled, with the creation of new and the destruction of old crust occurring simultaneously. This explains two salient facts: oceanic crust is much younger and thinner than continental crust.

As a result of this conveyor-like motion, the thin and mobile oceanic crust slowly “squeezes” continental land mass. This is the basis of orogeny, the process of mountain formation. It’s fun to think about, especially when you’re traveling across these tremendous landmarks…

Most mountains lie near the coastline. Why are the Himalayas so far inland? Rather than oceans exerting compressive force on Asia, the Himalayas were formed by the entire subcontinent of India slowly moving north, and ultimately engaging in a slow-motion collision with the Eurasian landmass.

Oceanic crust doesn’t last long before being recycled. Continental crust has a talent for persisting, and growing increasingly thick. Here, the concept of a geological province may help. Some continental crust is truly ancient, massive, and deep: these are called cratons. It is primarily in these 2-billion-year-old rocks that we find kimberlite, the stuff that contains diamonds. 

The Supercontinent Cycle

As you can see, there is no room for doubt that continents move. Indeed, GPS is able to detect continental drift in real time (arrows represent the direction and magnitude of drift). 

Let’s regroup. We know how the continents are arranged today, and are able to infer some information about other time periods. How much information, exactly? 

Looking forward, we have a fairly good idea of what will happen during the next 50 million years. It doesn’t take a rocket scientist to look at the GPS data above and conclude that Africa will collide with Europe, for example. But even though continents move very slowly and predictably on human time scales, after a certain amount of time our models begin to outstrip our data.

Looking backward, we have much more data to use in reverse-engineering previous geological periods. For example, we’ve already seen coastline-matching and mountain ridge evidence suggesting that Africa and South America were adjacent during the reign of supercontinent Pangaea, some 300 million years ago. 

Here is a reasonably high-quality animation of continental drift.

So that’s Pangaea. We also have solid evidence for another supercontinent called Rodinia some 700 mya. 

Our understanding of these two supercontinents is surprisingly complete. However, our ability to reverse-engineer the past becomes less precise (and correspondingly, more controversy-ridden) as we go far beyond 1000 mya. Nevertheless, most geologists think they can make out the existence of four more supercontinents in their data. 

Mineralogy provides some evidence for the existence of supercontinents: certain minerals are only formed during these periods. Below is the formation history of molybdenite; similar graphs exist for other minerals. Further, magnetic imprints and radiometry allow us to glimpse the spatial organization of extinct supercontinents.

Why so many supercontinents? Consider again the story of Pangaea. It used to be happily unified. But then a transcontinental rift occurred, separating what is now Africa and South America. Rifts occur where convection currents pull apart a single plate, and the in-between land sinks; modern-day rifts exist under the Red Sea, and in the East African Rift Valley. 

During a supercontinent, there is a single continuous body of water. After a transcontinental rift, a new ocean forms inside the rift (in our case, the Atlantic). The Atlantic Ocean is growing, and the Pacific Ocean is shrinking. This process will continue until the Pacific Ocean is no more: a single ocean, and a single supercontinent. This is the supercontinent cycle.

Biogeography

Animals have been around for some 700 million years. Dead animals sometimes leave behind fossils. 

Species are composed of populations: members of a species that live adjacent to one another. These populations tend to occupy a continuous stretch of land. Why? Regardless of where speciation occurs, populations have only one way to spread out: walking.

The distribution of fossils starts to make more sense when you recall where the continents existed when these organisms lived.

Continental drift doesn’t just shed light on extinct species, but also living ones. An example to whet your appetite:

Why are marsupial mammals largely confined to Australia? By the time placentas were invented, Australia had separated from Pangaea.

Finally, biogeography is a predictive science: we can use it to make predictions about where to find fossils. Paleontologists can and do consult with geologists to figure out where to look in today’s rocks to find yesterday’s animals. Did you think the transitional form Tiktaalik was found by accident? No: Neil Shubin looked at species on either side of the “gap”, and reviewed when & where they lived. He then interpolated the spacetime location of the transitional forms, “replaying the clock” to figure out where on contemporary Earth to look: his team went to look at specific strata on Ellesmere Island in the Canadian Arctic, and the rest is history.

In closing, a few interesting parallels between geology and biology are worth noting:

  • Tectonic theory explains continental drift, just as natural selection explains common descent. 
  • More recently, we can directly observe continental drift with GPS data; just as we can now directly observe speciation. 

Takeaways

  • Continents move. South America used to be connected to Africa. We can even see continents drift using GPS.
  • Continental drift occurs because the Earth’s interior is hot, creating convection currents that push on the tectonic plates above.
  • Mountains are created by plates squishing into one another. 
  • Some volcanoes are formed by mantle plumes “poking holes” into otherwise solid plates.
  • Ocean crust is “recycled” relatively rapidly. Continental crust persists as cratons.
  • Continents regularly coalesce into supercontinents, then disperse via a transcontinental rift. This is the supercontinent cycle.
  • The history of life unfolds alongside continental drift. We can use the Earth sciences to constrain our knowledge of common descent.

Intro to Regularization

Part Of: Machine Learning sequence
Followup To: Bias vs Variance, Gradient Descent
Content Summary: 1100 words, 11 min read

In Intro to Gradient Descent, we discussed how loss functions allow optimization methods to locate high-performance models.

But in Bias vs Variance, we discussed how model performance isn’t the only thing that matters. Simplicity promotes generalizability.

One way to enhance simplicity is to receive the model discovered by gradient descent, and manually remove unnecessary parameters.

But we can do better. In order to automate parsimony, we can embed our preference for simplicity into the loss function itself.

But first, we need to quantify our intuitions about complexity.

Formalizing Complexity

Neural networks are often used to classify large collections of images. The complexity of such models tends to correlate with the number of layers (and hence the number of parameters). For some models, then, complexity is captured by the number of parameters.

While not used much in the industry, polynomial models are pedagogically useful examples of regression models. Here, the degree of the polynomial expresses the complexity of the model: a degree-eight polynomial has more “bumps” than a degree-two polynomial.

Consider, however, the difference between the following regression models:

y_A = 4x^4 + 0.0001x^3 + 0.0007x^2 + 2.1x + 7

y_B = 4x^4 + 2.1x + 7

Model A uses five parameters; Model B uses three. But their predictions are, for all practical purposes, identical. Thus, the size of each parameter is also relevant to the question of complexity.
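
To make this concrete, here is a quick numerical check (a sketch assuming numpy; the input range is an arbitrary choice):

```python
import numpy as np

# Evaluate both models over an (arbitrarily chosen) input range.
x = np.linspace(-10, 10, 1000)

y_a = 4*x**4 + 0.0001*x**3 + 0.0007*x**2 + 2.1*x + 7   # five parameters
y_b = 4*x**4 + 2.1*x + 7                               # three parameters

print(np.max(np.abs(y_a - y_b)))  # worst-case gap: ~0.17
print(np.max(np.abs(y_a)))        # scale of the predictions: ~4e4
```

Relative to the scale of the predictions, the two extra parameters change essentially nothing.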

The above approaches rely on the model’s parameters (its “visceral organs”) to define complexity. But it is also possible to rely on the model’s outputs (its “behaviors”) to achieve the same task. Consider again the classification decision boundaries above. We can simply measure the spatial frequency (the “squiggliness” of the boundary) as another proxy for complexity.

Here, then, are three possible criteria for complexity:

  1. Number of parameters
  2. Size of parameters
  3. Spatial frequency of decision manifold

Thus, operationalizing the definition of “complexity” is surprisingly challenging.

Mechanized Parsimony

Recall our original notion of the performance-complexity quadrant. By defining our loss function exclusively in terms of the residual error, gradient descent learns to prefer accurate models (to “move upward”). Is there a way to induce leftward movement as well?

To have gradient descent respond to both criteria, we can embed them into the loss function. One simple way to accomplish this: addition.

This technique is an example of regularization.

Depending on the application, sometimes the errors are much larger than the parameters, or vice versa. To ensure the right balance between these terms, people usually add a hyperparameter \lambda to the regularized loss function: J = \|e\|_2 + \lambda \|\theta\|_2
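
As a minimal sketch (assuming numpy), the regularized loss is just a weighted sum of two norms:

```python
import numpy as np

def regularized_loss(residuals, theta, lam):
    """J = ||e||_2 + lambda * ||theta||_2: accuracy plus a complexity penalty."""
    return np.linalg.norm(residuals, 2) + lam * np.linalg.norm(theta, 2)

# A small lambda barely penalizes complexity; a large lambda dominates the loss.
print(regularized_loss(np.array([1.0, -2.0]), np.array([4.0, 0.0001, 2.1]), lam=0.1))
```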

A Geometric Interpretation

Recall Einstein’s insight that gravity is curvature of spacetime. You can envision such curvature as a ball pulling on a sheet. Here is the gravity well of bodies of the solar system:

Every mass pulls on every other mass! Despite the appearance of the above, Earth does “pull on” Saturn.

The unregularized cost function we saw last time creates a convex loss surface, which we’ll interpret as a gravity well centered around the parameters of best fit. If we replace J with a function that only penalizes complexity, a corresponding gravity well appears, centered around parameters of zero size.

If we keep both terms, we see the loss surface now has two enmeshed gravity wells. If scaled appropriately, the “zero attractor” will pull the most performant solution (here \theta = (8,7)) towards a not-much-worse yet simpler model \theta = (4,5).

More on L1 vs L2

Previously, I introduced the L1 norm, aka mean absolute error (MAE):

\|x\|_1 = (\sum_{i=1}^{n} \lvert x_i\rvert^1)^1

Another loss function is the L2 norm, aka root mean squared error (RMSE):

\|x\|_2 = (\sum_{i=1}^{n} \lvert x_i\rvert^2)^{1/2}

The L1 and L2 norms respectively correspond to Manhattan vs Euclidean distance (roughly, car vs plane travel):

One useful way to view norms is by their isosurface. If you can travel in any direction for a finite amount of time, the isosurface is the frontier you might sketch.

The L2 isosurface is a circle. The L1 isosurface is a diamond.

  • If you don’t change direction, you can travel the “normal” L2 distance.
  • If you do change direction, your travel becomes inefficient (since “diagonal” travel along the hypotenuse is forbidden).

The Lp Norm as Superellipse

Consider again the formulae for the L1 and L2 norm. We can generalize these as special cases of the Lp norm:

\|x\|_p = (\sum_{i=1}^{n} \lvert x_i\rvert^p)^{1/p}
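
The formula translates directly into code. A minimal sketch, assuming numpy:

```python
import numpy as np

def lp_norm(x, p):
    """The Lp norm: (sum_i |x_i|^p)^(1/p)."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

v = np.array([3.0, -4.0])
print(lp_norm(v, 1))  # 7.0 (Manhattan)
print(lp_norm(v, 2))  # 5.0 (Euclidean)
```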

Here are isosurfaces of six exemplars of this norm family:

On inspection, the above image looks like a square that’s inflating with increasing p. In fact, the Lp norm generates a superellipse.

As an aside, note that the boundaries of the Lp norm family operationalize complexity rather “intuitively”. For the L0 norm, complexity is the number of non-zero parameters. For the Linf norm, complexity is the size of the largest parameter.

Lasso vs Ridge Regression

Why the detour into geometry?

Well, so far, we’ve expressed regularization as J = \|e\|_p + \lambda \| \theta \|_p. But most engineers choose between the L1 and L2 norms. The L1 norm is convex but not smooth (it has a corner at zero, unlike the bowl-shaped L2 norm), which tends to make gradient descent more difficult. But the L1 norm is also more robust to outliers, and has other benefits.

Here are two options for the residual norm:

  • \|e\|_2: sensitive to outliers, but a stable solution
  • \|e\|_1: robust to outliers, but an unstable solution

The instability of \|e\|_1 tends to be particularly thorny in practice, so \|e\|_2 is almost always chosen.

That leaves us with two remaining choices:

  • Ridge Regression: J = \|e\|_2 + \lambda\|\theta\|_2: computationally efficient, but non-sparse output.
  • Lasso Regression: J = \|e\|_2 + \lambda\|\theta\|_1: computationally less efficient, but sparse output.

What does sparse output mean? For a given model type, say y = ax^3 + bx^2 + cx + d with parameters (a, b, c, d), Ridge regression might output parameters (3, 0.5, 7.8, -0.4) whereas Lasso might output (3, 0, 7.8, 0). In effect, Lasso regression is performing feature selection: locating parameters that can be safely removed. Why should this be?

Geometry to the rescue!

In ridge regression, the circular complexity isosurface has no corners: compromises between the two gravity wells can be reached anywhere on the loss surface. In lasso regression, the diamond-shaped complexity isosurface has corners on the axes, which tends to push compromises towards points where \theta_i = 0. (In higher dimensions, the same geometry applies.)
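
You can watch this sparsity effect directly. The following sketch assumes scikit-learn is available; the synthetic data and the alpha values are illustrative choices, not canonical settings:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: the true model is y = 3x^3 + 7.8x, so the x^2 feature is irrelevant.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
X = np.column_stack([x**3, x**2, x])          # features for parameters (a, b, c)
y = 3*x**3 + 7.8*x + rng.normal(0, 0.5, 200)

print(Ridge(alpha=1.0).fit(X, y).coef_)  # all coefficients shrunk, but nonzero
print(Lasso(alpha=0.1).fit(X, y).coef_)  # the irrelevant coefficient driven to (near) zero
```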

Both Ridge and Lasso regression are used in practice. The details of your application should influence your choice. I’ll also note in passing that “compromise algorithms” like Elastic Net exist, which try to capture the best parts of both.

Takeaways

I hope you enjoyed this whirlwind tour of regularization. For a more detailed look at ridge vs lasso, I recommend reading this.

Until next time.

Deep Homology: our shared genetic toolkit

Part Of: Biology sequence
Content Summary: 1400 words, 14 min read.

Ernst Mayr once wrote that “the search for homologous genes is quite futile except in very close relatives”. But evolutionary developmental biology (aka evo devo) has turned this common knowledge on its head. All complex animals (flies and flycatchers, dinosaurs and trilobites, flatworms and humans) share a common genetic toolkit of genes that govern the formation and patterning of their bodies and body parts.

Let’s dive in.

Principles of Development

Bodies are not built at random. They are constrained by two important organizing principles.

  1. Bilateral symmetry. The left and right sides of our bodies tend to mirror one another.
  2. Modularity. Our genomes tend to build recurring segments, and then proceed to customize each segment.

Modularity is one of the most important principles in anatomy. In protostomes like worms and centipedes, modules are expressed in repeating segments. In deuterostomes like penguins and humans, modules are expressed in repeated somites (e.g. vertebrae).

Consider the following facts:

  • Trilobite anatomy features many identical legs. In contrast, its descendants (e.g., crayfish) have fewer, highly specialized appendages.
  • Early teeth in e.g., sharks were numerous and undifferentiated. Contrast this with the horse, which has incisors, canines, premolars, and molars.

Williston’s Law generalizes such observations. Earlier species have many, unspecialized modular repetitions. Over time, there is a trend towards fewer, increasingly specialized parts.

Hox Genes and Localization

The genome contains coding genes, which directly encode proteins, and also regulatory genes, which modify the activation profile of those coding genes. Regulatory genes form a regulatory hierarchy, whereby gene activation is controlled by increasingly specific activation profiles. The result is an abstraction hierarchy, analogous to the feature hierarchies found in convolutional neural networks. This regulatory system can, for example, deploy calcium proteins in areas where bone formation is prescribed.

Recall that every cell in an organism contains the exact same DNA. How then does one cell know to become an eye tissue, and another cell knows to become liver tissue? How is cellular differentiation possible?

In order to learn what kind of cell it is, a cell must learn where it is: differentiation requires localization. Roughly speaking, a cell will manufacture eye-specific proteins once it knows that it is located above the nose, and between the ears.

How do cells learn their position? One bit at a time. Per the intension-extension tradeoff, as cells get more location information, their localization window shrinks.

All this is nice in theory, but how does it work in practice?

Just as brain lesions shed light on neuroscience, birth defects shed light on developmental biology. Biologists have been particularly interested in homeotic mutations: mutations that cause body structure to grow in “the wrong place”. Examples include extra fingers in humans, only one central eye in sheep, and legs in the place of eyes in the fruit fly.

A closer look has revealed that homeotic mutations are caused by damage to a specific set of genes: homeobox (Hox) genes. These genes (near the top of the regulatory hierarchy) encode location information, and are conserved across species – the same genes exist in a mouse and a fruit fly:

Hox genes help explain the phenomenon of Williston’s Law (module customization). In arthropod segments, boundaries in Hox gene expression promote customizations across different segments:  

Bodybuilding & the Genetic Toolkit

In both flies and humans, the very same gene (Pax-6) orchestrates eye development, despite enormous differences in eye phenotypes. Even if you activate this gene in the wing of a fly, that wing will grow eye tissue. When the gene is deactivated, eye formation fails. And if you transplant the fly’s Pax-6 gene into an eyeless mouse, that mouse will regain the ability to grow its eyes.

The Pax-6 gene is an example of a master bodybuilder gene. Here are two other examples from this category:

  1. The DLL “Distal-Less” gene builds appendages: legs in chickens, fins in fish, siphons in sea squirts, and tube feet in sea urchins.
  2. The NK2 “Tinman” gene contributes to the circulatory system. It orchestrates heart development across many different phyla.

There is more to the story than just Hox and bodybuilder genes. Other “master” genes are shared across all animal phyla. These include genes for hormones, those that regulate cell type, those involved in signaling pathways, coloration, receptor mechanisms, and other DNA binding use cases. Together, these genes comprise the genetic toolkit: a set of genes responsible for the development of multicellular organisms.

Explaining the Cambrian Explosion

Often two different species will have a feature in common.  Such facts can be explained in two different ways.

  1. Homology: the feature is shared because it was invented in a common ancestor of both species.
  2. Homoplasy (aka analogy, or convergent evolution): the feature was not derived from a common ancestor; it was invented separately and independently

One example of homology is having four limbs: our tetrapod ancestors were the first to try this new body plan. In contrast, the evolution of wings in birds and bats is an example of homoplasy – the common ancestor of these species was terrestrial. This example is nicely illustrated in a phylogeny:

Protostomes and deuterostomes use the very same Hox and bodybuilder genes. It is very unlikely that the exact same genetic toolkit was constructed twice. The most parsimonious explanation is homology. Their common ancestor, a bilaterally symmetric population called Urbilateria, also possessed this genetic toolkit. Specifically, we can safely conclude that Urbilateria had a toolkit of at least six or seven Hox genes, Pax-6, Distal-Less, Tinman, and a few hundred more bodybuilding genes.

Urbilaterians have not yet been found in the fossil record. However, we can do better than envisioning some featureless worm. We can use our knowledge of their genome to infer their body plans.

Because Pax-6 resides in both branches of bilaterians, Urbilateria probably had some kind of light-sensing organ. Similar inferences from homologous genes add more detail to this portrait. The first bilateral population probably had some form of appendage, a primitive heart, a through-gut with mouth and anus, and a diverse set of cell types (including photoreceptive, nerve, muscle, digestive, secretory, phagocytic, and contractile).

One of the great mysteries of evolutionary biology is the Cambrian Explosion, an adaptive radiation where dozens of new phyla appear in the fossil record in the span of about 40 million years. The genetic toolkit was fully in place by the time of the Cambrian Explosion. It seems likely that the compilation of the toolkit was an important prerequisite for such a radiation (although ecological factors surely also played a role).

Deep Homology

Consider the evolution of the eye. Evolutionary biologists once thought eyes had evolved independently dozens or even hundreds of times. This remarkable feat of evolution was attributed to strong selective pressure: having a light-sensitive organ just pays off, and the selective pressure is overwhelming enough to induce many species towards the same end product.

But modern genomics has revealed that these “independent inventions” actually derive from the redeployment of the conserved Pax-6 gene.  The diversity of modern eyes is the result of specializations built on top of this basic genetic framework.

More generally, the deep homology hypothesis suggests that the body organization of all bilaterians derives from a substantial swathe of genes that comprise our genetic toolkit. Bilaterians do not invent novel developmental regimes whole-cloth. Rather, once the full toolkit was assembled, changes in phyla occurred via alterations of regulatory circuits.

This principle sharpens how scientists explore new hypotheses. For example, humans are unique among primates for our penchant for vocal mimicry (we can learn how to produce novel sounds). But our more distant relatives (parrots, even seals) practice vocal mimicry. Because so much of our genetic material is conserved, we cannot afford to ignore similarities with even our distant relatives.

Until next time.

Intro to Gradient Descent

Part Of: Machine Learning sequence
Content Summary: 800 words, 8 min read

Parameter Space vs Feature Space

Let’s recall the equation of a line.

y = mx + b where m is the slope, and b is the y-intercept.

The equation of a line is a function that maps from inputs (x) to outputs (y). Internal to that model (the knobs inside the box) reside parameters like m and b that mold how the function works.

Any model k can be uniquely described by its parameters \{ m_k, b_k \}. Just as we can plot data in a data space using Cartesian coordinates (x, y), we can plot models in a parameter space using coordinates (m,b).

As we traverse parameter space, we can view the corresponding models in data space.

[Animation: traversing parameter space while viewing the corresponding models in data space]

As we proceed, it is very important to hold these concepts in mind.

Loss Functions

Consider the following two regression models. Which one is better?

[Figure: two regression models, A and B]

The answer comes easily: Model A. But as the data volume increases, choosing between models can become quite difficult. Is there a way to automate such comparisons?

Put another way, your judgment about model goodness is an intuition manufactured in your brain. Algorithms don’t have access to your intuitions. We need a loss function that translates intuitions into numbers.

Regression models are functions of the form \hat{y} = f(\textbf{x}) where x is the vector of features (predictors) used to generate the label (prediction). We can define error as \hat{y} - y. In fact, we typically reserve the word error for test data, and residual for train data. Here are the residuals for our two regression models:

[Figure: residuals for the two regression models]

The larger the residuals, the worse the model. Let’s use the residual vector to define a loss function. To do this, in the language of database theory, we need to aggregate the column down to a scalar. In the language of linear algebra, we need to compute the length of the vector.

Everyone agrees that residuals matter, when deriving the loss function. Not everyone agrees how to translate the residual vector into a single number. Let me add a couple examples to motivate:

Sum together all prediction errors.

But then a deeply flawed model with residual vector [ -30, 30, -30, 30] earns the same score as a “perfect” model [0, 0, 0, 0]. The moral: positive and negative errors should not cancel each other out.

Sum together the magnitude of the prediction errors.

But then a larger dataset costs more than a small one. A good model against a large dataset with residual vector [ 1, -1, 1, -1, 1, -1 ] earns the same score as a poor model against small data [ 3, -3 ]. The moral: cost functions should be invariant to data volume.

Find the average magnitude of the prediction errors.

This loss function suffers from fewer bugs. It even has a name: Mean Absolute Error (MAE), also known as the L1-norm.

There are many other valid ways of defining a loss function that we will explore later. I  just used the L1-norm to motivate the topic.
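
A sketch of this loss function, assuming numpy:

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """L1-style loss: the average magnitude of the residuals."""
    return np.mean(np.abs(y_pred - y_true))

# The flawed candidates above, revisited:
residuals = np.array([-30.0, 30.0, -30.0, 30.0])
print(np.sum(residuals))           # 0: signed errors cancel
print(np.mean(np.abs(residuals)))  # 30: MAE exposes the flawed model
```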

Grading Parameter Space

Let’s return to the question of evaluating model performance. For the following five models, we intuitively judge their performance as steadily worsening:

[Figure: five regression models of steadily worsening fit]

With loss functions, we convert these intuitive judgments into numbers. Let’s include these loss numbers in label space, and encode them as color.

[Figure: the same five models, with loss encoded as color in parameter space]

Still with me? Something important is going on here.

We have examined the loss of five models. What happens if we evaluate two hundred different models? One thousand? A million? With enough samples, we can gain a high-resolution view of the loss surface. This loss surface can be expressed with loss as color, or loss as height along the z-axis.

[Figure: sampling the loss surface, with loss as color and as height]

In this case, the loss surface is convex: it is in the shape of a bowl.
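
Conceptually, the surface comes from grading a grid of models. A sketch (assuming numpy, and that arrays x and y hold the dataset):

```python
import numpy as np

# Grade every model on a grid of parameter space.
ms = np.linspace(-20, 20, 200)   # candidate slopes
bs = np.linspace(-20, 20, 200)   # candidate intercepts
loss = np.empty((len(ms), len(bs)))

for i, m in enumerate(ms):
    for j, b in enumerate(bs):
        residuals = (m * x + b) - y              # one residual per data point
        loss[i, j] = np.mean(np.abs(residuals))  # the L1 loss from earlier

# `loss` can now be rendered as color, or as height along the z-axis.
```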

Navigating the Loss Surface

The notion of a loss surface takes a while to digest. But it is worth the effort. The loss surface is the reason machine learning is possible.

By looking at a loss surface, you can visually identify the global minimum: the model instantiation with the least loss. In our example above, that is (7, -14), which encodes the model y = 7x - 14 with the smallest loss L(\theta) = 2.

Unfortunately, computing the loss surface is computationally intractable. It takes too long to calculate the loss of every possible model. How can we do better?

  1. Start with an arbitrary model.
  2. Figure out how to improve it.
  3. Repeat.

One useful metaphor for this kind of algorithm is a flashlight in the dark. We can’t see the entire landscape, but our flashlight provides information about our immediate surroundings.

[Figure: a flashlight illuminating local surroundings in the dark]

But what local information can we use to decide where to move in parameter space? Simple: the gradient (i.e., the slope)! If we move downhill on this bowl-like surface, we will come to rest at the set of best parameters.
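
Here is a minimal sketch of that downhill walk for the line model (assuming numpy, that arrays x and y hold the training data, and using squared error so the gradient is easy to derive):

```python
import numpy as np

m, b = 0.0, 0.0          # 1. start with an arbitrary model
lr = 0.01                # step size (an assumed hyperparameter)

for _ in range(1000):    # 3. repeat
    y_hat = m * x + b
    grad_m = np.mean(2 * (y_hat - y) * x)  # 2. slope of the loss w.r.t. m
    grad_b = np.mean(2 * (y_hat - y))      #    ...and w.r.t. b
    m -= lr * grad_m     # step downhill in parameter space
    b -= lr * grad_b
```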

[Animation: a ball rolling down a hill]

This is how gradient descent works, in both spaces.

This is how prediction machines learn from data.

Until next time.

Bias vs Variance

Part Of: Machine Learning sequence
Followup To: Regression vs Classification
Content Summary: 800 words, 8 min read

A Taxonomy of Models

Last time, we discussed how to create prediction machines. For example, given an animal’s height and weight, we might build a prediction machine to guess what kind of animal it is. While this sounds complicated, in this case prediction machines are simple region-color maps, like these:

[Figure: comparing two classification outputs]

These two classification models are both fairly accurate, but differ in their complexity. 

But it’s important to acknowledge the possibility of erroneous-simple and erroneous-complex models. I like to think of models in terms of an accuracy-complexity quadrant.

[Figure: the accuracy-complexity quadrant for classification]

This quadrant is not limited to classification. Regression models can also vary in their accuracy, and their complexity.

[Figure: accuracy-complexity quadrants for classification and regression]

A couple brief caveats before we proceed.

  • This quadrant concept is best understood as a two-dimensional continuum, rather than a four-category space. More on this later.
  • Here “accuracy” tries to capture lay intuitions about prediction quality & performance. I’m not using it in the metric sense of “alternative to F1 score”.

Formalizing Complexity

Neural networks are often used to classify large collections of images. The complexity of such models tends to correlate with the number of layers (and hence the number of parameters). For some models, then, complexity is captured by the number of parameters.

While not used much in the industry, polynomial models are pedagogically useful examples of regression models. Here, the degree of the polynomial expresses the complexity of the model: a degree-eight polynomial has more “bumps” than a degree-two polynomial.

Consider, however, the difference between the following regression models:

y_A = 4x^4 + 0.0001x^3 + 0.0007x^2 + 2.1x + 7
y_B = 4x^4 + 2.1x + 7

Model A uses five parameters; Model B uses three. But their predictions are, for all practical purposes, identical. Thus, the size of each parameter is also relevant to the question of complexity.

The above approaches rely on the model’s parameters (its “visceral organs”) to define complexity. But it is also possible to rely on the model’s outputs (its “behaviors”) to achieve the same task. Consider again the classification decision boundaries above. We can simply measure the spatial frequency (the “squiggliness” of the boundary) as another proxy for complexity.

Here, then, are three possible criteria for complexity:

  1. Number of parameters
  2. Size of parameters
  3. Spatial frequency of decision manifold

Thus, operationalizing the definition of “complexity” is surprisingly challenging. But there is another way to detect whether a model is too complex…

Simplicity as Generalizability

Recall our distinction between training and prediction:

[Figure: training vs prediction]

We compute model performance on historical data. We can contrast this with model performance against future data. 

[Figure: the quadrant operationalized as historical vs future error]

Take a moment to digest this image. What is it telling you?

Model complexity is not merely aesthetically ugly. Rather, complexity is the enemy of generalization. Want to future-proof your model? Simplicity might help!

Underfitting vs Overfitting

There is another way of interpreting this tradeoff, one that emphasizes the continuity of model complexity. Starting from a very simple model, increases in model complexity will reduce both historical and future error. The best response to underfitting is increasing the expressivity of your model.

But at a certain point, your model will become too complex, and begin to overfit the data. At that point, your historical error will continue to decrease, but your future error will increase.

[Figure: complexity vs error, showing underfitting and overfitting]

Data Partitioning: creating a Holdout Set

So you now appreciate the importance of striking a balance between accuracy and simplicity. That’s all very nice conceptually, but how might you go about building a well-balanced prediction machine?

The bias-variance trade-off only becomes apparent when the machine is given new data!  “If only I had practiced against unseen test data earlier”, the statistician might say, “then I could have discovered how complex to make my model before it was too late”.

Read the above regret again. It is the germinating seed of a truly enormous idea.

Many decades ago, some creative mind took the above regret and sought to reform it: “What stops me from treating some of my old, pre-processed data as if it were new? Can I not hide data from myself?”

[Figure: partitioning historical data into training and holdout sets]

This approach, known as data partitioning, is now ubiquitous in the machine learning community.  Historical-Known data is the training set, Historical-Novel data is the test set, aka the holdout set.

How much data should we put in the holdout set? While the correct answer ultimately derives from the particular application domain, a typical rule of thumb:

  • On small data (~100 thousand records), data are typically split to 80% train, 20% test
  • On large data (~10 billion records), data are typically split to 95% train, 5% test
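
A sketch of data partitioning (assuming numpy; scikit-learn ships an equivalent train_test_split helper):

```python
import numpy as np

def partition(X, y, test_fraction=0.2, seed=0):
    """Hide a random holdout set from training."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(len(X) * (1 - test_fraction))
    return X[idx[:cut]], X[idx[cut:]], y[idx[:cut]], y[idx[cut:]]

# X_train, X_test, y_train, y_test = partition(X, y, test_fraction=0.2)
```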

Next time, we will explore cross-validation (CV). Cross-validation is sometimes used instead of, and other times in addition to, data partitioning.

See you then!