Intro to Regularization

Part Of: Machine Learning sequence
Followup To: Bias vs Variance, Gradient Descent
Content Summary: 1100 words, 11 min read

In Intro to Gradient Descent, we discussed how loss functions allow optimization methods to locate high-performance models.

But in Bias vs Variance, we discussed how model performance isn’t the only thing that matters. Simplicity promotes generalizability.

One way to enhance simplicity is to receive the model discovered by gradient descent, and manually remove unnecessary parameters.

But we can do better. In order to automate parsimony, we can embed our preference for simplicity into the loss function itself.

But first, we need to quantify our intuitions about complexity.

Formalizing Complexity

Neural networks are often used as classification models against large numbers of images. The complexity of the models tends to correlate with the number of layers. For some models then, complexity is captured in the number of parameters.

While not used much in the industry, polynomial models are pedagogically useful examples of regression models. Here, the degree of the polynomial expresses the complexity of the model: a degree-eight polynomial has more “bumps” than a degree-two polynomial.

Consider, however, the difference between the following regression models

y_A = 4x^4 + 0.0001x^3 + 0.0007x^2 + 2.1x + 7

y_B = 4x^4 + 2.1x + 7

Model A uses five parameters; Model B uses three. But their predictions are, for all practical purposes, identical. Thus, the size of each parameter is also relevant to the question of complexity.

The above approaches rely on the model’s parameters (its “visceral organs”) to define complexity. But it is also possible to rely on the model’s outputs (its “behaviors”) to achieve the same task. Consider again the classification decision boundaries above. We can simply measure the spatial frequency (the “squiggliness” of the boundary) as another proxy towards complexity.

Here, then, are three possible criteria for complexity:

  1. Number of parameters
  2. Size of parameters
  3. Spatial frequency of decision manifold

Thus, operationalizing the definition of “complexity” is surprisingly challenging.

Mechanized Parsimony

Recall our original notion of the performance-complexity quadrant. By defining our loss function exclusively in terms of the residual error, gradient descent learns to prefer accurate models (to “move upward”). Is there a way to induce leftward movement as well?

To have gradient descent respond to both criteria, we can embed them into the loss function. One simple way to accomplish this: addition.

This technique is an example of regularization.

Depending on the application, sometimes the errors are much larger than the parameters or vice versa. In order to assure the right balance between these terms, people usually add a hyperparameter to the regularized loss function J = \|e\|_2 + \lambda \|\theta\|_2

A Geometric Interpretation

Recall Einstein’s insight that gravity is curvature of spacetime. You can envision such curvature as a ball pulling on a sheet. Here is the gravity well of bodies of the solar system:

Every mass pulls on every other mass! Despite the appearance of the above, Earth does “pull on” Saturn.

The unregularized cost function we saw last time creates a convex loss function, which we’ll interpret as a gravity well centered around parameters of best fit. If we replace J with a function that only penalizes complexity, a corresponding gravity well appears, centered around parameters of zero size.

If we keep both terms, we see the loss surface now has two enmeshed gravity wells. If scaled appropriately, the “zero attractor” will pull the most performant solution (here \theta = (8,7) towards a not-much-worse yet simpler model \theta = (4,5).

More on L1 vs L2

Previously, I introduced the L1 norm aka mean average error MAE

\|x\|_1 = (\sum_{i=1}^{n} \lvert x_i\rvert^1)^1

Another loss function is the L2 norm aka root mean squared error RMSE

\|x\|_2 = (\sum_{i=1}^{n} \lvert x_i\rvert^2)^{1/2}

The L1 and L2 norms respectively correspond to Euclidean vs Manhattan distance (roughly, plane vs car travel):

One useful way to view norms is by their isosurface. If you can travel in any direction for a finite amount of time, the isosurface is the frontier you might sketch.

The L2 isosurface is a circle. The L1 isosurface is a diamond.

  • If you don’t change direction, you can travel the “normal” L2 distance.
  • If you do change direction, your travel becomes inefficient (since “diagonal” travel along the hypotenuse is forbidden).

The Lp Norm as Superellipse

Consider again the formulae for the L1 and L2 norm. We can generalize these as special cases of the Lp norm:

\|x\|_p = (\sum_{i=1}^{n} \lvert x_i\rvert^p)^{1/p}

Here are isosurfaces of six exemplars of this norm family:

On inspection, the above image looks like a square that’s inflating with increasing p. In fact, the Lp norm generates a superellipse.

As an aside, note that the boundaries of the Lp norm family operationalize complexity rather “intuitively”. For the L0 norm, complexity is the number of non-zero parameters. For the Linf norm, complexity is the size of the largest parameter.

Lasso vs Ridge Regression

Why the detour into geometry?

Well, so far, we’ve expressed regularization as J  = \|e\|_p + \lambda \| \theta \|_p But most engineers choose between the L1 and L2 norms. The L1 norm is not convex (bowl shaped), which tends to make gradient descent more difficult. But the L1 norm is also more robust to outliers, and has other benefits.

Here are two options for the residual norm:

  • \|e\|_2: sensitive to outliers, but a stable solution
  • \|e\|_1: robust to outlier, but an unstable solution

The instability of \|e\|_1 tends to be particularly thorny in practice, so $latex \|e\|_2$ is almost always chosen.

That leaves us with two remaining choices:

  • Ridge Regression: J =  \|e\|_2  + \|\theta\|_2 : computationally inefficient, but sparse output.
  • Lasso Regression: J =  \|e\|_2  + \|\theta\|_1: computationally efficient, non-sparse output

What does sparse output mean? For a given model type, say y = ax^3 + bx^2 + cx + d with parameters (a, b, c, d), Ridge regression might output parameters (3, 0.5, 7.8, -0.4) whereas Lasso might give me (3, 0, 7.8, 0) . In effect, Ridge regression is performing feature selection: locating parameters that can be safely removed. Why should this be?

Geometry to the rescue!

In ridge regression, both gravity wells have convex isosurfaces. Their compromises are reached anywhere in the loss surface. In lasso regression, the diamond-shaped complexity isosurface tends to push compromises towards axes where \theta_i = 0. (In higher dimensions, the same geometry applies).

Both Ridge and Lasso regression are used in practice. The details of your application should influence your choice. I’ll also note in passing that “compromise algorithms” like Elastic Net exist, that tries to capture the best parts of either algorithm.


I hope you enjoyed this whirlwind tour of regularization. For a more detailed look at ridge vs lasso, I recommend reading this.

Until next time.


Intro to Gradient Descent

Part Of: Machine Learning sequence
Content Summary: 800 words, 8 min read

Parameter Space vs Feature Space

Let’s recall the equation of a line.

y = mx + b where m is the slope, and b is the y-intercept.

The equation of a line is a function that maps from inputs (x) to outputs (y). Internal to that model (the knobs inside the box) reside parameters like m and b that mold how the function works.

Any model k can be uniquely described by its parameters \{ m_k, b_k \}. Just as we can plot data in a data space using Cartesian coordinates (x, y), we can plot models in a parameter space using coordinates (m,b).

As we traverse parameter space, we can view the corresponding models in data space.


As we proceed, it is very important to hold these concepts in mind.

Loss Functions

Consider the following two regression models. Which one is better?

GradientDescent_ Two Regression Models (1)

The answer comes easily: Model A. But as the data volume increases, choosing between models can become quite difficult. Is there a way to automate such comparisons?

Put another way, your judgment about model goodness is an intuition manufactured in your brain. Algorithms don’t have access to your intuitions. We need a loss function that translates intuitions into numbers.

Regression models are functions of the form \hat{y} = f(\textbf{x}) where x is the vector of features (predictors) used to generate the label (prediction). We can define error as \hat{y} - y. In fact, we typically reserve the word error for test data, and residual for train data. Here are the residuals for our two regression models:

GradientDescent_ Loss Function Residuals

The larger the residuals, the worse the model. Let’s use the residual vector to define a loss function. To do this, in the language of database theory, we need to aggregate the column down to a scalar. In the language of linear algebra, we need to compute the length of the vector.

Everyone agrees that residuals matter, when deriving the loss function. Not everyone agrees how to translate the residual vector into a single number. Let me add a couple examples to motivate:

Sum together all prediction errors.

But then an deeply flawed model with residual vector  [ -30, 30, -30, 30] earns the same score as a “perfect model” [0, 0, 0, 0]. The morale: positive and negative errors should not cancel each other out.

Sum together the magnitude of the prediction errors.

But then a larger dataset costs more than a small one. A good model against a large dataset with residual vector [ 1, -1, 1, -1, 1, -1 ] earns the same score as a poor model against small data [ 3, -3 ]. The morale: cost functions should be invariant to data volume.

Find the average magnitude of the prediction errors.

This loss function suffers from fewer bugs. It even has a name: Mean Absolute Error (MAE), also known as the L1-norm.

There are many other valid ways of defining a loss function that we will explore later. I  just used the L1-norm to motivate the topic.

Grading Parameter Space

Let’s return to the question of evaluating model performance. For the following five models, we intuitively judge their performance as steadily worsening:

dataspace_loss (2)

With loss functions, we convert these intuitive judgments into numbers. Let’s include these loss numbers in label space, and encode them as color.

dualspace_loss (2)

Still with me? Something important is going on here.

We have examined the loss of five models. What happens if we evaluate two hundred different models? One thousand? A million? With enough samples, we can gain a high-resolution view of the loss surface. This loss surface can be expressed with loss as color, or loss as height along the z-axis.

sampling_loss_surface (2)

In this case, the loss surface is convex: it is in the shape of a bowl.

Navigating the Loss Surface

The notion of a loss surface takes a while to digest. But it is worth the effort. The loss surface is the reason machine learning is possible.

By looking at a loss surface, you can visually identify the global minimum: the model instantiation with the least amount of loss. In our example above, that is (7,-14) encodes the model y = 7x - 14 with the smallest loss L(\theta) = 2.

Unfortunately, computing the loss surface is computationally intractable. It takes too long to calculate the loss of every possible model. How can we do better?

  1. Start with an arbitrary model.
  2. Figure out how to improve it.
  3. Repeat.

One useful metaphor for this kind of algorithm is a flashlight in the dark. We can’t see the entire landscape, but our flashlight provides information about our immediate surroundings.


But what local information can we use to decide where to move in parameter space? Simple: the gradient (i.e., the slope)! If we move downhill in this bowl-life surface, we will come to a rest at the set of best parameters.

A ball rolling down a hill.

This is how gradient descent works, in both spaces:

This is how prediction machines learn from data.

Until next time.

Counting: The Fourfold Way

Part Of: Statistics, Algebra sequences
Content Summary: 1100 words, 11 min read

The Fundamental Principle of Counting

We often care to count the number of possible outcomes for multiple events.

Example 1. Consider purchasing a lunch with the following components:

  • Burger b \in \{ Chicken, Beef \}
  • Side s \in \{ Fries, Chips \}
  • Drink d \in \{ Fanta, Coke, Sprite \}

How many lunch outcomes are possible?

Three approaches to counting suggest themselves. We might make a list. But this process can be error prone. Other representations are more systematic: we can build a tree, or imagine a (hyper)-volume. Each strategy converges on the same answer: 12 possible lunches.

Permutation_ Trees of Events (1)

Can we generalize? Yes, with the help of the fundamental principle of counting (aka the rule of multiplication). For any event A with a possible outcomes, and another event B with b possible outcomes, the number of possible outcomes for composite event A \cup B is a*b

How does counting work for a repeating event? For an event with n possibilities occurs k times, there are k^n possible outcomes.

Example 2. How many numbers can be represented by a byte (8 bits)?

Each bit has two possible assignments: zero or one. For eight such “bit events”, we have 2^8 = 256 possible outcomes.


Example 3. A trifecta bet guesses which horse will place first, which second, and which third.  How many such bets are possible in a 9-horse race?

Each medal has nine possible assignments. For three such “medal events”, we have 9^3 = 729 possible outcomes.

This answer is completely wrong. To understand why, consider the lottery machine:


  • In Example 2, a single value (e.g., 0) can freely be assigned to multiple bits. Every time you draw a bit from the “possibility machine”, it is replaced when the next bit is drawn. Sampling with replacement means that each event is exactly the same.
  • In Example 3, a single horse (e.g., Secretariat) cannot be assigned multiple medals. Every time you draw a horse from the “possibility machine”, it cannot be drawn for subsequent events. Sampling without replacement means that each event has diminishing numbers of possibilities.

Definition 4. A permutation is a list of outcomes drawn without replacement.

For the trifecta bet, how many permutations exist? Well, 9 different horses that earn the gold. Given that one horse won the gold, 8 different horses that can earn the silver. Then there are 7 different horses that can earn bronze. Thus, there are 9 \times 8 \times 7 = 504 possible trifecta bets.

Permutation_ 9 perm 3

Similar to how exponentiation is defined as repeated multiplication, a factorial is defined as slowly-decrementing multiplication.

9! = 9 \times 8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1

But we only want 9 \times 8 \times 7. How can we get rid of the other terms? By division, of course!

9 \times 8 \times 7 = \dfrac{9 \times 8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1}{6 \times 5 \times 4 \times 3 \times 2 \times 1} = \dfrac{9!}{6!}

Why did we use the number 6? Because if three of our nine horses place, six do not place. So a more general way to write this equation,


More generally, if you have n items and want to find the number of ways k items can be ordered:

Equation 5: Permutation. P(n, k) = \dfrac{n!}{(n-k)!}


In contrast to permutations, for combinations, order doesn’t matter. A permutation is a list, a combination is a set.

A boxed trifecta bet requires correctly which three horses will place first, second and third (order doesn’t matter). A trifecta bet selects a permutations; a boxed trifecta bet selects a combination.

Imagine only four horses in the race. That’s P(4,3) = \dfrac{4!}{4-3!} = 24 possible trifecta bets. But how many boxed trifecta bets are possible?

Combinations treat duplicates as a single entry. For example, abc and acb are equivalent for a boxed trifecta bet. We can identify four groups, with six equivalent permutations each:

Permutation_ 4 choose 3

In general, how many winner duplicates exist? How many ways can we shuffle k winners? Well, if you have k winners and are wondering how many permutations exist for that entire set… that’s P(k,k)!

Equation 6: Combination. C(n, k) = \dfrac{P(n,k)}{P(k,k)} = \dfrac{P(n,k)}{k!} = \dfrac{n!}{(n-k)! k!}

For an example of combinations used to solve a real problem, I recommend this post.

The Fourth Way: Stars and Bars

Example 7. You have k=3 cookies a, b, c, d to give to n=4 kids. How many possible ways are there to do so?

In the case of medals and horses, we claimed four solutions: \{ a, b, c\}\{ a, b, d\}\{ a, c, d\}, and \{ b, c, d\}. But there is an important difference: horse-less medals are impossible, but cookie-less children are not! So we need to account for situations like \{ a, a, a\}, with one child getting all of the cookies.

We can use the traditional bins-as-containers metaphor to visualize outcomes (top row). Or we can instead visualize bin boundaries (bottom row). This visualization strategy is called stars and bars.

Combinatorics_ Stars and Bars (2)

How many kid-cookie outcomes are possible? The answer becomes apparent only if we use stars and bars  (bottom row). Every possible shuffling of the stars in those squares produces a valid event. That is, \binom{6}{3}.

How many objects are possible in general? There are n stars (kids). Since bars represent bin boundaries, there are n-1 bars. Thus:

Equation 8: Multi-Combination. C(n, k) = \binom{n+(k-1)}{k}

The Fourfold Way

Every example we have seen differentiates possibilities and outcomes. We will use the metaphor of balls for outcomes (something concrete) and bins for possibilities (something to “clothe” outcomes).

Combinatorics_ Possibilities vs Outcomes

Equation 9. An event is a function that maps outcomes to possibilities:

Event : Outcomes \rightarrow Possibilities

Combinatorics_ Events as Functions (2)

This function can be compactly represented as bbd.

Functions require every element of the domain to map to the codomain. Event functions require no unrealized outcomes. That is: every outcomes manifests a possibility. Every ball is given a bin.

We saw previously that combinations and permutations don’t allow events like aab and ccc. A single horse cannot win multiple medals. Multiply-realized possibilities are not allowed.

Recall the definitions of injective, surjective and bijective functions. This requirement is the injective property. Sampling without replacement is the same thing as injectivity.

Tuple, permutation, combination, and multi-combination. This is the fourfold way.  The Way can be made more general by counting situations where the possibilities are unlabeled, and the event function meets the surjection property. But for details on the more complete twelvefold way, I recommend this post.

Combinatorics_ Fourfold Way

Towards a Rosetta Stone

Now, consider all possible functions for 4 bins and 3 balls:

Combinatorics_ Rosetta Stone (4, 3)

What do the equations above have to do with this shape? Well, each way of counting corresponds with a different subset of this broader shape:

Combinatorics_ Shape of The Way (1)

I leave it to the interested reader to ponder, how such a jagged shape can be represented by these four relatively clean formulae.

Until next time.

Related Resources

An Introduction to Language Models

Part Of: Language sequence
Content Summary: 1500 words, 15 min read

Why Language Models?

In the English language, ‘e’ appears more frequently than ‘z’. Similarly,  “the” occurs more frequently than “octopus”. By examining large volumes of text, we can learn the probability distributions of characters and words.

Language Models_ Letter and Word Frequency

Roughly speaking, statistical structure is distance from maximal entropy. The fact that the above distributions are non-uniform means that English is internally recoverable: if noise corrupts part of a message, the surrounding can be used to recover the original signal. Statistical structure is also used to reverse engineer secret codes such as the Roman cipher.

We can illustrate the predictability of English by generating text based on the above probability distributions. As you factor in more of the surrounding context, the utterances begin to sound less alien, and more like natural language.

Language Model_ Structure of English

A language model exploits the statistical structure of a language to express the following:

  • Assign a probability to a sentence P(w_1, w_2, w_3, \ldots w_N)
  • Assign probability of an upcoming word P(w_4 \mid w_1, w_2, w_3)

Language models are particularly useful in language perception, because they can help interpret ambiguous utterances. Three such applications might be,

  • Machine Translation: P(\text{high winds tonight}) > P(\text{large winds tonight})
  • Spelling correction: P(\text{fifteen minutes from}) > P(\text{fifteen minuets from})
  • Speech Recognition: P(\text{I saw a van}) > P(\text{eyes awe of an})

Language models can also aid in language production. One example of this is autocomplete-based typing assistants, commonly displayed within text messaging applications. 

Towards N-Grams

A sentence is a sequence of words \textbf{w} = (w_1, w_2, \ldots, w_3). To model the joint probability over this sequence, we use the chain rule:

p(\text{this is the house})

= p(\text{this})p(\text{is}\mid\text{this})p(\text{the}\mid\text{this is})p(\text{house}\mid\text{this is the})

As the number of words grows, the size of our conditional probability tables (CPTs) quickly becomes intractable. What is to be done? Well, recall the Markov assumption we introduced in Markov chains.


The Markov assumption constrains the size of our CPTs. However, sometimes we want to condition on more (or less!) than just one previous word. Let v denote how many variables we admit in our context. A variable order Markov model (VOM) allows v elements in its context: p(s_{t+1} | s_{t-v}, \ldots, s_{t}). Then the size of our CPT is n=v+1, because we must take our original variable into account. Thus an N-gram is defined as a v-order Markov model. By far, the most common choices are trigrams, bigrams, and unigrams:

Language Models_ Ngram comparison (1)

We have already discussed Markov Decision Processes, used in reinforcement learning applications.  We haven’t yet discussed MRFs and HMMs. VOMs represent a fourth extension: the formalization of N-grams. Hopefully you are starting to appreciate the  richness of this “formalism family”. 🙂

Language Model_ Markov Formalisms (1)

Estimation and Generation

How can we estimate these probabilities? By counting!


Let’s consider a simple bigram language model. Imagine training on this corpus:

This is the cheese.

That lay in the house that Alice built.

Suppose our trained LM encounters the new sentence “this is the house”. It estimates its probability as:

p(\text{this is the house})

= p(\text{this})p(\text{is} \mid \text{this})p(\text{the} \mid \text{is})p(\text{house} \mid \text{the}) 

= \dfrac{1}{12} * 1 * 1 * \dfrac{1}{2} = \dfrac{1}{24}

How many problems do you see with this model? Let me discuss two.

First, we have estimated that p(\text{this}) = \dfrac{1}{24}. And it is true that “this” occurs only once in our toy corpus above. But out of two sentences, “this” leads half of them. We can express this fact by adding a special START token into our vocabulary.

Second, recall what happens when language models generate speech. Once they begin a sentence, they are unable to end it! Adding a new END token will allow our model the terminate a sentence, and begin a new one.

With these new tokens in hand, we update our products as follows:

Language Models_ Sentence Estimation (1)

A couple other “bug fixes” I’ll mention in passing:

  • Out-of-vocabulary words are given zero probability. It helps to add an unknown  (UNK) pseudoword and assign it some probability mass.
  • LMs prefer very short sentences (sequential multiplication is monotonic decreasing). We can address this e.g., normalizing by sentence length.


In the last sentence in the image above, we estimate p(END|house) = 0, because we have no instances of this two-word sequence in our toy corpus. But this causes our language model to fail catastrophically: the sentence is deemed impossible (0% probability).

This problem of zero probability increases as we increase the complexity of our N-grams. Trigram models are more accurate than bigrams, but produce more p=0 events. You’ll notice echoes of the bias-variance (accuracy-generalization) tradeoff.

How can we remove zero counts? Why not add one to every word? Of course, we’d then need to increase the size of our denominator, to ensure the probabilities still sum to one. This is Laplace smoothing

Language Model_ Laplace Smoothing

In a later post, we will explore how (in a Bayesian framework) such smoothing algorithms can be interpreted as a form of regularization (MAP vs MLE).

Due to its simplicity, Laplace smoothing is well-known  But several algorithms achieve better performance.  How do they approach smoothing?

Recall that a zero count event in an N-gram is not likely to occur in (N-1)-gram model. For example, it is very possible that the phrase “dancing were thought” hasn’t been seen before. 

Language Model_ Backoff Smoothing

While a trigram model may balk at the above sentence, we can fall back on the bigram and/or unigram models. This technique underlies the Stupid Backoff algorithm.

As another variant on this theme, some smoothing algorithms train multiple N-grams, and essentially use interpolation as an ensembling method. Such models include Good-Turing and Kneser-Ney algorithms.

Beam Search

We have so far seen examples of language perception, which assigns probabilities to text. Let us consider language perception, which generates text from the probabilistic model. Consider machine translation. For a French sentence \textbf{x}, we want to produce the English sentence \textbf{y} such that y^* = \text{argmax } p(y\mid x).  

This seemingly innocent expression conceals a truly monstrous search space. Deterministic search has us examine every possible English sentence. For a vocabulary size V, there are V^2 possible two-word sentences. For sentences of length n, our time complexity of our brute force algorithm is O(V^n).

Since deterministic search is so costly, we might consider greedy search instead. Consider an example French sentence \textbf{x} “Jane visite l’Afrique en Septembre”. Three candidate translations might be,

  • y^A: Jane is visiting Africa in September
  • y^B: Jane is going to Africa in September
  • y^C: In September, Jane went to Africa

Of these, p(y^A|x) is the best (most probable) translation. We would like greedy search to recover it.

Greedy search generates the English translation, one word at a time. If “Jane” is the most probable first word \text{argmax } p(w_1 \mid x), then the next word generated is \text{argmax } p(w_2 \mid \text{Jane}, x). However, it is not difficult to contemplate p(\text{going}\mid\text{Jane is}) > p(\text{visiting}\mid\text{Jane is}), since the word “going” is used so much more frequently in everyday conversation. These problems of local optima happen surprisingly often.

The deterministic search space is too large, and greedy search is too confining. Let’s look for a common ground.

Beam search resembles greedy search in that it generates words sequentially. Whereas greedy search only drills one such path in the search tree, beam search drills a finite number of paths. Consider the following example with beamwidth b=3


As you can see, beam search elects to explore y^A as a “second rate” translation candidate despite y^B initially receiving the most probability mass. Only later in the sentence does the language model discover the virtues of the y^A translation. 🙂

Strengths and Weaknesses

Language models have three very significant weaknesses.

First, language models are blind to syntax. They don’t even have a concept of nouns vs. verbs!  You have to look elsewhere to find representations of pretty much any latent structure discovered by linguistic and psycholinguistic research.

Second, language models are blind to semantics and pragmatics. This is particularly evident in the case of language production: try having your SMS autocomplete write out an entire sentence for you. In the real world, communication is more constrained: we choose the most likely word given the semantic content we wish to express right now.

Third, the Markov assumption is problematic due to long-distance dependencies. Compare the phrase “dog runs” vs “dogs run”. Clearly, the verb suffix depends on the noun suffix (and vice versa). Trigram models are able to capture this dependency. However, if you center-embed prepositional phrases, e.g., “dog/s that live on my street and bark incessantly at night run/s”, N-grams fail to capture this dependency.

Despite these limitations, language models “just work” in a surprising diversity of applications. These models are particularly relevant today because it turns out that Deep Learning sequence models like LSTMs share much in common with VOMs. But that is a story we shall have to take up next time.

Until then.


The Symmetric Group

Part Of: Algebra sequence
Followup To: An Introduction to Geometric Group Theory
Content Summary: 1800 words, 18 min read

On Permutations

In the last few posts, we have discussed algebraic structures whose sets contain objects (e.g., numbers). Now, let’s consider structures over a set of functions, whose binary operation is function composition.

Definition 1. Consider two functions f and g. We will denote function composition of g(f(x)) as f \bullet g. We will use this notation instead of the more common g \circ f.  Both represent the idea “apply f, then g“. 

Consider G = (\left\{ f(x) = 2x, g(x) = x+1, h(x) = 2x + 2, i(x) = 2x+1 \right\}, \bullet)

Is this a group? Let’s check closure:

  • g \bullet f = 2(x+1) = h \in G
  • f \bullet g = (2x)+1 = i \in G
  • f \bullet h = 2(2x)+2 \not\in G

Closure is violated. G isn’t even a magma! Adding j(x) = 4x+2 to the underlying set exacerbates the problem: then both f \bullet j \not\in G and g \bullet j \not\in G.

So it is hard to establish closure under function composition. Can it be done?

Yes. Composition exhibits closure on sets of permutation functions. Recall that a permutation is simply a bijection: it re-arrange a collection of things. For example, here are the six possible bijections over a set of three elements.

Symmetry Group_ Permutation Options

Definition 2. The symmetric group S_n denotes composition over a set of all bijections (permutations) over some set of n objects. The symmetric group is then of order n!.

The underlying set of S_{3} is the set of all permutations over a 3-element set. It is of order 3! = 3 \times 2 \times 1 = 6.

This graphical representation of permutations is rather unwieldy. Let’s switch to a different notation system!

Notation 3: Two Line Notation. We can use two lines to denote each permutation mapping \phi. The top row represents the original elements x, the bottom represents where each element has been relocated \phi(x).

Symmetric Group_ Two Line Notation

Two line notation is sometimes represented as an array, with the top row as matrix row, and bottom denoting matrix column. Then the identity matrix represents the identity permutation.

Definition 4. A cycle is a sequence of morphisms that forms a closed loop. An n-cycle is a cycle of length n. A 1-cycle does nothing. A 2-cycle is given the special name transposition. S_3 has two permutations with 3-cycles: can you find them?

Theorem 5. Cycle Decomposition Theorem. Every permutation can be decomposed into disjoint cycles. Put differently, a node cannot participate in more than one cycle. If it did, its parent cycles would merge.

Notation 6: Cycle Notation. Since permutations always decompose into cycles, we can represent them as (A_1\ A_2\ \ldots\ A_n), pronounced “A_1 goes to A_2 goes to …”.

Symmetric Group_ Cycle Notation (1)

Cycle starting element does not matter: (A\ C\ B) = (C\ B\ A) = (B\ A\ C).

The Cycle Algorithm

It is difficult to tell visually the outcome of permutation composition. Let’s design an algorithm to do it for us!

Algorithm 7: Cycle Algorithm. To compose two permutation functions a(S) and b(S), take each element s \in S and follow its arrows until you find the set of disjoint cycles. More formally, compose these functions x times until you get (a \bullet b)^x(s) = s.

Here’s a simple example from S_{3} = ( \left\{ 0, 1, 2, 3, 4, 5 \right\}, \bullet ).

Symmetric Group_ Cycle Algorithm S3 Ex (2)

Make sense? Good! Let’s try a more complicated example from S_4.

Symmetric Group_ Cycle Algorithm S4 Ex (1)

A couple observations are in order.

  • 1-cycles (e.g., (A)) can be omitted: their inclusion does not affect algorithm results.
  • Disjoint cycles commute: (B\ D),(A\ C) = (A\ C),(B\ D). Contrast this with composition, which does not commute (B\ D) \bullet (A\ C) \neq (A\ C) \bullet (B\ D).

Now, let’s return to S_{3} for a moment, with its set of six permutation functions. Is this group closed? We can just check every possible composition:

Symmetric Group_ Permutation Composition

From the Cayley table on the right, we see immediately that S_{3} is closed (no new colors) and non-Abelian (not diagonal-symmetric).

But there is something much more interesting in this table. You have seen it before. Remember the dihedral group D_{3}? It is isomorphic to S_{3}!

Symmetric Group_ D3 S3 Isomorphism (3)

If you go back to the original permutation pictures, this begins to make sense. Permutations 1 and 2 resemble rotations/cycles; 3, 4, and 5 perform reflections/flips.

Generators & Presentations

In Theorem 7, we learned that permutations decompose into cycles. Let’s dig deeper into this idea. 

Theorem 8. Every n-cycle can be decomposed into some combination of 2-cycles. In other words, cycles are built from transpositions.

The group S_{3} = \left\{ 0, 1, 2, 3, 4, 5 \right\}, \bullet) has three transpositions 3 = (A\ B), 4 = (B\ C), and 5 = (A\ C).

Transpositions are important because they are generators: every permutation can be generated by them. For example, 1 = 3 \bullet 4. In fact, we can lay claim to an even stronger fact:

Theorem 9. Every permutation can be generated by adjacent transpositions. Every permutation S_{n} = \langle (1 2), (2 3), \ldots, (n-1\ n) \rangle.

By the isomorphism S_3 \cong D_3 , we can generate our “dihedral-looking” Cayley graph by selecting generators 1=(A\ B\ C) and 3 = (B\ C).

But we can use Theorem 9 to produce another, equally valid Cayley diagram. There are two adjacent transpositions in S_{3}: 3 and 4. All other permutations can be written in terms of these two generators:

  • 0 = 3 \bullet 3.
  • 1 = 3 \bullet 4.
  • 2 = 4 \bullet 3.
  • 5 = 3 \bullet 4 \bullet 3

This allows us to generate a transposition-based Cayley diagram. Here are the dihedral and transposition Cayley diagrams, side by side:

Symmetric Group_ Cayley Diagram (4)

We can confirm the validity of the transposition diagram by returning to our multiplication table: 1 \bullet 3 = 5 means a green arrow 1 \rightarrow 5.

Note that the transposition diagram is not equivalent cyclic group C_6, because arrows in the latter are monochrome and unidirectional.

We’re not quite done! We can also rename our set elements to employ generator-dependent names, by “moving clockwise”:

Symmetric Group_ Abstract Cayley Diagram

We could just as easily have “moved counterclockwise”, with names like 3 \mapsto r, 1 \mapsto rl. And we can confirm by inspection that, in fact, rl = lrlr etc.

Using the original clockwise notation, one presentation of S_3 becomes:

S_3 = \langle l, r \mid r^2 = e, l^2 = e, (lr)^3 = e \rangle

Towards Alternating Groups

Any given permutation can be written as a product of permutation. Consider, for example, the above equalities

  • 1 = rl = lrlr = (lr)^3rl. These have 2, 4, and 8 permutations, respectively. 
  • 5 = lrl = lrl^3 = l^3r^3l^3. These have 3, 5, and 9 permutations, respectively.

Did you notice any patterns in the above lists? All expressions for 1 require an even number of transpositions, and all expressions of 5 require an odd number. In other words, the parity (evenness or oddness) of a given permutation doesn’t seem to be changing. In fact, this observation generalizes:

Theorem 10. For any given permutation, the parity of its transpositions is unique.

Thus, we can classify permutations by their parity. Let’s do this for S_3:

  • 0, 1, 2 are even permutations.
  • 3, 4, 5 of odd permutations.

Theorem 11. Exactly half of S_n are even permutations, and they form a group called the alternating group A_n. Just as \lvert S_n \rvert = n!, A_n has \dfrac{n!}{2} elements. 

Why don’t odd permutations form a group? For one thing, it doesn’t contain the identity permutation, which is always even.

Let’s examine A^3 = (\left\{ 0, 1, 2 \right\}, \bullet ) in more detail. Does it remind you of anything?

It is isomorphic to the cyclic group C_3 ..!

Symmetric Group_ C3 A3 Isomorphism (1)

We have so far identified the following isomorphisms: S_3 \cong D_3 and A_3 \cong C_3. Is it also true that e.g., S_4 \cong D_4 and A_4 \cong C_4?

No! Recall that the \lvert D_n \rvert=2n and \lvert C_n \rvert = nOnly n \neq 3, these sets are not even potentially isomorphic.  For example:

  • \lvert D_4 \rvert=2 \times 4 = 8 \neq 24 = 4! = \lvert S_4 \rvert.
  • \lvert C_4 \rvert= 4 \neq 12 = \frac{24}{2} = \lvert A_4 \rvert.

For these larger values of n, the symmetric group is much larger than dihedral and cyclic groups.


Why do symmetric & alternating groups matter? Let me give two answers.

Perhaps you have seen the quadratic equation, the generic solution to quadratic polynomials ax^2 + bx + c = 0.

x = \dfrac{-b \pm \sqrt{b^2 - 4ac}}{2a}

Analogous formulae exist for cubic (degree-3) and quartic (degree-4) polynomials. 18th century mathematics was consumed by the theory of equations: mathematicians attempting to solve quintic polynomials (degree-5). Ultimately, this quest proved to be misguided: there is no general solution to quintic polynomials.

Why should degree-5 polynomials admit no solution? As we will see when we get to Galois Theory, it has to do with the properties of symmetric group S_5.

A second reason to pay attention to symmetric groups comes from the classification theorem of finite groups. Mathematicians have spent decades exploring the entire universe of finite groups, finding arcane creatures such as the monster group, which may or may not explain features of quantum gravity.

One way to think about group space is by the following periodic table:

Symmetric Group_ Periodic Table

Image courtesy of Samuel Prime

Crucially, in this diverse landscape, the symmetric group plays a unique role:

Theorem 12: Cayley’s Theorem. Every finite group is a subgroup of the subgroup S_n, for some sufficiently large n.

For historical reasons, subgroups of the symmetric group are usually called permutation groups.

Until next time.

Wrapping Up


  • The symmetric group S_n is set of all bijections (permutations) over some set of n objects, closed under function composition.
  • Permutations can be decomposed into disjoint cycles: cycle notation uses this fact to provide an algorithm to solve for arbitrary compositions.
  • All permutations (and hence, the symmetric group) can be generated by adjacent transpositions. This allows us to construct a presentation of the symmetric group.
  • Permutations have unique parity: thus we can classify permutations as even or odd. The group of even presentations is called the alternating group A_n.
  • It can be shown that S_3 \cong D_3 and A_3 cong C_3. However, for larger n, the symmetric and alternating group are much larger than cyclic and dihedral groups.

The best way to learn math is through practice! If you want to internalize this material, I encourage you to work out for yourself the Cayley table & Cayley diagram for S_4. 🙂

Related Resources

For a more traditional approach to the subject, these Harvard lectures are a good resource.

An Introduction to Geometric Group Theory

Part Of: Algebra sequence
Followup To: An Introduction to Abstract Algebra
Content Summary: 1500 words, 15 min read

An Example Using Modular Addition

Last time, we saw algebraic structures whose underlying sets were infinitely large (e.g., the real numbers \mathbb{R}). Are finite groups possible?

Consider the structure ( \left\{ 0, 1, 2, 3 \right\}, +). Is it a group? No, it isn’t even a magma: 2 + 3 \not\in \left\{ 0, 1, 2, 3 \right\}! Is there a different operation that would produce closure?

Modular arithmetic is the mathematics of clocks. Clocks “loop around” after 12 hours. We can use modulo-4 arithmetic, or +_{4}, on  \left\{ 0, 1, 2, 3 \right\}. For example, 2 +_{4} 3 = 1.

To check for closure, we need to add all pairs of numbers together, and verify that each sum has not left the original set. This is possible with the help of a Cayley table. You may remember these as elementary school multiplication tables 😛 .

Geometrical Group Theory_ C4 Cayley Table Modular Arithmetic

By inspecting this table, we can classify Z_4 = ( \left\{ 0, 1, 2, 3 \right\} ), +_{4}).

  1. Does it have closure? Yes. Every element in the table is a member of the original set.
  2. Does it have associativity? Yes. (This cannot be determined by the table alone, but is true on inspection).
  3. Does it have identity? Yes. The rows and columns associated with 0 express all elements of the set.
  4. Does it have inverse? Yes. The identity element appears in every row and every column.
  5. Does it have commutativity? Yes. The table is symmetric about the diagonal.

Therefore, Z_4 is an abelian group.

An Example Using Roots of Unity

Definition 1. A group is said to be order n if its underlying set has cardinality n.

So Z_4 is order 4. What other order-4 structures exist?

Consider the equation i^4 = -1. Its solutions, or roots, is the set \left\{ 1, i, -1, -i \right\}. This set is called the fourth roots of unity.

So what is the Cayley table of this set under multiplication R_{4} = ( \left\{ 1, i, -1, -i \right\}, *)? In the following table, recall that i = \sqrt{-1}, thus i^2 = (sqrt{-1})^2 = -1.

Geometric Group Theory_ C4 Cayley Table Roots of Unity (1)

Something funny is going on. This table (and its colors) are patterned identically to Z_4! Recall that a binary operation is just a function f : A \times A \rightarrow A. Let’s compare the function maps of our two groups:

Cyclic Groups_ Binary Operation as Function (2)

These two groups for structurally identical: two sides of the same coin. In other words, they are isomorphic, we write Z_{4} \cong R_{4}. Let us call this single structure C_4.

But why are these examples of modular arithmetic and complex numbers equivalent?

One answer involves an appeal to rotational symmetry. Modular arithmetic is the mathematics of clocks: the hands of the clock rotating around in a circle. Likewise, if the reals are a number line, complex numbers are most naturally viewed as rotation on a number plane.

This rotation interpretation is not an accident. It helps use more easily spot other instances of C_4. Consider, for instance, the following shape.

Geometric Group Theory_ Rotational Symmetry Object

On this shape, the group of rotations that produce symmetry is W_4 = (\left\{ 0, 90, 180, 270 \right\}, \text{rotate}). Inspection reveals that this, too, is isomorphic to C_{4}!

Towards The Presentation Formalism

We describe C_3 as a cyclic group, for reasons that will become clear later. 

Theorem 2. For every cyclic group C_n, there exists some generator g in its underlying set such that every other set element can be constructed by that generator.

Definition 3. When a generator has been identified, we can express a group’s underlying set with generator-dependent names. Two notation are commonly used in practice:

  1. In multiplicative notation, the elements are renamed \left\{ e, r, r^2, r^3 \right\}, where r is any generator.
  2. Similarly, in additive notation, the elements become \left\{ e, r, 2r, 3r \right\}.

Geometric Group Theory_ Multiplicative and Additive Notation (1)

These two notation styles are interchangeable, and a matter of taste. In my experience, most mathematicians prefer multiplicative notation.

What generators exist in C_4? Let’s look at our three instantiations of this group:

  • In modular arithmetic, you can recreate all numbers by 0 + 1 + 1 + \ldots. But you can also recreate them by 0 + 3 + 3 + \ldots.
  • In complex numbers, you can visit all numbers by multiplying by i, or multiplying by -i. Only -1 fails to be a generator.
  • In our rotation symmetry shape, two generators exist: clockwise 90 \circ rotation, and counterclockwise 90 \circ rotation.

For now, let’s rename all elements of C_{4} to be C_4 = (\left\{ 0, 1, 2, 3 \right\}, +)  = \langle 1 \rangle = \langle 3 \rangle.

Okay. But why is 2 not a generator in C_4?

Theorem 4. For finite groups of order n, each generator must be coprime to n. That is, their greatest common divisor \text{gcd}(g, n) = 1.

  • 2 not a generator in C_4 because it is a divisor of | \left\{ 0, 1, 2, 3 \right\} | = 4.
  • What are the generators in C_5? All non-identity elements: C_{5} = \langle 1 \rangle = \langle 2 \rangle =  \langle 3 \rangle =  \langle 4 \rangle.
  • What are the generators in C_6? Only 1 and 5: C_{5} = \langle 1 \rangle = \langle 5 \rangle.

We just spent a lot of words discussing generators. But why do they matter?

Generators are useful because they allow us to discover the “essence” of a group. For example, the Rubik’s cube has 5.19 \times 10^{20} configurations. It would take a long time just writing down such a group. But it has only six generators (one for a 90 \circ rotation along each of its faces) which makes its presentation extremely simple.

Another way to think about it is, finding generators is a little bit like identifying a basis in linear algebra.

Towards Cayley Diagrams

Definition 5. We are used to specifying groups as set-operator pairs. A presentation is an generator-oriented way to specify the structure of a group. A relator is defined as constraints that apply to generators. A presentation is written \langle \text{generators} \mid \text{relators} \rangle

  • In multiplicative notation: C_4 = \langle r \mid r^3 = e \rangle.
  • In additive notation: C_4 = \langle r \mid 3r = e \rangle.

The = e suffix is often left implicit from presentations (e.g., C_4 = \langle r  \mid r^n \rangle) for the sake of concision.

Definition 6. A Cayley diagram is used to visualize the structure specified by the presentation.  Arrow color represents the generator being followed.

Note that Cayley diagrams can be invariant to your particular choice of generator:

Geometric Group Theory_ Cayley Diagram (1)

The shape of the Cayley diagram explains why C_3 is called a cyclic group, by the way!

With these tools in hand, let’s turn to more complex group structures.

Dihedral Groups

Cyclic groups have rotational symmetry. Dihedral groups have both rotational and reflectional symmetry. The dihedral group that describes the symmetries of a regular n-gon is written D_{n}. Let us consider the “triangle group” D_{3}, generated by a clockwise 120\circ rotation r and a horizontal flip f.

With triangles, we know that three rotations returns to the identity r^3 = e. Similarly, two flips returns to the identity f^2 = e. Is there some combination of rotations and flips that are equivalent to one another? Yes. Consider the following equality:

Geometric Group Theory_ Rotation vs Reflection Equivalence (2)

Analogously, it is also true that rf = fr^2.

Definition 7. Some collection of elements is a generating set if combinations amongst only those elements recreates the entire group.

Cyclic groups distinguish themselves by having only one element in their generating set. Dihedral groups require two generators.

We can write each dihedral group element based on how it was constructed by the generators:

D_n = \left\{ e, r, r^2, \ldots, r^n-1, f, rf, r^2f, \ldots, r^{n-1}f \right\}

Alternatively, we can instead just write the presentation of the group:

D_{3} = \langle r, f \mid r^3 = 1, f^2 = 1, r^2f = fr, rf = fr^2  \rangle.

We can visualize this presentation directly, or as a more abstract Cayley graph:

Geometric Group Theory_ Dihedral Groups Intro (3)

The Cayley table for this dihedral group is:

Geometrical Group Theory- Dihedral Cayley Table

This shows that D_3 is not abelian: its multiplication table is not symmetric about the diagonal.

By looking at the color groupings, one might suspect it is possible to summarize this 6 \times 6 table with a 2 \times 2 table. We will explore this intuition further, when we discuss quotients.

Until next time.

Wrapping Up


  • Finite groups can be analyzed with Cayley tables (aka multiplication tables).
  • The same group can have more than one set-operation expressions (e.g., modular arithmetic vs. roots of unity vs. rotational symmetry).
  • Generators, elements from which the rest of the set can be generated, are a useful way to think about groups.
  • Group presentation is an alternate way to describing group structure. We can represent presentation visually with the help of a Cayley diagram.
  • Cyclic groups (e.g., C_3) have one generator; whereas dihedral groups (e.g., D_3) have two.

Related Resources

  • This post is based on Professor Macaulay’s Visual Group Theory lectures, which in turn is based on Nathan Carter’s eponymous textbook.
  • Related to this style of teaching group theory are Dana Ernst’s lecture notes.
  • If you want to see explore finite groups with software, Group Explorer is excellent.
  • For a more traditional approach to the subject, these Harvard lectures are a good resource.

Isotopy in Ambient Manifolds

Part Of: Analysis sequence
Followup To: An Introduction to Topology
Content Summary: 1300 words, 13 min read

Degenerate Geometries

Two lines can either be parallel, or not. There exist unending variations of both situation. But which is more common, on the average?

Consider two lines y_{1} = m_{1}x + b_{1} and y_{2} = m_{2}x + b_{2} When do we call these lines parallel? When their slopes are equal m_{1} = m_{2}. We can gain insight into the situation by mapping parameter space, where m_{1} and m_{2} form the horizontal and vertical axes respectively.

The red line does not represent one line, but an infinite set of parallel lines where m_{1} = m_{2}.

Isotopy Invariance_ Degenerate Geometries Parallel Lines

Suppose we start somewhere on the red line (with some pair of parallel lines)$. Perturbation of the slopes of these lines corresponds to a random walk beginning at that point. The more you walk around on the plane, the more likely you are to stand in green territory (slopes whose lines are not parallel).

Definition 1. A situation holds generically if, by that perturbing its constituent properties, it tends to default to that situation.

This concept applies in many situations. For example:

  • In two dimensions, two lines might be parallel, but generically intersect at a point.
  • In three dimensions, two planes might be parallel, but generically intersect in a line.
  • In linear algebra, a matrix might be singular is its determinant is equal to zero. But the determinant becomes non-zero on perturbation of individual matrix values. Thus, matrices are generically invertible.

This concept has also been called general position.

Manifold Intersection & Overflow

We now turn to the general science of intersection. The following observations hold generically:

  • In two dimensions, a point and another point do not intersect.
  • In two dimensions, a point and a line do not intersect.
  • In two dimensions, a line and a line do intersect, at a point.
  • In three dimensions, a line and another line do not intersect.
  • In three dimensions, a line and a plane do intersect, at a point.
  • In three dimensions, a plane and another plane do intersect, at a line.

Points, lines, planes… these are manifolds! Some of them infinitely large, but manifolds nonetheless! Let’s use the language of topology to look for patterns.

Each of the above examples contains two submanifolds K and L being placed in an ambient manifold M. We denote their intersection as K \cap L. Let us compare the dimensions of these three manifolds to the dimension of their overlap.

We represent dim(M), dim(K), dim(L), dim(K \cap L) as m, k, l, k \cap l respectively. Now our examples can be expressed as 4-tuples (m, k, l, k \cap l):

  • (2, 0, 0, \varnothing)
  • (2, 0, 1, \varnothing)
  • (2, 1, 1, 0)
  • (3, 1, 1, \varnothing)
  • (3, 1, 2, 0)
  • (3, 2, 2, 1)

In four dimensions, would we expect two planes to intersect; and if so, what would we expect the dimension of the intersection? Put differently, what would we predict to be the value of x in (4, 2, 2, x)?

If you guessed x=0, that the two planes intersect at a point, you have noticed the pattern!

Definition 2. Consider two submanifolds embedded in an ambient manifold K, L \subseteq M. The overflow is defined as o = (k+l) - m.

Theorem 3. Let K, L \subseteq M. The following properties are true, generically:

  • If o < 0, the submanifolds do not intersect: K \cap L = \varnothing
  • If o \geq 0, the intersection is non-empty K \cap L \neq \varnothing, and dim( K \cap L )= o

To see why this is the case, consider basis vectors in linear algebra. An m-dimensional space requires an m-dimensional basis vector. Submanifold dimensions are then “placed within” the ambient basis. If we try to minimize the overlap between two submanifolds, the equation for overflow falls out of the picture. 🙂

Isotopy_ Linear Algebra Basis and Overflow

Overflow during Motion

We have considered overflow in static submanifolds. But what if we move one of them?

The following observations hold generically:

  • In two dimensions, moving a point across a point do not intersect.
  • In two dimensions, moving a point across a line intersect, at a point.
  • In two dimensions, moving a line across another line intersect, across the entire line.
  • In three dimensions, moving a line across another line intersect, at a point.
  • In three dimensions, moving a line across a plane intersect, at a line.
  • In three dimensions, moving a plane across another plane intersect, across the entire plane.

Compare these “motion” (m, k, l, k \cap l) tuples against our previous tuples:

  • (2, 0, 0, \varnothing) vs. (2, 0, 0, \varnothing)
  • (2, 0, 1, 0) vs. (2, 0, 1, \varnothing)
  • (2, 1, 1, 1) vs. (2, 1, 1, 0)
  • (3, 1, 1, 0) vs. (3, 1, 1, \varnothing)
  • (3, 1, 2, 1) vs. (3, 1, 2, 0)
  • (3, 2, 2, 2) vs. (3, 2, 2, 1)

The dynamic overflow is one dimension larger than the static overflow. Why? 

The way I like to think about it is, by moving K across time, you are effectively enlisting an extra dimension (time). A moving point starts to look like a string, etc.

Theorem 4.  Let K, L \subseteq M. Suppose we want to move K from one side to the other side of L. This crossing is said to be topologically possible if K and L do not intersect at any point during the transition. The possibility of a crossing depends on the overflow o:

  • If o < -1, then generically, a crossing is possible (at all times, K \cap L = \varnothing)
  • If o \geq -1, then generically, a crossing is not possible (at some time, K \cap L \neq \varnothing and dim( K \cap L )= o

Towards Isotopy

Let’s put this theorem to work.

Example 5. In what R^m is the following movement possible?

Isotopy_ Two Line Crossover

We know from physical experience that this is not possible in our three-dimensional universe R^{3}. But Theorem 4 says that, if the ambient dimension is four, then the overflow is (1+1) - 4 = -2 < -1, so crossing is possible.

Intuitively, this makes sense. In three dimensions, a point can “hop” over a line by moving into the “extra” dimension. Similarly, the line K can cross over L by moving into the fourth dimension.

We can generalize this notion of successful crossing as follows:

Definition 6. Imagine moving submanifold K through an ambient manifold M. Let K_{0} and K_{1} represent the beginning and end positions as it travels throughout time t \in [0,1]. If K_{t} never has self-intersection at any time t, we say K_{0} is isotopic to K_{1} in M, and write K_{0} \sim_{M} K_{1}

Isotopy is especially relevant to knot theory. A classic example is the trefoil knot, a simple kind of knot. The string trefoil knot composed of 1-dimensional string is not isotopic to the unknot (1-torus) in \mathbb{R}^3: this is why it is called non-trivial.

Example 7. Show how the trefoil knot can isotope to the unknot.

  1. From K_{0} to K_{0.5}: we can unwind the trefoil knot in \mathbb{R}^4: simply lift the top-left string up.
  2. From K^{0.5} to K^{1}: you only need \mathbb{R}^3 to finish the job. Simply pull the top loop down, and then untwist.

Isotopy_ Unknotting Trefoil (1)

So knot theory is only interesting in \mathbb{R}^3! In \mathbb{R}^2, knots self-intersect. In \mathbb{R}^4, they are all isotopic to the unknot.

More Isotopy Examples

Let’s get some more practice under our belt.

Example 8. Consider a genus-two torus with its holes intertwined. What ambient manifold do we require to undo the knot?

Isotope_ Solution in R6 (1)

We might imagine the solution isotope would require us to pull apart the “hands” directly. And it is true that we can successfully isotope in M = \mathbb{R}^{6}; after all o = (2+2) - 6 = -2 < -1.

But do we really need six dimensions to get this done? Or can we do better? It turns out that we only really need M = \mathbb{R}^{3}:

Isotope_ Solution in R3

As the number of ambient dimensions increases, finding an isotope becomes increasingly easy. Thus, we often strive to find the smallest possible ambient manifold.

Example 9. Consider a genus-3 torus (\Sigma^{3}). If we isotope along its surface using M = \Sigma^{3}, is K_{1} \sim K_{2}? How about L_{1} \sim L_{2}

Isotopy_ Genus 3 Torus (1)

It is easy to see that K_{1} \sim_{\Sigma^{3}} K{2}. You just pull the circle left, along that surface of the torus.

But it takes some time to see that L_{1} \nsim L{2}. You might suspect you can just pull the blue circle over the middle hole. But that would require leaving the surface \Sigma^{3}. Thus L_{1} \nsim_{\Sigma^{3}} L{2} (but L_{1} \sim_{\mathbb{R}^{3}} L{2}).

Related Materials

This post is based on Dr. Tadashi Tokeida’s excellent lecture series, Topology & Geometry. For more details, check it out!

An Introduction to Set Theory

Part Of: Algebra sequence
Content Summary: 1800 words, 18 min read

Fundamentals of Sets

Definition 1. A set is a collection of distinct objects.

A few examples to kick us off:

  • MyFavoriteFruits = \left\{ apples, persimmon, pineapple \right\} represents the set of fruits which I prefer.
  • A = \left\{ 1, 2, 3, 4, 5 \right\} can represent, among other things, the fingers on my left hand.
  • The set of natural numbers \mathbb{N} = \left\{ 0, 1, 2, 3, 4, ... \right\}.
  • The set of integers \mathbb{Z} = \left\{ ..., -2, -1, 0, 1, 2, ... \right\}.

Sets are subject to the following properties:

  • Order blindness: \left\{2, 1, 3\right\} and \left\{1, 2, 3\right\} expresses the very same set.
  • Duplicate blindness. \left\{ 1, 1, 2, 3, 3 \right\} = \left\{1, 2, 3 \right\}. We will prefer to express sets with the latter, more compact, notation.
  • Recursion. \left\{ 1, \left\{2, 3\right\} \right\} is a perfectly valid two-element set, quite distinct from the three-element set \left\{ 1, 2, 3 \right\}.

Definition 2. Two sets A and B are said to be equal (A = B) if A and B contain exactly the same elements.

  • Let A = \left\{ 1, 2, 3 \right\} and B = \left\{ 3, 1, 2 \right\}. Then, A = B.
  • Let A = \left\{ 1, 2, 3 \right\} and B = \left\{ 1, \left\{2, 3\right\} \right\}. Then, A \neq B.

Definition 3. If an object x is an element of (i.e., member of) set S, we write x \in S. Otherwise, we write x \notin S.

  • Let PrimaryColors = \left\{ red, yellow, blue \right\}. Then yellow \in PrimaryColors means “yellow is an element of the set of primary colors”.
  • -1 \notin \mathbb{N} means “-1 is not an element of the natural numbers”.
  • 1 \notin B = \left\{ 0, \left\{ 1 \right\}, \left\{ \left\{ 2 \right\} \right\} \right\}. The element 1 is not in B: only the set \left\{ 1 \right\} is.

Definition 4. For some set X, its cardinality (i.e., size) |X|, is the number of elements in that set.

  • Let A = \left\{ 1, 2, 3, 4, 5 \right\}. Then |A| = 5.
  • Let B = \left\{ 1, \left\{2, 3 \right\} \right\}. Then |B| = 2. Note that cardinality only looks at “the outer layer”.

Definition 5. The empty set (i.e., the null set) \varnothing is the set containing no elements.

  • \varnothing = \left\{ \right\}.
  • | \varnothing | = 0.
  • | \left\{ \varnothing \right\} | = 1

Definition 6. Instead of listing all elements, set builder notation specifies rules or properties that govern the construction of a set. The notation takes the following form: { x | \normalfont{property of} x }, where the | symbol is pronounced “such that”.

  • A = \left\{ 1, 2, 3, 4, 5 \right\} = \left\{ x \in \mathbb{Z} | x > 0, x < 6 \right\}. In words “let A be the set of integers X such that x is greater than zero and less than six.”
  • The set of rational numbers \mathbb{Q} = \left\{ x / y \mid x \in \mathbb{Z}, y \in \mathbb{Z}, y \neq 0 \right\}.

Sets defined by their properties are often called set comprehension. Such properties or rules are called the intension, the resultant set is called the extension.

  • Let A = \left\{ y | x \in \mathbb{R}, y = (x+1)(x-1) \right\} and let B = \left\{ y | x \in \mathbb{R}, y = x^{2} -1 \right\}. Here A = B, despite their use of different rules. We say that A and B have the same extension, but different intensions.

The intension-extension tradeoff denotes an inverse relationship between the number of intensional rules versus the size of the set those rules pick out. Let’s consider two examples to motivate this tradeoff.

Consider hierarchical addressing in computer architecture. Suppose we have 2^{6} = 64 bits of computer memory, each bit of which is uniquely identified with a 6-bit address. Suppose further that our memory has been allocated to data structures of varying size. To promote addressing efficiency, a computer can adopt the following strategy: assign shorter addresses to larger variables.

Naive Set Theory- Hierarchical Addressing

We can also see the intension vs. extension tradeoff in the memory systems of the brain. Specifically, semantic memory is organized into a concept hierarchy. We might classify a Valentine’s day gift according to the following tree:

Naive Set Theory- Concept Hierarchy

The number of objects classified as a RED_ROSE is clearly less than the number of objects classified as LIVING_THING. But as our extensional size decreases, the size of our intension (the number of properties needed to classify a RED_ROSE) increases.

Subsets and Power set

Definition 7. A set A is a subset of another set B, written A \subseteq B, if every element of A is also an element of B.

  • Let A = \left\{ 2, 3, 9 \right\} and B = \left\{1,9,3,6, 2\right\}. Then A \subseteq B (recall that element order is irrelevant).
  • Let A = \left\{ 2, 3, 9 \right\} and C = \left\{1,7, 3, 6, 2 \right\}. Then A \nsubseteq C (C does not contain 9).
  • Is \varnothing a subset of \left\{2,3,9\right\}?  Yes. For all A, \varnothing \subseteq A is true.

Definition 8. A set A is a proper subset of another set B, written A \subset B, if A \subseteq B and A \neq B.

Definition 9. For a given set A, its power set \mathbb{P} (A) is the set of all subsets of A.

  • Let A = \left\{ 0, 1 \right\}. Then \mathbb{P}(A) = \left\{ \varnothing ,  \left\{ 0 \right\} , \left\{ 1 \right\}, \left\{ 0, 1 \right\} \right\}.

A power set can be constructed by the use of a binary tree, as follows:

Naive Set Theory- Building Powersets (1)

As can be seen above, the total number of subsets must be a power of two. Specifically, if |A| = n, then |\mathbb{P}(A)| = 2^n.

It is important to get clear on the differences between the element-of ( \in ) versus subset-of ( \subseteq ) relations. Consider again A = \left\{ 0, 1 \right\} and its power set \mathbb{P}(A) = \left\{ \varnothing ,  \left\{ 0 \right\} , \left\{ 1 \right\}, \left\{ 0, 1 \right\} \right\}

  • A \in \mathbb{P}(A). But \left\{A\right\} \not\in \mathbb{P}(A). The \in relation requires the brackets match.
  • \left\{A\right\} \subseteq \mathbb{P}(A). But A \nsubseteq \mathbb{P}(A). The \subseteq relation requires the “extra bracket”.

Probability theory is intimately connected with the notion of power set. Why? Because many discrete probability distributions have \sigma-algebras draw from the power set of natural numbers \mathbb{P} ( \mathbb{N} ).

Cartesian Product, Tuples

Definition 10. Given two sets A and B, their Cartesian product A \times B is the set of all ordered pairs \langle a, b \rangle such that a \in A and b \in B. Note that, unlike the elements in a set, the elements of an ordered pair cannot be reordered.

  • Let A = \left\{1, 2, 3 \right\} and B = \left\{4, 5, 6\right\}. Then A \times B = \left\{ (1, 4) , (1, 5), (1,6), (2, 4), (2, 5), (2, 6),  (3, 4), (3, 5), (3, 6) \right\}.

We can represent this same example visually, as follows:

Naive Set Theory- Cartesian Product (1)

  • Contrast this with B \times A = \left\{ (3, 1) , (4, 1), (2, 4), (1, 3) \right\}. Thus, A \times B \neq B \times A. This is because elements within ordered pairs cannot be rearranged.
  • Note that |A \times B| = 9 = |A_{1}| \times |A_{2}|. In combinatorics, this observation generalizes to the multiplication principle.
  • The real plane \mathbb{R}^2 = \mathbb{R} \times \mathbb{R} is a well-known example of a Cartesian product.

Definition 11. Given n sets, A_{1}, A_{2}, \ldots , A_{n}, their Cartesian product is the set of all n-tuples.

  • Let A = \left\{1, 2\right\}, B = \left\{a, b\right\} and C = \left\{ 100 \right\}. Now A \times B \times C = \left\{ (1, a, 100) , (1, b, 100), (2, a, 100), (2, b, 100) \right\}.

Linear algebra is intimately connected with the Cartesian product operation. Why? Because n-tuples are strongly related to n-dimensional vectors. 

Intersection and Union

Definition 12. The intersection of two set A and B, written A \cap B , is the set of elements common to both sets.

  • Let A = \left\{ 2, 4, 9 \right\} and B = \left\{1, 2, 3, 6, 9 \right\}. Then A \cap B = \left\{ 2, 9 \right\}.
  • Let A = \left\{ 2, 4, 9 \right\} and B = \left\{1, 3, 6, 8, 10\right\}. Then A \cap B = \varnothing.
  • Let A = \left\{ 2, 4, 9 \right\}. Then \varnothing \cap A = A \cap \varnothing = \varnothing.

Definition 13. The union of two sets A and B, written A \cup B is the set of all elements that are in A or B, or both.

  • Let A = \left\{ 2, 4, 9 \right\} and B = \left\{1, 2, 3, 6, 9 \right\}. Then A \cup B = \left\{ 1, 2, 3, 4, 6, 9 \right\}.
  • Let A = \left\{ 2, 4, 9 \right\}. Then \varnothing \cup A = A \cup \varnothing = A.

Venn diagrams represent sets as enclosed areas in a 2D plane. If two (or more!) sets have shared elements, their areas overlap. We can use this technique to visualize sets and their overlap:

Naive Set Theory- Elements vs Venn Diagram

We can also use Venn diagrams to represent our intersection and union relations:

Naive Set Theory- Union vs Intersection Venn (1)

Note that |A \cup B| = |A| + |B| - |A \cap B|. This makes sense in light of the Venn diagram. Adding the cardinality of both sets counts the elements that exist in the middle section twice. To avoid this, we subtract the cardinality of the intersection. In combinatorics, this formula is generalized by the inclusion-exclusion principle

Difference and Complement

Definition 14. Given two sets A and B, their difference A \setminus B is the set of elements in A but not also in B.

  • Let A = \left\{ 1, 2, 3, 4, 5, 6 \right\} and B = \left\{0, 2, 4, 6, 8 \right\}. Then A \setminus B = \left\{ 1, 3, 5 \right\} and B \setminus A = \left\{ 0, 8 \right\}.
  • Let A = \left\{ 1, 2, 3, 4, 5, 6 \right\} and B = \left\{7, 8, 9, 10 \right\}. Then A \setminus B = A and B \setminus A = B.

Definition 15. Given two sets A and B, the symmetric difference A \triangle B is the set of elements in A or B, but not both.

  • Let A = \left\{ 1, 2, 3, 4, 5, 6 \right\} and B = \left\{0, 2, 4, 6, 8 \right\}. Then A \triangle B = \left\{ 0, 1, 3, 5, 8 \right\}.

Definition 16. In many set problems, all sets are defined as subsets of some reference set. This reference set is called the universe U.

  • Let A = \left\{ 1+i, 12-8j, 3+0i \right\} and let its universe be the set of complex numbers \mathbb{C}. It is true that A \subseteq \mathbb{C}.

Definition 17. Relative to a universe U, the complement of A, written \overline{A} is the set of all elements of U not contained in A.

  • Let U be the set of positive integers less than 10: U = \left\{ x | x \in \mathbb{Z}^{+}, x < 10 \right\} and A = \left\{ 1, 2, 3, 4, 5 \right\}. Then \overline{A} = \left\{ 6, 7, 8, 9 \right\}.

We can again represent these relations graphically, via Venn diagrams:

Naive Set Theory- Complement and Difference Venn (2)




Let me summarize this post in terms of our 17 definitions 🙂

  • Def 1-5 introduced the notion of set, set equality, the element-of operator, cardinality (set size), and empty set.
  • Def 6 introduced set builder notation, and the intension-extension tradeoff.
  • Def 7-9 introduced subset, proper subset, and power set.
  • Def 10-11 introduced Cartesian product, ordered pairs, and n-tuples.
  • Def 12-13 introduced intersection (“and”) and union (“or”), as well as Venn diagrams.
  • Def 14-17 introduced difference, symmetric difference (“xor”), and complement (“not”).

Want to learn more? I recommend the following resources:

This introductory article focused on promoting intuitions through worked examples. Next time, we’ll look at these same operations more carefully, and examine the relationship between set theory and classical predicate logic.

An Introduction to Topology

Part Of: Analysis sequence
Content Summary: 1000 words, 10 min read

Motivating Example

Can you draw three lines connecting A to A, B to B, and C to C?  The catch: the lines must stay on the disc, and they cannot intersect.

Topology- Motivating Problem (2)

Here are two attempts at a solution:

Topology- Potential Solutions (1)

Both attempts fail. In the first, there is no way for the Bs and Cs to cross the A line. In the second, we have made more progress… but connecting C is impossible.

Does any solution exist? It is hard to see how…

Consider a simplified puzzle. Let’s swap the inner points B and C.

Topology- Original vs Easy Puzzle (2)

In the new puzzle, the solution is easy: just draw straight lines between the pairs!

To understand where this solution breaks down, let’s use continuous deformation (i.e., homeomorphism) to transform this easier puzzle back to the original. In other words, let’s swap point B towards C, while not dropping the “strings” of our solution lines:


Deformation has led us to the solution! Note what just happened: we solved an easy problem, and than “pulled” that solution to give us insight into a harder problem.

As we will see, the power of continuous deformation extends far beyond puzzle-solving. It resides at the heart of topology, one of mathematics’ most important disciplines.

Manifolds: Balls vs Surfaces

The subject of arithmetic is the number. Analogously, in topology, manifolds are our objects. We can distinguish two kinds of primitive manifold: balls and surfaces.

Topology- Balls and Surfaces (1)

These categories generalize ideas from elementary school:

  • A 1-ball B^1 is a line segment
  • A 2-ball B^2 is a disc
  • S^1 is a circle
  • S^2 is a sphere

Note the difference between volumes and their surfaces. Do not confuse e.g., a disc with a circle. The boundary operation \partial makes the volume-surface relationship explicit. For example, we say that \partial B^2 = S^1.

Note that surfaces are one dimension below their corresponding volume. For example, a disc resides on a plane, but a circle can be unrolled to fit within a line.

Importantly, an m-ball and an m-cube are considered equivalent! After all, they can be deformed into one another. This is the reason for the old joke:

A topologist cannot tell the difference between a coffee cup and a donut. Why? Because both objects are equivalent under homeomorphism:


If numbers are the objects of arithmetic, operations like multiplication act on these numbers. Topological operations include product, division, and connected sum. Let us address each in turn.

On Product

The product (x) operation takes two manifolds of dimension m and n, and returns a manifold of dimension m+n. A couple examples to whet your appetite:

Topology- Examples of Product (1)

These formulae only show manifolds of small dimension. But the product operation can just as easily construct e.g. a 39-ball as follows:

B^{39} = \prod_{i=1}^{39} I^1

How does product relate to our boundary operator? By the following formula:

\partial (M x N) = ( \partial M x N) \cup (M x \partial N )

This equation, deeply analogous to the product rule in calculus, becomes much more clear by inspection of an example:
Topology- Product vs Boundary (1)

On Division

Division ( / ) glues together the boundaries of a single manifold. For example, a torus can be created from the rectangle I^{2}:


We will use arrows to specify which edges are to be identified. Arrows with the same color and shape must be glued together (in whatever order you see fit).

Topology- Division Simple Examples (2)

Alternatively, we can specify division algebraically. In the following equation, x=0 means “left side of cylinder” and x=1 means right side:

S^1 x I^1 = Cylinder = \frac{I^2}{(0,y) \sim (1, y) \forall y}

The Möbius strip is rather famous for being non-orientable: it neither has an inside nor an outside. As M.C. Escher once observed, an ant walking on its surface would have to travel two revolutions before returning to its original orientation.

More manifolds that can be created by division on I^{2}. To construct a Klein bottle by division, you take a cylinder, twist it, and fold it back on itself:

Topology- Klein Bottle Construction (5)

In our illustration, there is a circle boundary denoting the location of self-intersection. Topologically, however, the Klein bottle need not intersect itself. It is only immersion in 3-space that causes this paradox.

Our last example of I^{2} division is the real projective plane RP^{2}. This is even more difficult to visualize in 3-space, but there is a trick: cut I^{2} again. As long as we glue both pieces together along the blue line, we haven’t changed the object. 

Topology- Deriving Real Projective Plane First Part (1)

The top portion becomes a Möbius strip; the bottom becomes a disc. We can deform a disc into a sphere with a hole in it. Normally, we would want to fill in this hole with another disc. However, we only have a Möbius strip available.

But Möbius strips are similar to discs, in that its boundary is a single loop. Because we can’t visualize this “Möbius disc” directly, I will represent it with a wheel-like symbol.  Let us call this special disc by a new name: the cross cap.

The real projective plane, then, is a cross cap glued into the hole of a sphere.  It is like a torus; except instead of a handle, it has an “anomaly” on its surface.

Topology- Deriving Real Projective Plane Second Part

These then, are our five “fundamental examples” of division:

Topology- Division Overview (3)

On Connected Sum

Division involves gluing together parts of a single manifold. Connected sum (#), also called surgery, involves gluing two m-dimensional manifolds together. To accomplish this, take both manifolds, remove an m-ball from each, and identify (glue together) the boundaries of the holes. In other words:

\frac{ ( M_1 / B_1 ) \cup ( M_2 / B_2 ) }{ \partial ( M_1 / B_1 ) \sim \partial ( M_2 / B_2 )} = M_1 \# M_2

Let’s now see a couple examples. If we glue tori together, we can increase the number of holes in our manifold. If we attach a torus with a real projective plane, we acquire a manifold with holes and cross-cuts.
Topology- Connected Sum examples (3)


  • Topology, aka. “rubber sheet geometry”, is the study of malleable objects & spaces.
  • In topology, manifolds represent objects in n-dimensional space.
  • Manifolds either represent volumes (e.g., disc) and boundaries (e.g., circles)
  • Manifolds are considered equivalent if a homeomorphism connects them.
  • There are three basic topological operations:
    • Product (x) is a dimension-raising operation (e.g., square can become a cube).
    • Division (/) is a gluing operation, binding together parts of a single manifold.
    • Connected sum (#) i.e., surgery describes how to glue two manifolds together.

Related Materials

This post is based on Dr. Tadashi Tokeida’s excellent lecture series, Topology & Geometry. For more details, check it out!

OLS Estimation via Projection

Part Of: Machine Learning sequence
Content Summary: 800 words, 8 min read

Projection as Geometric Approximation

If we have a vector b and a line determined by vector a, how do we find the point on the line that is closest to b?

OLS- Geometry of Projection

The closest point p is at the intersection formed by a line through b that is orthogonal to a. If we think of p as an approximation to b, then the length of e = b - p is the error of that approximation.

a^T e = a^T (b - ax) = 0

This formula captures projection onto a vector. But what if you want to project to a higher dimensional surface?

Imagine a plane, whose basis vectors are a_1 and a_2. This plane can be described with a matrix, by mapping the basis vectors onto its column space:

A = \begin{bmatrix} a_1 & a_2 \end{bmatrix}

Suppose we want to project vector b onto this plane. We can use the same orthogonality principle as before:

A^Te = A^T(b-Ax) = 0

A^TAx = A^Tb

Matrices like A^TA are self-transpositions. We have shown that such matrices are square symmetric, and thereby contain positive, real eigenvalues.

We shall assume that the columns of A^TA are independent, and it thereby is invertible. The inverse thereby allows us to solve for x:

(A^TA)^{-1}(A^TA)x = (A^TA)^{-1}A^Tb

x = (A^TA)^{-1}A^Tb

Recall that,

p = Ax = A(A^TA)^{-1}A^Tb

Since matrices are linear transformations (functions that operate on vectors), it is natural to express the problem in terms of a projection matrix P, that accepts a vector b, and outputs the approximating vector p:

p = Pb

By combining these two formula, we solve for P:

P = A(A^TA)^{-1}A^T

Thus, we have two perspectives on the same underlying formula:

OLS- Regression Functions via Matrices (1)

Linear Regression via Projection

We have previously noted that machine learning attempts to approximate the shape of the data. Prediction functions include classification (discrete output) and regression (continuous output).

Consider an example with three data points. Can we predict the price of the next item, given its size?

OLS- Regression Function Setup

For these data, a linear regression function will take the following form:

\psi : Size \rightarrow Price

\psi(Size) = \beta_0 + \beta_1 Size

We can thus interpret linear regression as an attempt to solve Ax=b:

OLS- Linear Algebra Regression Matrix

In this example, we have more data than parameters (3 vs 2). In real-world problems, it is an extremely common predicament. It yields matrices with may more equations than unknowns. This means that Ax=b has no solution (unless all data happen to fall on a straight line).

If exact solutions are impossible, we can still hope for an approximating solution. Perhaps we can find a vector p that best approximates b. More formally, we desire some p = A\bar{x} such that the error e = b-p is minimized.

Since projection is a form of approximation, we can use a projection matrix to construct our linear prediction function \psi : Size \rightarrow Price.

OLS- Least Squares Fundamental Spaces

A Worked Example

The solution is to make the error b-Ax as small as possible. Since Ax can never leave the column space, choose the closest point to b in that subspace. This point is the projection p. Then the error vector e = b-p has minimal length.

To repeat, the best combination p = Ax is the projection of b onto the column space. The error is perpendicular to that subspace. Therefore e = b-p is in the left nullspace:

Ax = b

A^TA = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \\ \end{bmatrix}

We can use Guass-Jordan Elimination to compute the inversion:

(A^TA)^{-1} = \begin{bmatrix} 7/3 & -1 \\ -1 & 1/2 \\ \end{bmatrix}

A useful intermediate quantity is as follows:

(A^TA)^{-1}A^T = \begin{bmatrix} 7/3 & -1 \\ -1 & 1/2 \\ \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ \end{bmatrix} = \begin{bmatrix} 4/3 & 1/3 & -2/3 \\ -1/2 & 0 & 1/2 \\ \end{bmatrix}

We are now able to compute the parameters of our model, \bar{x}:

\bar{x} = \left[ (A^TA)^{-1}A^T \right] b = \begin{bmatrix} 4/3 & 1/3 & -2/3 \\ -1/2 & 0 & 1/2 \\ \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 2 \\ \end{bmatrix} = \begin{bmatrix} 2/3 \\ 1/2 \\ \end{bmatrix}

These parameters generate a predictive function with the following structure:

\psi : Size \rightarrow Price

\psi(Size) = \frac{2}{3} + \frac{1}{2}Size

These values correspond with the line that best fits our original data!

Wrapping Up


  • In linear algebra, projection approximates a high-dimensional surface in a lower-dimensional space. The projection error can be measured.
  • In linear regression, we usually cannot solve Ax=b, because there tends to be more data than parameters (b is not in the column space)
  • We can find the closest vector in the column space by projecting onto b, and minimizing the projection error.
  • Thus, the operation of projection can be used to perform parameter estimation, and produce a model that best approximates the training data.

Related Resources: