An Introduction to Set Theory

Part Of: Algebra sequence
Content Summary: 1800 words, 18 min read

Fundamentals of Sets

Definition 1. A set is a collection of distinct objects.

A few examples to kick us off:

  • MyFavoriteFruits = \left\{ apples, persimmon, pineapple \right\} represents the set of fruits which I prefer.
  • A = \left\{ 1, 2, 3, 4, 5 \right\} can represent, among other things, the fingers on my left hand.
  • The set of natural numbers \mathbb{N} = \left\{ 0, 1, 2, 3, 4, ... \right\}.
  • The set of integers \mathbb{Z} = \left\{ ..., -2, -1, 0, 1, 2, ... \right\}.

Sets are subject to the following properties:

  • Order blindness: \left\{2, 1, 3\right\} and \left\{1, 2, 3\right\} expresses the very same set.
  • Duplicate blindness. \left\{ 1, 1, 2, 3, 3 \right\} = \left\{1, 2, 3 \right\}. We will prefer to express sets with the latter, more compact, notation.
  • Recursion. \left\{ 1, \left\{2, 3\right\} \right\} is a perfectly valid two-element set, quite distinct from the three-element set \left\{ 1, 2, 3 \right\}.

Definition 2. Two sets A and B are said to be equal (A = B) if A and B contain exactly the same elements.

  • Let A = \left\{ 1, 2, 3 \right\} and B = \left\{ 3, 1, 2 \right\}. Then, A = B.
  • Let A = \left\{ 1, 2, 3 \right\} and B = \left\{ 1, \left\{2, 3\right\} \right\}. Then, A \neq B.

Definition 3. If an object x is an element of (i.e., member of) set S, we write x \in S. Otherwise, we write x \notin S.

  • Let PrimaryColors = \left\{ red, yellow, blue \right\}. Then yellow \in PrimaryColors means “yellow is an element of the set of primary colors”.
  • -1 \notin \mathbb{N} means “-1 is not an element of the natural numbers”.
  • 1 \notin B = \left\{ 0, \left\{ 1 \right\}, \left\{ \left\{ 2 \right\} \right\} \right\}. The element 1 is not in B: only the set \left\{ 1 \right\} is.

Definition 4. For some set X, its cardinality (i.e., size) |X|, is the number of elements in that set.

  • Let A = \left\{ 1, 2, 3, 4, 5 \right\}. Then |A| = 5.
  • Let B = \left\{ 1, \left\{2, 3 \right\} \right\}. Then |B| = 2. Note that cardinality only looks at “the outer layer”.

Definition 5. The empty set (i.e., the null set) \varnothing is the set containing no elements.

  • \varnothing = \left\{ \right\}.
  • | \varnothing | = 0.
  • | \left\{ \varnothing \right\} | = 1

Definition 6. Instead of listing all elements, set builder notation specifies rules or properties that govern the construction of a set. The notation takes the following form: { x | \normalfont{property of} x }, where the | symbol is pronounced “such that”.

  • A = \left\{ 1, 2, 3, 4, 5 \right\} = \left\{ x \in \mathbb{Z} | x > 0, x < 6 \right\}. In words “let A be the set of integers X such that x is greater than zero and less than six.”
  • The set of rational numbers \mathbb{Q} = \left\{ x / y \mid x \in \mathbb{Z}, y \in \mathbb{Z}, y \neq 0 \right\}.

Sets defined by their properties are often called set comprehension. Such properties or rules are called the intension, the resultant set is called the extension.

  • Let A = \left\{ y | x \in \mathbb{R}, y = (x+1)(x-1) \right\} and let B = \left\{ y | x \in \mathbb{R}, y = x^{2} -1 \right\}. Here A = B, despite their use of different rules. We say that A and B have the same extension, but different intensions.

The intension-extension tradeoff denotes an inverse relationship between the number of intensional rules versus the size of the set those rules pick out. Let’s consider two examples to motivate this tradeoff.

Consider hierarchical addressing in computer architecture. Suppose we have 2^{6} = 64 bits of computer memory, each bit of which is uniquely identified with a 6-bit address. Suppose further that our memory has been allocated to data structures of varying size. To promote addressing efficiency, a computer can adopt the following strategy: assign shorter addresses to larger variables.

Naive Set Theory- Hierarchical Addressing

We can also see the intension vs. extension tradeoff in the memory systems of the brain. Specifically, semantic memory is organized into a concept hierarchy. We might classify a Valentine’s day gift according to the following tree:

Naive Set Theory- Concept Hierarchy

The number of objects classified as a RED_ROSE is clearly less than the number of objects classified as LIVING_THING. But as our extensional size decreases, the size of our intension (the number of properties needed to classify a RED_ROSE) increases.

Subsets and Power set

Definition 7. A set A is a subset of another set B, written A \subseteq B, if every element of A is also an element of B.

  • Let A = \left\{ 2, 3, 9 \right\} and B = \left\{1,9,3,6, 2\right\}. Then A \subseteq B (recall that element order is irrelevant).
  • Let A = \left\{ 2, 3, 9 \right\} and C = \left\{1,7, 3, 6, 2 \right\}. Then A \nsubseteq C (C does not contain 9).
  • Is \varnothing a subset of \left\{2,3,9\right\}?  Yes. For all A, \varnothing \subseteq A is true.

Definition 8. A set A is a proper subset of another set B, written A \subset B, if A \subseteq B and A \neq B.

Definition 9. For a given set A, its power set \mathbb{P} (A) is the set of all subsets of A.

  • Let A = \left\{ 0, 1 \right\}. Then \mathbb{P}(A) = \left\{ \varnothing ,  \left\{ 0 \right\} , \left\{ 1 \right\}, \left\{ 0, 1 \right\} \right\}.

A power set can be constructed by the use of a binary tree, as follows:

Naive Set Theory- Building Powersets (1)

As can be seen above, the total number of subsets must be a power of two. Specifically, if |A| = n, then |\mathbb{P}(A)| = 2^n.

It is important to get clear on the differences between the element-of ( \in ) versus subset-of ( \subseteq ) relations. Consider again A = \left\{ 0, 1 \right\} and its power set \mathbb{P}(A) = \left\{ \varnothing ,  \left\{ 0 \right\} , \left\{ 1 \right\}, \left\{ 0, 1 \right\} \right\}

  • A \in \mathbb{P}(A). But \left\{A\right\} \not\in \mathbb{P}(A). The \in relation requires the brackets match.
  • \left\{A\right\} \subseteq \mathbb{P}(A). But A \nsubseteq \mathbb{P}(A). The \subseteq relation requires the “extra bracket”.

Probability theory is intimately connected with the notion of power set. Why? Because many discrete probability distributions have \sigma-algebras draw from the power set of natural numbers \mathbb{P} ( \mathbb{N} ).

Cartesian Product, Tuples

Definition 10. Given two sets A and B, their Cartesian product A \times B is the set of all ordered pairs \langle a, b \rangle such that a \in A and b \in B. Note that, unlike the elements in a set, the elements of an ordered pair cannot be reordered.

  • Let A = \left\{1, 2, 3 \right\} and B = \left\{4, 5, 6\right\}. Then A \times B = \left\{ (1, 4) , (1, 5), (1,6), (2, 4), (2, 5), (2, 6),  (3, 4), (3, 5), (3, 6) \right\}.

We can represent this same example visually, as follows:

Naive Set Theory- Cartesian Product (1)

  • Contrast this with B \times A = \left\{ (3, 1) , (4, 1), (2, 4), (1, 3) \right\}. Thus, A \times B \neq B \times A. This is because elements within ordered pairs cannot be rearranged.
  • Note that |A \times B| = 9 = |A_{1}| \times |A_{2}|. In combinatorics, this observation generalizes to the multiplication principle.
  • The real plane \mathbb{R}^2 = \mathbb{R} \times \mathbb{R} is a well-known example of a Cartesian product.

Definition 11. Given n sets, A_{1}, A_{2}, \ldots , A_{n}, their Cartesian product is the set of all n-tuples.

  • Let A = \left\{1, 2\right\}, B = \left\{a, b\right\} and C = \left\{ 100 \right\}. Now A \times B \times C = \left\{ (1, a, 100) , (1, b, 100), (2, a, 100), (2, b, 100) \right\}.

Linear algebra is intimately connected with the Cartesian product operation. Why? Because n-tuples are strongly related to n-dimensional vectors. 

Intersection and Union

Definition 12. The intersection of two set A and B, written A \cap B , is the set of elements common to both sets.

  • Let A = \left\{ 2, 4, 9 \right\} and B = \left\{1, 2, 3, 6, 9 \right\}. Then A \cap B = \left\{ 2, 9 \right\}.
  • Let A = \left\{ 2, 4, 9 \right\} and B = \left\{1, 3, 6, 8, 10\right\}. Then A \cap B = \varnothing.
  • Let A = \left\{ 2, 4, 9 \right\}. Then \varnothing \cap A = A \cap \varnothing = \varnothing.

Definition 13. The union of two sets A and B, written A \cup B is the set of all elements that are in A or B, or both.

  • Let A = \left\{ 2, 4, 9 \right\} and B = \left\{1, 2, 3, 6, 9 \right\}. Then A \cup B = \left\{ 1, 2, 3, 4, 6, 9 \right\}.
  • Let A = \left\{ 2, 4, 9 \right\}. Then \varnothing \cup A = A \cup \varnothing = A.

Venn diagrams represent sets as enclosed areas in a 2D plane. If two (or more!) sets have shared elements, their areas overlap. We can use this technique to visualize sets and their overlap:

Naive Set Theory- Elements vs Venn Diagram

We can also use Venn diagrams to represent our intersection and union relations:

Naive Set Theory- Union vs Intersection Venn (1)

Note that |A \cup B| = |A| + |B| - |A \cap B|. This makes sense in light of the Venn diagram. Adding the cardinality of both sets counts the elements that exist in the middle section twice. To avoid this, we subtract the cardinality of the intersection. In combinatorics, this formula is generalized by the inclusion-exclusion principle

Difference and Complement

Definition 14. Given two sets A and B, their difference A \setminus B is the set of elements in A but not also in B.

  • Let A = \left\{ 1, 2, 3, 4, 5, 6 \right\} and B = \left\{0, 2, 4, 6, 8 \right\}. Then A \setminus B = \left\{ 1, 3, 5 \right\} and B \setminus A = \left\{ 0, 8 \right\}.
  • Let A = \left\{ 1, 2, 3, 4, 5, 6 \right\} and B = \left\{7, 8, 9, 10 \right\}. Then A \setminus B = A and B \setminus A = B.

Definition 15. Given two sets A and B, the symmetric difference A \triangle B is the set of elements in A or B, but not both.

  • Let A = \left\{ 1, 2, 3, 4, 5, 6 \right\} and B = \left\{0, 2, 4, 6, 8 \right\}. Then A \triangle B = \left\{ 0, 1, 3, 5, 8 \right\}.

Definition 16. In many set problems, all sets are defined as subsets of some reference set. This reference set is called the universe U.

  • Let A = \left\{ 1+i, 12-8j, 3+0i \right\} and let its universe be the set of complex numbers \mathbb{C}. It is true that A \subseteq \mathbb{C}.

Definition 17. Relative to a universe U, the complement of A, written \overline{A} is the set of all elements of U not contained in A.

  • Let U be the set of positive integers less than 10: U = \left\{ x | x \in \mathbb{Z}^{+}, x < 10 \right\} and A = \left\{ 1, 2, 3, 4, 5 \right\}. Then \overline{A} = \left\{ 6, 7, 8, 9 \right\}.

We can again represent these relations graphically, via Venn diagrams:

Naive Set Theory- Complement and Difference Venn (2)




Let me summarize this post in terms of our 17 definitions 🙂

  • Def 1-5 introduced the notion of set, set equality, the element-of operator, cardinality (set size), and empty set.
  • Def 6 introduced set builder notation, and the intension-extension tradeoff.
  • Def 7-9 introduced subset, proper subset, and power set.
  • Def 10-11 introduced Cartesian product, ordered pairs, and n-tuples.
  • Def 12-13 introduced intersection (“and”) and union (“or”), as well as Venn diagrams.
  • Def 14-17 introduced difference, symmetric difference (“xor”), and complement (“not”).

Want to learn more? I recommend the following resources:

This introductory article focused on promoting intuitions through worked examples. Next time, we’ll look at these same operations more carefully, and examine the relationship between set theory and classical predicate logic.

An Introduction to Topology

Part Of: Analysis sequence
Content Summary: 1000 words, 10 min read

Motivating Example

Can you draw three lines connecting A to A, B to B, and C to C?  The catch: the lines must stay on the disc, and they cannot intersect.

Topology- Motivating Problem (2)

Here are two attempts at a solution:

Topology- Potential Solutions (1)

Both attempts fail. In the first, there is no way for the Bs and Cs to cross the A line. In the second, we have made more progress… but connecting C is impossible.

Does any solution exist? It is hard to see how…

Consider a simplified puzzle. Let’s swap the inner points B and C.

Topology- Original vs Easy Puzzle (2)

In the new puzzle, the solution is easy: just draw straight lines between the pairs!

To understand where this solution breaks down, let’s use continuous deformation (i.e., homeomorphism) to transform this easier puzzle back to the original. In other words, let’s swap point B towards C, while not dropping the “strings” of our solution lines:


Deformation has led us to the solution! Note what just happened: we solved an easy problem, and than “pulled” that solution to give us insight into a harder problem.

As we will see, the power of continuous deformation extends far beyond puzzle-solving. It resides at the heart of topology, one of mathematics’ most important disciplines.

Manifolds: Balls vs Surfaces

The subject of arithmetic is the number. Analogously, in topology, manifolds are our objects. We can distinguish two kinds of primitive manifold: balls and surfaces.

Topology- Balls and Surfaces (1)

These categories generalize ideas from elementary school:

  • A 1-ball B^1 is a line segment
  • A 2-ball B^2 is a disc
  • S^1 is a circle
  • S^2 is a sphere

Note the difference between volumes and their surfaces. Do not confuse e.g., a disc with a circle. The boundary operation \partial makes the volume-surface relationship explicit. For example, we say that \partial B^2 = S^1.

Note that surfaces are one dimension below their corresponding volume. For example, a disc resides on a plane, but a circle can be unrolled to fit within a line.

Importantly, an m-ball and an m-cube are considered equivalent! After all, they can be deformed into one another. This is the reason for the old joke:

A topologist cannot tell the difference between a coffee cup and a donut. Why? Because both objects are equivalent under homeomorphism:


If numbers are the objects of arithmetic, operations like multiplication act on these numbers. Topological operations include product, division, and connected sum. Let us address each in turn.

On Product

The product (x) operation takes two manifolds of dimension m and n, and returns a manifold of dimension m+n. A couple examples to whet your appetite:

Topology- Examples of Product (1)

These formulae only show manifolds of small dimension. But the product operation can just as easily construct e.g. a 39-ball as follows:

B^{39} = \prod_{i=1}^{39} I^1

How does product relate to our boundary operator? By the following formula:

\partial (M x N) = ( \partial M x N) \cup (M x \partial N )

This equation, deeply analogous to the product rule in calculus, becomes much more clear by inspection of an example:
Topology- Product vs Boundary (1)

On Division

Division ( / ) glues together the boundaries of a single manifold. For example, a torus can be created from the rectangle I^{2}:


We will use arrows to specify which edges are to be identified. Arrows with the same color and shape must be glued together (in whatever order you see fit).

Topology- Division Simple Examples (2)

Alternatively, we can specify division algebraically. In the following equation, x=0 means “left side of cylinder” and x=1 means right side:

S^1 x I^1 = Cylinder = \frac{I^2}{(0,y) \sim (1, y) \forall y}

The Möbius strip is rather famous for being non-orientable: it neither has an inside nor an outside. As M.C. Escher once observed, an ant walking on its surface would have to travel two revolutions before returning to its original orientation.

More manifolds that can be created by division on I^{2}. To construct a Klein bottle by division, you take a cylinder, twist it, and fold it back on itself:

Topology- Klein Bottle Construction (5)

In our illustration, there is a circle boundary denoting the location of self-intersection. Topologically, however, the Klein bottle need not intersect itself. It is only immersion in 3-space that causes this paradox.

Our last example of I^{2} division is the real projective plane RP^{2}. This is even more difficult to visualize in 3-space, but there is a trick: cut I^{2} again. As long as we glue both pieces together along the blue line, we haven’t changed the object. 

Topology- Deriving Real Projective Plane First Part (1)

The top portion becomes a Möbius strip; the bottom becomes a disc. We can deform a disc into a sphere with a hole in it. Normally, we would want to fill in this hole with another disc. However, we only have a Möbius strip available.

But Möbius strips are similar to discs, in that its boundary is a single loop. Because we can’t visualize this “Möbius disc” directly, I will represent it with a wheel-like symbol.  Let us call this special disc by a new name: the cross cap.

The real projective plane, then, is a cross cap glued into the hole of a sphere.  It is like a torus; except instead of a handle, it has an “anomaly” on its surface.

Topology- Deriving Real Projective Plane Second Part

These then, are our five “fundamental examples” of division:

Topology- Division Overview (3)

On Connected Sum

Division involves gluing together parts of a single manifold. Connected sum (#), also called surgery, involves gluing two m-dimensional manifolds together. To accomplish this, take both manifolds, remove an m-ball from each, and identify (glue together) the boundaries of the holes. In other words:

\frac{ ( M_1 / B_1 ) \cup ( M_2 / B_2 ) }{ \partial ( M_1 / B_1 ) \sim \partial ( M_2 / B_2 )} = M_1 \# M_2

Let’s now see a couple examples. If we glue tori together, we can increase the number of holes in our manifold. If we attach a torus with a real projective plane, we acquire a manifold with holes and cross-cuts.
Topology- Connected Sum examples (3)


  • Topology, aka. “rubber sheet geometry”, is the study of malleable objects & spaces.
  • In topology, manifolds represent objects in n-dimensional space.
  • Manifolds either represent volumes (e.g., disc) and boundaries (e.g., circles)
  • Manifolds are considered equivalent if a homeomorphism connects them.
  • There are three basic topological operations:
    • Product (x) is a dimension-raising operation (e.g., square can become a cube).
    • Division (/) is a gluing operation, binding together parts of a single manifold.
    • Connected sum (#) i.e., surgery describes how to glue two manifolds together.

Related Materials

This post is based on Dr. Tadashi Tokeida’s excellent lecture series, Topology & Geometry. For more details, check it out!

OLS Estimation via Projection

Part Of: Machine Learning sequence
Content Summary: 800 words, 8 min read

Projection as Geometric Approximation

If we have a vector b and a line determined by vector a, how do we find the point on the line that is closest to b?

OLS- Geometry of Projection

The closest point p is at the intersection formed by a line through b that is orthogonal to a. If we think of p as an approximation to b, then the length of e = b - p is the error of that approximation.

a^T e = a^T (b - ax) = 0

This formula captures projection onto a vector. But what if you want to project to a higher dimensional surface?

Imagine a plane, whose basis vectors are a_1 and a_2. This plane can be described with a matrix, by mapping the basis vectors onto its column space:

A = \begin{bmatrix} a_1 & a_2 \end{bmatrix}

Suppose we want to project vector b onto this plane. We can use the same orthogonality principle as before:

A^Te = A^T(b-Ax) = 0

A^TAx = A^Tb

Matrices like A^TA are self-transpositions. We have shown that such matrices are square symmetric, and thereby contain positive, real eigenvalues.

We shall assume that the columns of A^TA are independent, and it thereby is invertible. The inverse thereby allows us to solve for x:

(A^TA)^{-1}(A^TA)x = (A^TA)^{-1}A^Tb

x = (A^TA)^{-1}A^Tb

Recall that,

p = Ax = A(A^TA)^{-1}A^Tb

Since matrices are linear transformations (functions that operate on vectors), it is natural to express the problem in terms of a projection matrix P, that accepts a vector b, and outputs the approximating vector p:

p = Pb

By combining these two formula, we solve for P:

P = A(A^TA)^{-1}A^T

Thus, we have two perspectives on the same underlying formula:

OLS- Regression Functions via Matrices (1)

Linear Regression via Projection

We have previously noted that machine learning attempts to approximate the shape of the data. Prediction functions include classification (discrete output) and regression (continuous output).

Consider an example with three data points. Can we predict the price of the next item, given its size?

OLS- Regression Function Setup

For these data, a linear regression function will take the following form:

\psi : Size \rightarrow Price

\psi(Size) = \beta_0 + \beta_1 Size

We can thus interpret linear regression as an attempt to solve Ax=b:

OLS- Linear Algebra Regression Matrix

In this example, we have more data than parameters (3 vs 2). In real-world problems, it is an extremely common predicament. It yields matrices with may more equations than unknowns. This means that Ax=b has no solution (unless all data happen to fall on a straight line).

If exact solutions are impossible, we can still hope for an approximating solution. Perhaps we can find a vector p that best approximates b. More formally, we desire some p = A\bar{x} such that the error e = b-p is minimized.

Since projection is a form of approximation, we can use a projection matrix to construct our linear prediction function \psi : Size \rightarrow Price.

OLS- Least Squares Fundamental Spaces

A Worked Example

The solution is to make the error b-Ax as small as possible. Since Ax can never leave the column space, choose the closest point to b in that subspace. This point is the projection p. Then the error vector e = b-p has minimal length.

To repeat, the best combination p = Ax is the projection of b onto the column space. The error is perpendicular to that subspace. Therefore e = b-p is in the left nullspace:

Ax = b

A^TA = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \\ \end{bmatrix}

We can use Guass-Jordan Elimination to compute the inversion:

(A^TA)^{-1} = \begin{bmatrix} 7/3 & -1 \\ -1 & 1/2 \\ \end{bmatrix}

A useful intermediate quantity is as follows:

(A^TA)^{-1}A^T = \begin{bmatrix} 7/3 & -1 \\ -1 & 1/2 \\ \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ \end{bmatrix} = \begin{bmatrix} 4/3 & 1/3 & -2/3 \\ -1/2 & 0 & 1/2 \\ \end{bmatrix}

We are now able to compute the parameters of our model, \bar{x}:

\bar{x} = \left[ (A^TA)^{-1}A^T \right] b = \begin{bmatrix} 4/3 & 1/3 & -2/3 \\ -1/2 & 0 & 1/2 \\ \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 2 \\ \end{bmatrix} = \begin{bmatrix} 2/3 \\ 1/2 \\ \end{bmatrix}

These parameters generate a predictive function with the following structure:

\psi : Size \rightarrow Price

\psi(Size) = \frac{2}{3} + \frac{1}{2}Size

These values correspond with the line that best fits our original data!

Wrapping Up


  • In linear algebra, projection approximates a high-dimensional surface in a lower-dimensional space. The projection error can be measured.
  • In linear regression, we usually cannot solve Ax=b, because there tends to be more data than parameters (b is not in the column space)
  • We can find the closest vector in the column space by projecting onto b, and minimizing the projection error.
  • Thus, the operation of projection can be used to perform parameter estimation, and produce a model that best approximates the training data.

Related Resources:

Singular Value Decomposition

Part Of: Algebra sequence
Followup To: Eigenvalues and Eigenvectors
Content Summary: 1300 words, 13 min read.

Limitations of Eigendecomposition

Last time, we learned how to locate eigenvalues and eigenvectors of a given matrix. Diagonalization is the process where a square matrix can be decomposed into two factors: a matrix Q (with eigenvectors along the columns) and a matrix \Lambda (with eigenvalues along the diagonal).

A = Q \Lambda Q^T

But as we saw with the spectral theorem, eigendecomposition only works well against square, symmetric matrices. If a matrix isn’t symmetric, it is easy to run into complex eigenvalues. And if a matrix isn’t square, you are out of luck entirely!

Eigendecomposition- Spectral Theorem (1)

Can we generalize eigendecomposition to apply to a wider family of matrices? Can we diagonalize matrices of any size, even if they don’t have “nice” properties?

Yes. Self-transposition is the key insight of our “eigendecomposition 2.0”. We define the self-transpositions of A as AA^{T} and A^{T}A.

Suppose A \in \mathbb{R}^{m x n}. Then AA^T \in \mathbb{R}^{mxm} and A^TA \in \mathbb{R}^{nxn}. So these matrices are square. But they are also symmetric!

To illustrate, consider the following.

A = \begin{bmatrix}  4 & 4 \\ -3 & 3 \\ \end{bmatrix}

Since A is not symmetric, we have no guarantee that its eigenvalues are real. Indeed, its eigenvalues turn out to be complex:

\det(A - \lambda I) = \begin{bmatrix} 4 - \lambda & 4 \\ -3 & 3 - \lambda \\ \end{bmatrix} = 0

(12 - 7 \lambda + \lambda^2) + 12 = 0 \Rightarrow \lambda^2 -7 \lambda + 24 = 0

\lambda = \frac{7 \pm \sqrt{(-7)^2 - 4*1*24}}{2*1} = \frac{7}{2} \pm \frac{\sqrt{47}i}{2}

Eigendecomposition on A sucks. Are the self-transposed matrices any better?

A^TA = \begin{bmatrix} 4 & -3 \\ 4 & 3 \\ \end{bmatrix} \begin{bmatrix} 4 & 4 \\ -3 & 3 \\ \end{bmatrix} = \begin{bmatrix} 25 & 7 \\ 7 & 25 \\ \end{bmatrix}

AA^T = \begin{bmatrix} 4 & 4 \\ -3 & 3 \\ \end{bmatrix} \begin{bmatrix} 4 & -3 \\ 4 & 3 \\ \end{bmatrix} = \begin{bmatrix} 32 & 0 \\ 0 & 18 \\ \end{bmatrix}

These matrices are symmetric! Thus, they are better candidates for eigendecomposition.

Towards Singular Value Decomposition

Singular Value Decomposition (SVD) is based on the principle that all matrices are eigendecomposable after self-transposition. It is essentially a bug fix:

SVD- Self-Transposition

An important way to picture SVD is with the idea of orthogonal bases. It is relatively easy to find any number of orthogonal bases for a given rowspace. Call the matrix of orthogonal vectors V.

We desire to find an orthogonal basis V such that AV produces an orthogonal basis in the column space. Orthogonal bases are not particularly hard to find. But most orthogonal bases, once projected to column space, will lose their orthogonality! We desire that particular orthogonal basis U such that V is also orthogonal.

We won’t require vectors in U to be the same size as those in V. Instead, we will normalize V; its basis vectors will be orthonormal (orthogonal and normal). Then the length of each vector in U will diverge by a scaling factor.

As we will soon see, these scaling factors are not eigenvalues. Instead, we will use sigmas instead of lambdas.

  • Scaling factors \sigma , analogous to eigenvalues \lambda.
  • Diagonal matrix \Sigma , analogous to diagonal matrix \Lambda

Our full picture then, looks like this:

SVD- Orthogonal Basis Transfer

Let us now translate this image into matrix language.

A \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ v_1 & v_2 & \dots & v_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix} = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ \sigma_1u_1 & \sigma_2u_2 & \dots & \sigma_nu_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix}

But we can easily factorize the right-hand side:

\begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ \sigma_1u_1 & \sigma_2u_2 & \dots & \sigma_nu_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix} = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ u_1 & u_2 & \dots & u_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix} * \begin{bmatrix} \sigma_1 & 0 & \dots & 0 \\ 0 & \sigma_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_n \\ \end{bmatrix}

So we have that:

AV = U \Sigma

Since both V and U are orthogonal, inversion of either is equivalent to transposition:

A = U \Sigma V^{-1} = U \Sigma V^T

This strongly resembles our diagonalization equation A = Q \Lambda Q^T. SVD distinguishes itself by considering two orthogonal eigenmatrices U and V, not just one (Q).

Recalling that A = U \Sigma V^T ,

A^TA = (V \Sigma^T U^T) (U \Sigma V^T)

But now the innermost term cancels. Since \Sigma is a square diagonal matrix, its self-transposition is simply equal to \Sigma^{2}. So,

A^TA = V \Sigma^2 V^T

Since A^{T}A is a square, symmetric matrix, our diagonalization theorem applies!

A^TA = V \Sigma ^2 V^T = Q \Lambda Q^T

To find U, a similar trick works:

AA^T =  (U \Sigma V^T)(V \Sigma^T U^T) = U \Sigma^2 U^T = Q \Lambda Q^T

The relationships between SVD and eigendecomposition are as follows:

  • V is the eigenvectors of A^TA
  • U is the eigenvectors of AA^T
  • \Sigma is the square root of the eigenvalues matrix \Lambda

If any eigenvalue is negative, the corresponding sigma factor would be complex. But A^TA and AA^T are positive-semidefinite, which guarantees non-negative eigenvalues. This assures us that \Sigma contains only real values.

In contrast to eigendecomposition, every matrix has an SVD decomposition. Geometrically, V and U act as rotational transformations, and Sigma acts as a scaling transformation. In other words, every linear transformation comprises a rotation, then scaling, then another rotation.

SVD- Rotation Scaling Rotation (1)

A Worked Example

Let’s revisit A. Recall that:

A = \begin{bmatrix} 4 & 4 \\ -3 & 3 \\ \end{bmatrix}, A^TA = \begin{bmatrix} 25 & 7 \\ 7 & 25 \\ \end{bmatrix}, AA^T = \begin{bmatrix} 32 & 0 \\ 0 & 18 \\ \end{bmatrix}

Eigendecomposition against A is unpleasant because A is not symmetric. But A^{T}A is guaranteed to be positive semi-definite; that is, to have non-negative eigenvalues. Let’s see this in action.

det(A^TA - \lambda I) = \begin{bmatrix} 25 - \lambda & 7 \\ 7 & 25 - \lambda \\ \end{bmatrix} = 0

\lambda^2 - 50 \lambda + 576 = 0 \Rightarrow (\lambda_1 - 32)(\lambda_2 - 18) = 0

\lambda_1 = 32, \lambda_2 = 18

trace(A^TA) = 50 = \sum{\lambda_i}

det(A^TA) = 625-49 = 576 = 18 * 32 = \prod{\lambda_i}

These are positive, real eigenvalues. Perfect! Let’s now derive the corresponding (normalized) eigenvectors.

A - 32I = \begin{bmatrix} -7 & 7 \\ 7 & -7 \\ \end{bmatrix} = 0, A - 18I = \begin{bmatrix} 7 & 7 \\ 7 & 7 \\ \end{bmatrix} = 0

rref(A - 32I) = \begin{bmatrix} -1 & 1 \\ 0 & 0 \\ \end{bmatrix} = 0, rref(A-18I) = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ \end{bmatrix} = 0

v_1 = \begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \\ \end{bmatrix}, v_2 = \begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{-1}{\sqrt{2}} \\ \end{bmatrix}

SVD intends to decompose A into U \Sigma V^{T}. The above findings give us two of these ingredients.

V^{T} = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{-1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \end{bmatrix}

\Sigma =\begin{bmatrix} \sqrt{32} & 0 \\ 0 & \sqrt{18} \\ \end{bmatrix}

What’s missing? U! To find it, we perform eigendecomposition on AA^{T}. This is an especially easy task, because AA^{T} is already a diagonal matrix.

AA^T = \begin{bmatrix} 32 & 0 \\ 0 & 18 \\ \end{bmatrix}

U = \begin{bmatrix} u_1 & u_2 \\ \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix}

We have arrived at our first Singular Value Decomposition.

A = U \Sigma V^T = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix} \begin{bmatrix} \sqrt{32} & 0 \\ 0 & \sqrt{18} \\ \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & - \frac{1}{\sqrt{2}} \\ \end{bmatrix}

Okay, so let’s check our work. 😛

A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix} \left( \begin{bmatrix} \sqrt{32} & 0 \\ 0 & \sqrt{18} \\ \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & - \frac{1}{\sqrt{2}} \\ \end{bmatrix} \right) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix} \begin{bmatrix} 4 & 4 \\ 3 & 3 \\ \end{bmatrix} = \begin{bmatrix} 4 & 4 \\ 3 & 3 \\ \end{bmatrix}

These matrices are a viable factorization: multiplication successfully recovers A.


  • Eigendecomposition only works for a subclass of matrices; SVD decomposes all matrices.
  • SVD relies on self-transposition to convert any arbitrary matrix into one that works well against eigendecomposition (guarantees square m = n and symmetric A = A^{T}).
  • Another way to interpret SVD is by taking a special kind of orthogonal basis that, once passed through the linear transformation, preserves its orthogonality.
  • Every matrix A = U \Sigma V^{T}. That is, every linear transformation can be conceived as rotation + scaling + rotation.

Until next time.

Eigenvalues and Eigenvectors

Part Of: Algebra sequence
Followup To: An Introduction to Linear Algebra
Next Up: Singular Value Decomposition
Content Summary: 1300 words, 13 min read

Geometries of Eigenvectors

Matrices are functions that act on vectors, by mapping from row-vectors to column-vectors.  Consider two examples:

  1. Reflection matrices, which reflect vectors across some basis.
  2. Rotation matrices, which rotate vectors clockwise by \theta degrees.


The set of eigenvectors of a matrix A is a special set of input vectors for which the matrix behaves as a scaling transformation. In other words, we desire the set of vectors \vec{x} whose output vectors A\vec{x} differ by a scaling factor.

Eigenvectors have a straightforward geometric interpretation:

  1. Reflection eigenvectors are orthogonal or parallel to the reflecting surface. In the left image above, that is the top two pairs of vectors.
  2. Rotation eigenvectors do not exist (more formally, cannot be visualized in \mathbb{R}^2).

Algebra of Eigenvectors

We can express our “parallel output” property as:

A\vec{x} = \lambda \vec{x}

Thus \vec{x} and A\vec{x} point in the same direction, but differ by scaling factor \lambda.

Scaling factor \lambda is the eigenvalue. There can be many \left( x, \lambda \right) pairs that satisfy the above equality.

For an \mathbb{R}^{n x n} matrix, there are n eigenvalues. These eigenvalues can be difficult to find. However, two facts aid our search:

  • The sum of eigenvalues equals the trace (sum of values along the diagonal).
  • The product of eigenvalues equals the determinant.

To solve, subtract \lambda \vec{x} from both sides:

A\vec{x} = \lambda \vec{x}

(A - \lambda I)\vec{x} = 0

We would like to identify n unique eigenvectors. But if the new matrix (A - \lambda I) has an empty nullspace, it will contain zero eigenvectors. So we desire this new matrix to be singular.

How to accomplish this?  By finding eigenvalues that satisfy the characteristic equation \det(A - \lambda I) = 0. Matrices are singular iff their determinants equal zero.

Let’s work through an example! What is the eigendecompositon for matrix A:

A = \begin{bmatrix} 3 & 1 \\ 1 & 3 \\ \end{bmatrix}

We need to find eigenvalues that solve the characteristic equation.

\det(A - \lambda I) = \begin{vmatrix} 3-\lambda & 1 \\ 1 & 3-\lambda \\ \end{vmatrix} = 0

(3 - \lambda)^2 - 1^2 = \lambda_2 -6\lambda + 8 = (\lambda-2)(\lambda-4) = 0

\lambda_1 = 2, \lambda_2 = 4

Are these eigenvalues correct? Let’s check our work:

trace(A) = 6 = \sum{\lambda_i}

det(A) = 8 = \prod{\lambda_i}

How to find our eigenvectors? By solving the nullspace given each eigenvalue.

For \lambda_1=2 :

A - 2I = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ \end{bmatrix} \Rightarrow rref(A - 2I) = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ \end{bmatrix}

(\lambda_1, \vec{x}_1) = (2, \begin{bmatrix} 1 \\ -1 \\ \end{bmatrix})

For \lambda_2=4 :

A - 4I = \begin{bmatrix} -1 & 1 \\ 1 & -1 \\ \end{bmatrix} \Rightarrow rref(A - 4I) = \begin{bmatrix} -1 & 1 \\ 0 & 0 \\ \end{bmatrix}

(\lambda_2, \vec{x}_2) = (4, \begin{bmatrix} 1 \\ 1 \\ \end{bmatrix})

Desirable Matrix Properties

The above example was fairly straightforward. But eigendecomposition can “go awry”, as we shall see. Consider a rotation matrix, which in two dimensions has the following form:

R = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \\ \end{bmatrix}

What are the eigenvalues for rotation \theta = 90^{\circ} ?

R = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ \end{bmatrix}

\det(R - \lambda I) = \begin{bmatrix} - \lambda & -1 \\ 1 & - \lambda \\ \end{bmatrix} = 0

(- \lambda)^2 - 1^2 \Rightarrow \lambda^2 = 1

\lambda_1 = i, \lambda_2 = -i

We can check our work:

trace(R) = 0 = \sum{\lambda_i}

det(R) = 1 = \prod{\lambda_i}

We saw earlier that rotation matrices have no geometric interpretation. Here, we have algebraically shown that its eigenvalues are complex.

A = \left[ \begin{smallmatrix} 3 & 1 \\ 1 & 3 \\ \end{smallmatrix} \right] has real eigenvalues, but R = \left[ \begin{smallmatrix} 0 & -1 \\ 1 & 0 \\ \end{smallmatrix} \right] has less-desirable complex eigenvalues.

We can generalize the distinction between A and R as follows:

Spectral Theorem. Any matrix that is symmetric (A = AT) is guaranteed to have real, nonnegative eigenvalues. The corresponding n eigenvectors are guaranteed to be orthogonal.

In other words, eigendecomposition works best against symmetric matrices.

Eigendecomposition- Spectral Theorem (1)


Let us place each eigenvector in the column of a matrix S. What happens when you multiply the original matrix A by this new matrix? Since S contains eigenvectors, multiplication by A reduces to multiplication by the associated eigenvalues:

AS = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ \lambda_1x_1 & \lambda_2x_2 & \dots & \lambda_nx_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix}

We see the product contains a mixture of eigenvalues and eigenvectors. We can separate these by “pulling out” the eigenvalues into a diagonal matrix. Call this matrix \Lambda (“capital lambda”).

AS = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ x_1 & x_2 & \dots & x_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix} * \begin{bmatrix} \lambda_1 & 0 & \dots & 0 \\ 0 & \lambda_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \lambda_n \\ \end{bmatrix} = S \Lambda

Most matrices have the property that its eigenvectors are linearly independent. For such matrices, S is invertible. Given this fact, we can solve for A:

\Lambda = S^{-1} A S

A = S \Lambda S^{-1}

Matrices that can be factorized in this way are said to be diagonalizable. We can see that both elimination and eigendecomposition are performing the same type of work: factorizing matrices into their component parts.

If A is symmetric, then we know \Lambda is real, and its eigenvectors in S are orthogonal. Let us rename S to be Q, to reflect this additional property. But orthogonal matrices have the property that transposition equats inversion: Q^T = Q^{-1}. Thus, if A is symmetric, we can simplify the diagonalization formula to:

A = Q \Lambda Q^{-1} = Q \Lambda Q^T

Asymptoptic Interpretations

This diagonalization approach illustrates an important use case of eigenvectors: power matrices. What happens when A is applied arbitrarily many times? What does the output look like in the limit?

We can use the diagonalization equation to represent A^k:

A^k = \prod_{i=1}^{k} (Q^{-1} \Lambda Q) = (Q^{-1} \Lambda Q)(Q^{-1} A Q)\dots(Q^{-1} \Lambda Q)

We can simplify by canceling the inner terms QQ^{-1}:

A^k = Q^{-1} \Lambda^k Q

This equation tells us that the eigenvectors is invariant to how many times A is applied. In contrast, eigenvalue matrix \Lambda has important implications for ongoing processes:

  • If each eigenvalue has magnitude less than one, the output will trend towards zero.
  • If each eigenvalue has magnitude greater than one, the output will trend to infinity.

Fibonacci Eigenvalues

The powers interpretation of eigenvalues sheds light on the behavior of all linear processes. This includes number sequences such as the Fibonacci numbers, where each number is the sum of the previous two numbers.

Recall the Fibonacci numbers are 0,1,1,2,3,5,8,13,... What is F_{100} ?

Eigenvalues can answer this question. We must first express the Fibonacci generator as a linear equation:

F(k+2) = 1F(k+1) + 1F(k)

In order to translate this into a meaningful matrix, we must add a “redundant” equation

F(k+1) = 1F(k+1) + 0F(k)

With these equations, we can create a 2×2 Fibonacci matrix F.

F = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ \end{bmatrix}

This matrix uniquely generates Fibonacci numbers.

u_1 = Fu_0 = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ \end{bmatrix}

u_4 = F^4u_0 = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ \end{bmatrix}^4 \begin{bmatrix} 1 \\ 0 \\ \end{bmatrix} = \begin{bmatrix} 5 \\ 3 \\ \end{bmatrix}

To discover the rate at Fibonacci numbers grow, we decompose F into its eigenvalues:

\det(F - \lambda I) = \begin{vmatrix} 1 - \lambda & 1 \\ 1 & 1- \lambda \\ \end{vmatrix} = 0

 \lambda^2 - 2\lambda - 1 = 0

 \lambda_1 = \frac{1 + \sqrt{5}}{2}, \lambda_2 = \frac{1 - \sqrt{5}}{2}

 \lambda_1 = 1.61803, \lambda_2 = -0.61803

trace(F) = 1 = \sum{\lambda_i}

det(F) = -1 = \prod{\lambda_i}

We can go on to discover eigenvectors x_1 and x_2. We can then express the Fibonnaci matrix F as

F = \lambda_1x_1 + \lambda_2 x_2

F^k = \lambda_1^k x_1 + \lambda_2^k x_2

As k goes to infinity, the second term goes to zero. Thus, the ratio is dominated by the larger eigenvalue, 1.61803.

Mathematicians in the audience will recognize this number as the golden ratio.


We have long known that the ratio of successive Fibonnaci numbers converges to 1.61803. Eigenvalues provide a mechanism to derive this value analytically.

Until next time.

Algorithmic Dark Matter: Duality in Linear Programming

Part Of: Optimization sequence
Content Summary: 800 words, 8 min read.

Today, I introduce the concept of duality. Buckle your seat belts! 🙂

Max Flow algorithm

The problem of flow was originally studied in the context of the USSR railway system in the 1930s. The Russian mathematician A.N. Tolstoy published his Methods of finding the minimal total kilometrage in cargo-transportation planning in space, where he formalized the problem as follows.

We interpret transportation graphically. Vertices are interpreted as cities, and edges are railroads connecting two cities. The capacity of each edge was the amount of goods that particular railroad could transport in a given day. The bottleneck is solely the capacities, and not production and consumption. We assume no available storage at the intermediate cities.

The flow allocation problem defines source and termination vertices, s and t. We desire to maximize volume of goods transported s → t. To do this, we label each edge with the amount of goods we intend to ship on that railroad. This quantity, which we will call flow, must respect the following properties:

  • Flow cannot exceed the capacity of the railroad.
  • Flow is conserved: stuff leaving a city must equal the amount of stuff arriving.

Here are two possible solutions to flow allocation:


Solution B improves on A by pushing volume onto the b → c railway. But are there better solutions?

To answer rigorously, we formalize max flow as a linear optimization problem:


The solution to LP tells us that no, eight is the maximum possible flow.

Min Cut algorithm

Consider another, seemingly unrelated, problem we might wish to solve: separability. Let X ⊂ E represent the number of edges you need to remove to eliminate a connection s → t. Here are two such solutions:

Duality- Two Cut Solutions (1).png

Can we do better than B? The answer is no: { (b,t), (c,d) } is the minimum cut possible in the current graph. In linear programming terms, it is the Best Feasible Solution (BFS)

Note that the BFS of minimum cut and the BFS of max flow arrive at the same value. 8 = 8. This is not a coincidence. In fact, these problems are intimately related to one another. What the min cut algorithm is searching for is the bottleneck: the smallest section of the “pipe” from s → t. For complex graphs like this, it is not trivial to derive this answer visually; but the separability algorithm does the work for us.

The deep symmetry between max flow and min cut demonstrates an important mathematical fact. All algorithms come in pairs. For this example, we will call max flow and min cut the primal and dual problems, respectively. We will explore the ramifications for this another time. For now, let’s approach duality from an algebraic perspective.

Finding LP Upper Bound

Consider a linear program with the following objective function:

\max (2x_1 + x_2)

And these constraints

4x_1 + x_2 \leq 6

x_1 + 2x_2 \leq 5

x_1, x_2 \geq 0

This program wants to find the largest solution possible given constraints. Can we provide an upper bound on the solution?

Yes. We can immediately say that the solution is no greater than 6. Why? The objective function, 2x_1 + x_2 is always smaller than 4x_1 + x_2, because we know all variables are positive. So we have an upper bound OPT ≤ 6. We can sharpen this upper bound, by comparing the objective function to other linear combinations of the constraints.  


Different weights to our linear combinations produce different upper bounds:

  • (1,0) → 6
  • (0,1) → 5
  • (⅓, ⅓ ) → 3.67

Let us call these two weights (y_1, y_2). What values of these variables give us the smallest upper bound? Importantly, this is itself an objective function: \min (6y_1 + 5y_2).

But y_1 and y_2 are constrained: they must produce an equation that exceeds 2x_1 + x_2. Thus,

y_1(a) + y_2(b) \geq 2x_1 + x_2

y_1 \left( 4x_1 + x_2 \right) + y_2 \left( 3x_1 + 2x_2 \right) \geq 2x_1 + x_2

\left(4y_1 + 3y_2 \right) x_1 + \left (y_1 + 2y_2 \right) x_2 \geq 2x_1 + x_2

(4y_1 + 3y_2) \geq 2 and (y_1 + 2y_2) \geq 1

This gives us our two constraints. Thus, by looking for the lowest upper bound on our primal LP, we have derived our dual LP:


Note the extraordinary symmetry between primal and dual LPs. The purple & orange values are mirror images of one another. Further, the constraint coefficient matrix has transposed (the 3 has swapped along the diagonal). This symmetry is reflected in the above linear algebra formulae. 

A Theory of Duality

Recall that linear programs have three possible outcomes: infeasible (no solution exists), unbounded (solution exists at +/-∞) or feasible/optimal. Since constraints are nothing more than geometric half-spaces, these possible outcomes reflect three kinds of polyhedra:


The outcome of primal and dual programs are predictably correlated. Of the nine potential pairings, only four can actually occur:

  1. Both P and D are infeasible
  2. P is unbounded and D is infeasible
  3. D is unbounded and P is infeasible
  4. Both are feasible, and there exist optimal solutions

Finally, in the above examples, we saw that the optimal dual value p^* = d^* (8=8). But this is not always the case. In fact, the optimal dual value can be smaller .

We can distinguish between two kinds of duality:

  • Strong duality, where p^* = d^*
  • Weak duality, where p^* - d^* \geq 0


Today, we have illustrated a deep mathematical fact: all problems come in pairs. Next time, we will explore the profound ramifications of this duality principle.

Related Resources: CMU Lecture 5: LP Duality

Entropy as Belief Uncertainty

Part Of: Information Theory sequence
Content Summary: 900 words, 9 min read


What do probabilities mean?

A frequentist believes that they represent frequencies. P(snow) =10% means that on 100 days just like this one, 10 of them will have snow.

A Bayesian, on the other hand, views probability as degree of belief. P(snow) = 10% means that you believe there is a 10% chance it will snow today.

This subjective approach views reasoning as probability (degree of belief) spread over possibility. On this view, Bayes Theorem provides a complete theory of inference:


From this equation, we see how information updates our belief probabilities. Bayesian updating describes this transition from prior to posterior, P(H) → P(H|E).

As evidence accumulates, one’s “belief distributions” tend to become sharply peaked. Here, we see degree of belief in a hockey goalie’s skill, as we observe him play. (Image credit Greater Than Plus Minus):


What does it mean for a distribution to be uncertain? We would like to say that our certainty grows as the distribution sharpens. Unfortunately, probability theory provides no language to quantify this intuition.

This is where information theory comes to the rescue. In 1948 Claude Shannon discovered a unique, unambiguous way to measure probabilistic uncertainty. 

What is this function? And how did he discover it? Let’s find out.  

Desiderata For An Uncertainty Measure

We desire some quantity H(p) which measures the uncertainty of a distribution.  

To derive H, we must specify its desiderata, or what we want it to do. This task may feel daunting. But in fact, very simple conditions already determine H to within a constant factor. 

We require H to meet the following conditions:

  1. Continuous. H(p) is a continuous function.
  2. Monotonic. H(p) for an equiprobable distribution (that is, A(n) = H(1/n, 1/n, 1/n)) is a monotonic increasing function of n.
  3. Compositionally Invariant. If we reorganize X by bundling individual outcomes into single variables (b: X → W), H is unchanged, H(X) = H(W).

Let’s explore compositional invariance in more detail.

Deriving H

Let us consider some variable X that can assume discrete values (x_1, ..., x_n). Our partial understanding of the processes which determine X are the probabilities (p_1, ..., p_n). We would like to find some H(p_1, ..., p_n), which measures the uncertainty of this distribution.

Suppose X has three possible outcomes. We can derive W by combining events xand x3


The uncertainty of X must be invariant to such bundling. So we have that:


The right tree has two distributions p(W) and p(X|W). The uncertainty of two distributions is the sum of each individual uncertainty. Thus we add H(⅔, ⅓). But this distribution is reached only ½ of the time, so we multiply by 0.5.

How does composition affect equiprobable distributions A(n)? Consider a new X with 12 possible outcomes, each equally likely to occur. The uncertainty H(X) = A(12), by definition. Suppose we choose to bundle these branches by (3,5,4). Then we have:


But suppose we choose a different bundling function (4,4,4). This simplifies things:


For what function of A does A(mn) = A(m) + A(n) hold? There is only one solution, as shown in Shannon’s paper:

A(X) = - Klog(X)

K varies with logarithmic base (bits, trits, nats, etc). With this solution we can derive a general formula for entropy H.


X = (x_1, ..., x_n), P(X) = (p_1, ..., p_n)

A(X) = K \log(X) ← Found by uniform bundling (eg., 4,4,4)

A(\sum{n}) = H(X) + \sum\limits_{i} \left( \frac{b_i}{\sum{n}} \right) A(b_i) ← Found by arbitrary bundling (eg., 3,5,4)


Klog(\sum{n_i}) = H(X) + K \sum{p_i \log(n_i)}

K \left[ \sum{p_i \log(\sum{n_i})} - \sum{p_i \log(n_i)} \right]

H = -K \sum{p_i \log\left(\frac{n}{\sum{n_i}} \right)}

We have arrived at our definition of uncertainty, the entropy H(X):

H(X) = -K \sum{p_i \log(p_i)}

To illustrate, consider a coin with bias p.  Our uncertainty is maximized for a fair coin, p = 0.5, and smallest at p = 0.0 (certain tails) or 1.0 (certain heads).


Entropy vs Information

What is the relationship between uncertainty and information? To answer this, we must first understand information.

Consider the number of possible sentences in a book. Is this information? Two books contain exponentially more possible sentences than one book.

When we speak of information, we desire it to scale linearly with its length. Two books should contain approximately twice as much information.

If we take the logarithm of the possible messages W, we can preserve this intuition:

I(X) = K \log(W) = K \sum{P(X)}

Recall that,

H(X) = -K \sum{P_i(X) \log P_i(X)}

From here, we can show that entropy is expected information:

H(X) = \sum{P_i(X) \log P_i(X)}

H = E\langle I \rangle

What does this discovery mean, though?

Imagine a device that produces 3 symbols, A, B, or C. As we wait for the next symbol, we are uncertain which symbol comes next. Once a symbol appears our uncertainty decreases, because we have received more information. Information is a decrease in entropy.

If A, B, and C occur at the same frequency, we should not be surprised to see any one letter. But if P(A) approaches 0, then we will be very surprised to see it appear, and the formula says I(X) approaches ∞. For the receiver of a message, information represents surprisal.

On this interpretation, the above formula becomes clear. Uncertainty is anticipated surprise. If our knowledge is incomplete, we expect surprise. But confident knowledge is “surprised by surprise”. 


The great contribution of information theory lies in a measure for probabilistic uncertainty.

We desire this measure to be continuous, monotonic, and compositionally invariant. There is only one such function, the entropy H:

H(X) = -K \sum{p_i \log(p_i)}

This explains why a broad distribution is more uncertain than one that is narrow.

Henceforth, we will view the words “entropy” and “uncertainty” as synonymous.

Related Works

  • Shannon (1948). A Mathematical Theory of Communication
  • Jaynes (1957). Information Theory and Statistical Mechanics
  • Schneider (1995). Information theory primer