# Singular Value Decomposition

Part Of: Algebra sequence
Followup To: Eigenvalues and Eigenvectors
Content Summary: 1300 words, 13 min read.

Limitations of Eigendecomposition

Last time, we learned how to locate eigenvalues and eigenvectors of a given matrix. Diagonalization is the process where a square matrix can be decomposed into two factors: a matrix $Q$ (with eigenvectors along the columns) and a matrix $\Lambda$ (with eigenvalues along the diagonal).

$A = Q \Lambda Q^T$

But as we saw with the spectral theorem, eigendecomposition only works well against square, symmetric matrices. If a matrix isn’t symmetric, it is easy to run into complex eigenvalues. And if a matrix isn’t square, you are out of luck entirely!

Can we generalize eigendecomposition to apply to a wider family of matrices? Can we diagonalize matrices of any size, even if they don’t have “nice” properties?

Yes. Self-transposition is the key insight of our “eigendecomposition 2.0”. We define the self-transpositions of $A$ as $AA^{T}$ and $A^{T}A$.

Suppose $A \in \mathbb{R}^{m \times n}$. Then $AA^T \in \mathbb{R}^{m \times m}$ and $A^TA \in \mathbb{R}^{n \times n}$. So these matrices are square. But they are also symmetric, since $(A^TA)^T = A^T(A^T)^T = A^TA$, and likewise for $AA^T$!

To illustrate, consider the following.

$A = \begin{bmatrix} 4 & 4 \\ -3 & 3 \\ \end{bmatrix}$

Since A is not symmetric, we have no guarantee that its eigenvalues are real. Indeed, its eigenvalues turn out to be complex:

$\det(A - \lambda I) = \begin{vmatrix} 4 - \lambda & 4 \\ -3 & 3 - \lambda \\ \end{vmatrix} = 0$

$(12 - 7 \lambda + \lambda^2) + 12 = 0 \Rightarrow \lambda^2 -7 \lambda + 24 = 0$

$\lambda = \frac{7 \pm \sqrt{(-7)^2 - 4*1*24}}{2*1} = \frac{7}{2} \pm \frac{\sqrt{47}i}{2}$

Eigendecomposition on $A$ sucks. Are the self-transposed matrices any better?

$A^TA = \begin{bmatrix} 4 & -3 \\ 4 & 3 \\ \end{bmatrix} \begin{bmatrix} 4 & 4 \\ -3 & 3 \\ \end{bmatrix} = \begin{bmatrix} 25 & 7 \\ 7 & 25 \\ \end{bmatrix}$

$AA^T = \begin{bmatrix} 4 & 4 \\ -3 & 3 \\ \end{bmatrix} \begin{bmatrix} 4 & -3 \\ 4 & 3 \\ \end{bmatrix} = \begin{bmatrix} 32 & 0 \\ 0 & 18 \\ \end{bmatrix}$

These matrices are symmetric! Thus, they are better candidates for eigendecomposition.
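
As a quick check of the claim above, here is a minimal Python sketch (plain lists, no libraries) that forms both self-transpositions and confirms they are square and symmetric. The helper names and the non-square matrix `B` are illustrative assumptions, not part of the original example.

```python
def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def is_symmetric(m):
    """A matrix is symmetric when it equals its own transpose."""
    return m == transpose(m)

# The running example from the text.
A = [[4, 4], [-3, 3]]
AtA = matmul(transpose(A), A)
AAt = matmul(A, transpose(A))
print(AtA, is_symmetric(AtA))  # [[25, 7], [7, 25]] True
print(AAt, is_symmetric(AAt))  # [[32, 0], [0, 18]] True

# A hypothetical non-square matrix (2 x 3): both products are still
# square and symmetric, just of different sizes.
B = [[1, 2, 3], [4, 5, 6]]
BtB = matmul(transpose(B), B)  # 3 x 3
BBt = matmul(B, transpose(B))  # 2 x 2
print(len(BtB), len(BBt), is_symmetric(BtB), is_symmetric(BBt))
```

The non-square case is the interesting one: `B` has no eigendecomposition at all, yet both of its self-transpositions do.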

Towards Singular Value Decomposition

Singular Value Decomposition (SVD) is based on the principle that all matrices are eigendecomposable after self-transposition. It is essentially a bug fix:

An important way to picture SVD is with the idea of orthogonal bases. It is relatively easy to find any number of orthogonal bases for a given rowspace. Call the matrix of orthogonal vectors $V$.

We desire an orthogonal basis $V$ of the row space such that $AV$ is an orthogonal basis of the column space. Orthogonal bases are not particularly hard to find. But most orthogonal bases, once mapped into the column space by $A$, lose their orthogonality! We want that particular orthogonal basis $V$ whose image $U$ is also orthogonal.

We won’t require the image vectors $Av_i$ to have the same length as the $v_i$. Instead, we normalize both bases, so the vectors in $V$ and $U$ are orthonormal (orthogonal and unit length), and record each change in length as a separate scaling factor: $Av_i = \sigma_i u_i$.

As we will soon see, these scaling factors are not eigenvalues, so we will write them as sigmas rather than lambdas.

• Scaling factors $\sigma$ , analogous to eigenvalues $\lambda$.
• Diagonal matrix $\Sigma$ , analogous to diagonal matrix $\Lambda$

Our full picture then, looks like this:

Let us now translate this image into matrix language.

$A \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ v_1 & v_2 & \dots & v_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix} = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ \sigma_1u_1 & \sigma_2u_2 & \dots & \sigma_nu_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix}$

But we can easily factorize the right-hand side:

$\begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ \sigma_1u_1 & \sigma_2u_2 & \dots & \sigma_nu_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix} = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ u_1 & u_2 & \dots & u_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix} * \begin{bmatrix} \sigma_1 & 0 & \dots & 0 \\ 0 & \sigma_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_n \\ \end{bmatrix}$

So we have that:

$AV = U \Sigma$

Since both $V$ and $U$ are orthogonal, inversion of either is equivalent to transposition:

$A = U \Sigma V^{-1} = U \Sigma V^T$

This strongly resembles our diagonalization equation $A = Q \Lambda Q^T$. SVD distinguishes itself by considering two orthogonal eigenmatrices $U$ and $V$, not just one ($Q$).

Recalling that $A = U \Sigma V^T$ ,

$A^TA = (V \Sigma^T U^T) (U \Sigma V^T)$

But now the innermost term cancels. Since $\Sigma$ is a square diagonal matrix, its self-transposition is simply equal to $\Sigma^{2}$. So,

$A^TA = V \Sigma^2 V^T$

Since $A^{T}A$ is a square, symmetric matrix, our diagonalization theorem applies!

$A^TA = V \Sigma ^2 V^T = Q \Lambda Q^T$

To find $U$, a similar trick works:

$AA^T = (U \Sigma V^T)(V \Sigma^T U^T) = U \Sigma^2 U^T = Q \Lambda Q^T$

The relationships between SVD and eigendecomposition are as follows:

• $V$ is the eigenvectors of $A^TA$
• $U$ is the eigenvectors of $AA^T$
• $\Sigma$ is the square root of the eigenvalues matrix $\Lambda$

If any eigenvalue is negative, the corresponding sigma factor would be complex. But $A^TA$ and $AA^T$ are positive-semidefinite, which guarantees non-negative eigenvalues. This assures us that $\Sigma$ contains only real values.
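
To make these relationships concrete, the sketch below recovers the singular values of the running example from the eigenvalues of $A^TA$ and $AA^T$. It assumes a 2×2 symmetric input, so the characteristic polynomial can be solved with the quadratic formula; the function name is hypothetical.

```python
import math

def symmetric_2x2_eigenvalues(m):
    """Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]], largest first."""
    (a, b), (_, d) = m
    trace, det = a + d, a * d - b * b
    # For symmetric input the discriminant equals (a - d)^2 + 4b^2 >= 0,
    # so the eigenvalues are always real.
    disc = math.sqrt(trace * trace - 4 * det)
    return (trace + disc) / 2, (trace - disc) / 2

AtA = [[25, 7], [7, 25]]
AAt = [[32, 0], [0, 18]]

eig_v = symmetric_2x2_eigenvalues(AtA)
eig_u = symmetric_2x2_eigenvalues(AAt)
print(eig_v, eig_u)  # both products share the eigenvalues 32 and 18

# Singular values are the square roots of those (non-negative) eigenvalues.
singular_values = [math.sqrt(lam) for lam in eig_v]
print(singular_values)
```

Both self-transpositions yield the same non-zero eigenvalues, which is why a single $\Sigma$ serves both $U$ and $V$.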

In contrast to eigendecomposition, every matrix has an SVD. Geometrically, $V$ and $U$ act as rotations (possibly combined with a reflection), and $\Sigma$ acts as a scaling transformation. In other words, every linear transformation comprises a rotation, then a scaling, then another rotation.

A Worked Example

Let’s revisit $A$. Recall that:

$A = \begin{bmatrix} 4 & 4 \\ -3 & 3 \\ \end{bmatrix}, A^TA = \begin{bmatrix} 25 & 7 \\ 7 & 25 \\ \end{bmatrix}, AA^T = \begin{bmatrix} 32 & 0 \\ 0 & 18 \\ \end{bmatrix}$

Eigendecomposition against $A$ is unpleasant because $A$ is not symmetric. But $A^{T}A$ is guaranteed to be positive semi-definite; that is, to have non-negative eigenvalues. Let’s see this in action.

$det(A^TA - \lambda I) = \begin{vmatrix} 25 - \lambda & 7 \\ 7 & 25 - \lambda \\ \end{vmatrix} = 0$

$\lambda^2 - 50 \lambda + 576 = 0 \Rightarrow (\lambda - 32)(\lambda - 18) = 0$

$\lambda_1 = 32, \lambda_2 = 18$

$trace(A^TA) = 50 = \sum{\lambda_i}$

$det(A^TA) = 625-49 = 576 = 18 * 32 = \prod{\lambda_i}$

These are positive, real eigenvalues. Perfect! Let’s now derive the corresponding (normalized) eigenvectors.

$A^TA - 32I = \begin{bmatrix} -7 & 7 \\ 7 & -7 \\ \end{bmatrix}, A^TA - 18I = \begin{bmatrix} 7 & 7 \\ 7 & 7 \\ \end{bmatrix}$

$rref(A^TA - 32I) = \begin{bmatrix} 1 & -1 \\ 0 & 0 \\ \end{bmatrix}, rref(A^TA-18I) = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ \end{bmatrix}$

$v_1 = \begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \\ \end{bmatrix}, v_2 = \begin{bmatrix} \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \\ \end{bmatrix}$

Eigenvectors are only determined up to sign; we pick the sign of $v_2$ so that $Av_2$ points along the $u_2$ found below.

SVD intends to decompose $A$ into $U \Sigma V^{T}$. The above findings give us two of these ingredients.

$V^{T} = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{-1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \end{bmatrix}$

$\Sigma =\begin{bmatrix} \sqrt{32} & 0 \\ 0 & \sqrt{18} \\ \end{bmatrix}$

What’s missing? $U$! To find it, we perform eigendecomposition on $AA^{T}$. This is an especially easy task, because $AA^{T}$ is already a diagonal matrix.

$AA^T = \begin{bmatrix} 32 & 0 \\ 0 & 18 \\ \end{bmatrix}$

$U = \begin{bmatrix} u_1 & u_2 \\ \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix}$

We have arrived at our first Singular Value Decomposition.

$A = U \Sigma V^T = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix} \begin{bmatrix} \sqrt{32} & 0 \\ 0 & \sqrt{18} \\ \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{-1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \end{bmatrix}$

Okay, so let’s check our work. 😛

$A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix} \left( \begin{bmatrix} \sqrt{32} & 0 \\ 0 & \sqrt{18} \\ \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{-1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \end{bmatrix} \right) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix} \begin{bmatrix} 4 & 4 \\ -3 & 3 \\ \end{bmatrix} = \begin{bmatrix} 4 & 4 \\ -3 & 3 \\ \end{bmatrix}$

These matrices are a viable factorization: multiplication successfully recovers $A$.
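
The same check can be run numerically. This is a small sketch using plain Python lists and floating-point tolerance, taking $U$, $\Sigma$ and the $V^T$ derived above; the helper names (`matmul`, `transpose`, `close`) are assumptions for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def close(m1, m2, tol=1e-9):
    """Entry-wise comparison that tolerates floating-point rounding."""
    return all(abs(x - y) < tol for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))

s = 1 / math.sqrt(2)
A = [[4, 4], [-3, 3]]
U = [[1, 0], [0, 1]]
Sigma = [[math.sqrt(32), 0], [0, math.sqrt(18)]]
Vt = [[s, s], [-s, s]]  # rows are v1 and v2

# Multiplying the three factors back together should recover A.
reconstructed = matmul(U, matmul(Sigma, Vt))
print(close(reconstructed, A))  # True

# V is orthogonal: V^T V = I, so transposition really is inversion.
identity = [[1, 0], [0, 1]]
print(close(matmul(Vt, transpose(Vt)), identity))  # True

# The defining relation A V = U Sigma holds as well.
print(close(matmul(A, transpose(Vt)), matmul(U, Sigma)))  # True
```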

Takeaways

• Eigendecomposition only works for a subclass of matrices; SVD decomposes all matrices.
• SVD relies on self-transposition to convert any arbitrary matrix into one that works well against eigendecomposition (guarantees square $m = n$ and symmetric $A = A^{T}$).
• Another way to interpret SVD is by taking a special kind of orthogonal basis that, once passed through the linear transformation, preserves its orthogonality.
• Every matrix $A = U \Sigma V^{T}$. That is, every linear transformation can be conceived as rotation + scaling + rotation.

Until next time.

# Eigenvalues and Eigenvectors

Part Of: Algebra sequence
Followup To: An Introduction to Linear Algebra
Next Up: Singular Value Decomposition
Content Summary: 1300 words, 13 min read

Geometries of Eigenvectors

Matrices are functions that act on vectors, mapping vectors in the row space to vectors in the column space.  Consider two examples:

1. Reflection matrices, which reflect vectors across some axis.
2. Rotation matrices, which rotate vectors counterclockwise by $\theta$ degrees.

The set of eigenvectors of a matrix $A$ is a special set of input vectors for which the matrix behaves as a scaling transformation. In other words, we desire the set of vectors $\vec{x}$ whose output vectors $A\vec{x}$ differ by a scaling factor.

Eigenvectors have a straightforward geometric interpretation:

1. Reflection eigenvectors are orthogonal or parallel to the reflecting surface. In the left image above, that is the top two pairs of vectors.
2. Rotation eigenvectors do not exist (more formally, cannot be visualized in $\mathbb{R}^2$).

Algebra of Eigenvectors

We can express our “parallel output” property as:

$A\vec{x} = \lambda \vec{x}$

Thus $\vec{x}$ and $A\vec{x}$ point in the same direction, but differ by scaling factor $\lambda$.

Scaling factor $\lambda$ is the eigenvalue. There can be many $\left( x, \lambda \right)$ pairs that satisfy the above equality.

For an $n \times n$ matrix, there are $n$ eigenvalues (counted with multiplicity, and possibly complex). These eigenvalues can be difficult to find. However, two facts aid our search:

• The sum of eigenvalues equals the trace (sum of values along the diagonal).
• The product of eigenvalues equals the determinant.

To solve, subtract $\lambda \vec{x}$ from both sides:

$A\vec{x} = \lambda \vec{x}$

$(A - \lambda I)\vec{x} = 0$

We would like to identify $n$ independent eigenvectors. But if the new matrix $(A - \lambda I)$ has a trivial nullspace (only the zero vector), it yields no eigenvectors. So we need this new matrix to be singular.

How to accomplish this?  By finding eigenvalues that satisfy the characteristic equation $\det(A - \lambda I) = 0$. Matrices are singular iff their determinants equal zero.

Let’s work through an example! What is the eigendecomposition of matrix $A$?

$A = \begin{bmatrix} 3 & 1 \\ 1 & 3 \\ \end{bmatrix}$

We need to find eigenvalues that solve the characteristic equation.

$\det(A - \lambda I) = \begin{vmatrix} 3-\lambda & 1 \\ 1 & 3-\lambda \\ \end{vmatrix} = 0$

$(3 - \lambda)^2 - 1^2 = \lambda^2 -6\lambda + 8 = (\lambda-2)(\lambda-4) = 0$

$\lambda_1 = 2, \lambda_2 = 4$

Are these eigenvalues correct? Let’s check our work:

$trace(A) = 6 = \sum{\lambda_i}$

$det(A) = 8 = \prod{\lambda_i}$

How to find our eigenvectors? By solving the nullspace given each eigenvalue.

For $\lambda_1=2$ :

$A - 2I = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ \end{bmatrix} \Rightarrow rref(A - 2I) = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ \end{bmatrix}$

$(\lambda_1, \vec{x}_1) = (2, \begin{bmatrix} 1 \\ -1 \\ \end{bmatrix})$

For $\lambda_2=4$ :

$A - 4I = \begin{bmatrix} -1 & 1 \\ 1 & -1 \\ \end{bmatrix} \Rightarrow rref(A - 4I) = \begin{bmatrix} 1 & -1 \\ 0 & 0 \\ \end{bmatrix}$

$(\lambda_2, \vec{x}_2) = (4, \begin{bmatrix} 1 \\ 1 \\ \end{bmatrix})$
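
A direct way to confirm these pairs is to test the defining property $A\vec{x} = \lambda \vec{x}$. The sketch below does that with integer arithmetic; `matvec` and `is_eigenpair` are hypothetical helper names.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in m]

def is_eigenpair(m, lam, v):
    """True when m @ v equals lam * v, i.e. v is only scaled by m."""
    return matvec(m, v) == [lam * x for x in v]

A = [[3, 1], [1, 3]]
pairs = [(2, [1, -1]), (4, [1, 1])]

for lam, x in pairs:
    print(lam, x, matvec(A, x), is_eigenpair(A, lam, x))

# A vector that is not an eigenvector changes direction under A.
print(is_eigenpair(A, 2, [1, 0]))  # False
```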

Desirable Matrix Properties

The above example was fairly straightforward. But eigendecomposition can “go awry”, as we shall see. Consider a rotation matrix, which in two dimensions has the following form:

$R = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \\ \end{bmatrix}$

What are the eigenvalues for rotation $\theta = 90^{\circ}$ ?

$R = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ \end{bmatrix}$

$\det(R - \lambda I) = \begin{vmatrix} - \lambda & -1 \\ 1 & - \lambda \\ \end{vmatrix} = 0$

$(- \lambda)^2 - (-1)(1) = 0 \Rightarrow \lambda^2 = -1$

$\lambda_1 = i, \lambda_2 = -i$

We can check our work:

$trace(R) = 0 = \sum{\lambda_i}$

$det(R) = 1 = \prod{\lambda_i}$

We saw earlier that the eigenvectors of a rotation matrix have no geometric interpretation in $\mathbb{R}^2$. Here, we have shown algebraically that its eigenvalues are complex.
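
The contrast between the two matrices can be reproduced with a few lines of Python. This sketch assumes a 2×2 input and solves the characteristic polynomial $\lambda^2 - trace \cdot \lambda + det = 0$ with `cmath`, so that complex roots are returned rather than raising an error; the function name is an assumption.

```python
import cmath

def eigenvalues_2x2(m):
    """Roots of the characteristic polynomial of a 2x2 matrix (may be complex)."""
    (a, b), (c, d) = m
    trace, det = a + d, a * d - b * c
    root = cmath.sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2

R = [[0, -1], [1, 0]]   # rotation by 90 degrees
A = [[3, 1], [1, 3]]    # the symmetric example from earlier

r1, r2 = eigenvalues_2x2(R)
a1, a2 = eigenvalues_2x2(A)
print(r1, r2)  # purely imaginary pair: i and -i
print(a1, a2)  # real pair: 4 and 2 (imaginary parts are zero)
```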

$A = \left[ \begin{smallmatrix} 3 & 1 \\ 1 & 3 \\ \end{smallmatrix} \right]$ has real eigenvalues, but $R = \left[ \begin{smallmatrix} 0 & -1 \\ 1 & 0 \\ \end{smallmatrix} \right]$ has less-desirable complex eigenvalues.

We can generalize the distinction between $A$ and $R$ as follows:

Spectral Theorem. Any real matrix that is symmetric ($A = A^T$) is guaranteed to have real eigenvalues. Its $n$ eigenvectors can be chosen to be orthogonal.

In other words, eigendecomposition works best against symmetric matrices.

Diagonalization

Let us place each eigenvector in the column of a matrix $S$. What happens when you multiply the original matrix $A$ by this new matrix? Since $S$ contains eigenvectors, multiplication by $A$ reduces to multiplication by the associated eigenvalues:

$AS = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ \lambda_1x_1 & \lambda_2x_2 & \dots & \lambda_nx_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix}$

We see the product contains a mixture of eigenvalues and eigenvectors. We can separate these by “pulling out” the eigenvalues into a diagonal matrix. Call this matrix $\Lambda$ (“capital lambda”).

$AS = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ x_1 & x_2 & \dots & x_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix} * \begin{bmatrix} \lambda_1 & 0 & \dots & 0 \\ 0 & \lambda_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \lambda_n \\ \end{bmatrix} = S \Lambda$

Most matrices have linearly independent eigenvectors. For such matrices, $S$ is invertible. Given this fact, we can solve for $A$:

$\Lambda = S^{-1} A S$

$A = S \Lambda S^{-1}$

Matrices that can be factorized in this way are said to be diagonalizable. We can see that both elimination and eigendecomposition are performing the same type of work: factorizing matrices into their component parts.

If $A$ is symmetric, then we know $\Lambda$ is real, and its eigenvectors in $S$ are orthogonal. Let us rename $S$ to be $Q$, to reflect this additional property. But orthogonal matrices (with normalized columns) have the property that transposition equals inversion: $Q^T = Q^{-1}$. Thus, if $A$ is symmetric, we can simplify the diagonalization formula to:

$A = Q \Lambda Q^{-1} = Q \Lambda Q^T$

Asymptotic Interpretations

This diagonalization approach illustrates an important use case of eigenvectors: power matrices. What happens when $A$ is applied arbitrarily many times? What does the output look like in the limit?

We can use the diagonalization equation to represent $A^k$:

$A^k = \prod_{i=1}^{k} (Q \Lambda Q^{-1}) = (Q \Lambda Q^{-1})(Q \Lambda Q^{-1})\dots(Q \Lambda Q^{-1})$

We can simplify by canceling the inner terms $Q^{-1}Q$:

$A^k = Q \Lambda^k Q^{-1}$

This equation tells us that the eigenvectors are invariant to how many times $A$ is applied. In contrast, eigenvalue matrix $\Lambda$ has important implications for ongoing processes:

• If every eigenvalue has magnitude less than one, the output will trend towards zero.
• If any eigenvalue has magnitude greater than one, the output will generally trend to infinity.
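
Here is a hedged sketch of the powers idea, using the symmetric example $A = \left[ \begin{smallmatrix} 3 & 1 \\ 1 & 3 \\ \end{smallmatrix} \right]$ with $A = Q \Lambda Q^T$ from the diagonalization section. It compares $Q \Lambda^k Q^T$ against repeated multiplication; the function names and the rescaled eigenvalues `[0.4, 0.8]` are assumptions chosen for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def power_by_diagonalization(q, eigenvalues, k):
    """Compute Q Lambda^k Q^T; only the eigenvalues are raised to the k-th power."""
    n = len(eigenvalues)
    lam_k = [[eigenvalues[i] ** k if i == j else 0 for j in range(n)] for i in range(n)]
    return matmul(q, matmul(lam_k, transpose(q)))

s = 1 / math.sqrt(2)
Q = [[s, s], [-s, s]]  # columns: (1,-1)/sqrt(2) for lambda=2, (1,1)/sqrt(2) for lambda=4
A = [[3, 1], [1, 3]]

# Direct computation of A^5 by repeated multiplication.
direct = [[1, 0], [0, 1]]
for _ in range(5):
    direct = matmul(direct, A)

via_eigen = power_by_diagonalization(Q, [2, 4], 5)
print(direct)     # [[528, 496], [496, 528]]
print(via_eigen)  # same values, up to floating-point rounding

# Same eigenvectors, but eigenvalues inside the unit circle: powers decay to zero.
shrinking = power_by_diagonalization(Q, [0.4, 0.8], 50)
print(shrinking)
```

Only the diagonal entries change with $k$, which is what makes the long-run behavior easy to read off from the eigenvalues.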

Fibonacci Eigenvalues

The powers interpretation of eigenvalues sheds light on the behavior of all linear processes. This includes number sequences such as the Fibonacci numbers, where each number is the sum of the previous two numbers.

Recall the Fibonacci numbers are $0,1,1,2,3,5,8,13,...$ What is $F_{100}$ ?

Eigenvalues can answer this question. We must first express the Fibonacci generator as a linear equation:

$F(k+2) = 1F(k+1) + 1F(k)$

In order to translate this into a meaningful matrix, we must add a “redundant” equation

$F(k+1) = 1F(k+1) + 0F(k)$

With these equations, we can create a 2×2 Fibonacci matrix $F$.

$F = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ \end{bmatrix}$

This matrix uniquely generates Fibonacci numbers.

$u_1 = Fu_0 = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ \end{bmatrix}$

$u_4 = F^4u_0 = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ \end{bmatrix}^4 \begin{bmatrix} 1 \\ 0 \\ \end{bmatrix} = \begin{bmatrix} 5 \\ 3 \\ \end{bmatrix}$

To discover the rate at which Fibonacci numbers grow, we decompose $F$ into its eigenvalues:

$\det(F - \lambda I) = \begin{vmatrix} 1 - \lambda & 1 \\ 1 & - \lambda \\ \end{vmatrix} = 0$

$\lambda^2 - \lambda - 1 = 0$

$\lambda_1 = \frac{1 + \sqrt{5}}{2}, \lambda_2 = \frac{1 - \sqrt{5}}{2}$

$\lambda_1 = 1.61803, \lambda_2 = -0.61803$

$trace(F) = 1 = \sum{\lambda_i}$

$det(F) = -1 = \prod{\lambda_i}$

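We can go on to discover eigenvectors $x_1$ and $x_2$. Writing the starting vector as a combination of them, $u_0 = c_1 x_1 + c_2 x_2$, each application of the Fibonacci matrix $F$ scales the components by the eigenvalues:

$u_1 = Fu_0 = c_1 \lambda_1 x_1 + c_2 \lambda_2 x_2$

$u_k = F^k u_0 = c_1 \lambda_1^k x_1 + c_2 \lambda_2^k x_2$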

As k goes to infinity, the second term goes to zero. Thus, the ratio is dominated by the larger eigenvalue, 1.61803.

Mathematicians in the audience will recognize this number as the golden ratio.

We have long known that the ratio of successive Fibonacci numbers converges to 1.61803. Eigenvalues provide a mechanism to derive this value analytically.
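
To answer the $F_{100}$ question concretely, the following sketch raises the Fibonacci matrix to a power with exact integer arithmetic and compares the ratio of successive terms with the larger eigenvalue. The helper names and the choice of repeated squaring are assumptions for illustration, not the only way to do this.

```python
import math

def mat_mul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mat_pow(m, k):
    """Raise a square matrix to the k-th power by repeated squaring."""
    result = [[int(i == j) for j in range(len(m))] for i in range(len(m))]
    while k > 0:
        if k & 1:
            result = mat_mul(result, m)
        m = mat_mul(m, m)
        k >>= 1
    return result

F = [[1, 1], [1, 0]]

def fib(n):
    # F^n = [[F(n+1), F(n)], [F(n), F(n-1)]], so the top-right entry is F(n).
    return mat_pow(F, n)[0][1]

print([fib(i) for i in range(8)])  # [0, 1, 1, 2, 3, 5, 8, 13]
print(fib(100))                    # 354224848179261915075

golden_ratio = (1 + math.sqrt(5)) / 2  # the larger eigenvalue of F
ratio = fib(41) / fib(40)
print(ratio, golden_ratio)
```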

Until next time.

# Algorithmic Dark Matter: Duality in Linear Programming

Part Of: Optimization sequence
Content Summary: 800 words, 8 min read.

Today, I introduce the concept of duality. Buckle your seat belts! 🙂

Max Flow algorithm

The problem of flow was originally studied in the context of the USSR railway system in the 1930s. The Russian mathematician A.N. Tolstoy published his Methods of finding the minimal total kilometrage in cargo-transportation planning in space, where he formalized the problem as follows.

We interpret transportation graphically. Vertices are interpreted as cities, and edges are railroads connecting two cities. The capacity of each edge was the amount of goods that particular railroad could transport in a given day. The bottleneck is solely the capacities, and not production and consumption. We assume no available storage at the intermediate cities.

The flow allocation problem defines source and termination vertices, s and t. We desire to maximize volume of goods transported s → t. To do this, we label each edge with the amount of goods we intend to ship on that railroad. This quantity, which we will call flow, must respect the following properties:

• Flow cannot exceed the capacity of the railroad.
• Flow is conserved: stuff leaving a city must equal the amount of stuff arriving.

Here are two possible solutions to flow allocation:

Solution B improves on A by pushing volume onto the b → c railway. But are there better solutions?

To answer rigorously, we formalize max flow as a linear optimization problem:

The solution to the LP tells us that no, eight is the maximum possible flow.

Min Cut algorithm

Consider another, seemingly unrelated, problem we might wish to solve: separability. Let X ⊂ E represent a set of edges whose removal eliminates every connection s → t. Here are two such solutions:

Can we do better than B? The answer is no: { (b,t), (c,d) } is the minimum cut possible in the current graph. In linear programming terms, it is the optimal solution.

Note that the optimal values of minimum cut and max flow coincide: 8 = 8. This is not a coincidence. In fact, these problems are intimately related to one another. What the min cut algorithm is searching for is the bottleneck: the smallest section of the “pipe” from s → t. For complex graphs like this, it is not trivial to derive this answer visually; but the separability algorithm does the work for us.

The deep symmetry between max flow and min cut demonstrates an important mathematical fact. All algorithms come in pairs. For this example, we will call max flow and min cut the primal and dual problems, respectively. We will explore the ramifications of this another time. For now, let’s approach duality from an algebraic perspective.

Finding LP Upper Bound

Consider a linear program with the following objective function:

$\max (2x_1 + x_2)$

And these constraints

$4x_1 + x_2 \leq 6$

$3x_1 + 2x_2 \leq 5$

$x_1, x_2 \geq 0$

This program wants to find the largest solution possible given constraints. Can we provide an upper bound on the solution?

Yes. We can immediately say that the solution is no greater than 6. Why? The objective function, $2x_1 + x_2$, is never larger than $4x_1 + x_2$, because all variables are non-negative. So we have an upper bound OPT ≤ 6. We can sharpen this upper bound by comparing the objective function to other linear combinations of the constraints.

Different weights to our linear combinations produce different upper bounds:

• (1,0) → 6
• (0,1) → 5
• (⅓, ⅓ ) → 3.67

Let us call these two weights $(y_1, y_2)$. What values of these variables give us the smallest upper bound? Importantly, this is itself an objective function: $\min (6y_1 + 5y_2)$.

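But $y_1$ and $y_2$ are constrained: they must be non-negative (so the inequalities keep their direction), and the weighted combination of the constraint left-hand sides, call them $(a)$ and $(b)$, must be at least as large as $2x_1 + x_2$. Thus,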

$y_1(a) + y_2(b) \geq 2x_1 + x_2$

$y_1 \left( 4x_1 + x_2 \right) + y_2 \left( 3x_1 + 2x_2 \right) \geq 2x_1 + x_2$

$\left(4y_1 + 3y_2 \right) x_1 + \left (y_1 + 2y_2 \right) x_2 \geq 2x_1 + x_2$

$(4y_1 + 3y_2) \geq 2$ and $(y_1 + 2y_2) \geq 1$

This gives us our two constraints. Thus, by looking for the lowest upper bound on our primal LP, we have derived our dual LP:

Note the extraordinary symmetry between primal and dual LPs. The purple & orange values are mirror images of one another. Further, the constraint coefficient matrix has transposed (the 3 has swapped along the diagonal). This symmetry is reflected in the above linear algebra formulae.
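
Both programs are small enough to solve by brute force. The sketch below enumerates the corner points of each feasible region with exact fractions and compares the two optimal values, taking the second primal constraint as $3x_1 + 2x_2 \leq 5$, the form used in the derivation above. Vertex enumeration is an assumption made for illustration (it only works for tiny, bounded two-variable programs), and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def vertices(constraints):
    """Intersect every pair of boundary lines a*x + b*y = c (Cramer's rule)."""
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel lines never meet
        yield Fraction(c1 * b2 - c2 * b1, det), Fraction(a1 * c2 - a2 * c1, det)

def solve_2d(objective, constraints, maximize):
    """Best corner point of {(x, y) : a*x + b*y <= c for every (a, b, c)}."""
    def value(p):
        return objective[0] * p[0] + objective[1] * p[1]
    feasible = [p for p in vertices(constraints)
                if all(a * p[0] + b * p[1] <= c for a, b, c in constraints)]
    best = max(feasible, key=value) if maximize else min(feasible, key=value)
    return value(best), best

# Primal: max 2x1 + x2  s.t.  4x1 + x2 <= 6, 3x1 + 2x2 <= 5, x1, x2 >= 0
primal = [(4, 1, 6), (3, 2, 5), (-1, 0, 0), (0, -1, 0)]
primal_value, primal_point = solve_2d((2, 1), primal, maximize=True)

# Dual: min 6y1 + 5y2  s.t.  4y1 + 3y2 >= 2, y1 + 2y2 >= 1, y1, y2 >= 0
# (each ">=" row is negated so it fits the "<=" form used above)
dual = [(-4, -3, -2), (-1, -2, -1), (-1, 0, 0), (0, -1, 0)]
dual_value, dual_point = solve_2d((6, 5), dual, maximize=False)

print(primal_value, primal_point)  # 16/5 at (7/5, 2/5)
print(dual_value, dual_point)      # 16/5 at (1/5, 2/5)
```

The best upper bound found by the dual, 16/5 = 3.2, is tighter than the 3.67 obtained from the weights (⅓, ⅓), and it matches the primal optimum exactly.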

A Theory of Duality

Recall that linear programs have three possible outcomes: infeasible (no solution exists), unbounded (solution exists at +/-∞) or feasible/optimal. Since constraints are nothing more than geometric half-spaces, these possible outcomes reflect three kinds of polyhedra:

The outcome of primal and dual programs are predictably correlated. Of the nine potential pairings, only four can actually occur:

1. Both P and D are infeasible
2. P is unbounded and D is infeasible
3. D is unbounded and P is infeasible
4. Both are feasible, and there exist optimal solutions

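Finally, in the above examples, we saw that the optimal primal and dual values coincide: $p^* = d^*$ $(8=8)$. For linear programs with optimal solutions this is always the case. But in more general optimization problems it can fail: with the primal posed as a minimization, the optimal dual value can be strictly smaller.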

We can distinguish between two kinds of duality:

• Strong duality, where $p^* = d^*$
• Weak duality, where $p^* - d^* \geq 0$

Takeaways

Today, we have illustrated a deep mathematical fact: all problems come in pairs. Next time, we will explore the profound ramifications of this duality principle.

Related Resources: CMU Lecture 5: LP Duality

# Entropy as Belief Uncertainty

Part Of: Information Theory sequence
Content Summary: 900 words, 9 min read

Motivations

What do probabilities mean?

A frequentist believes that they represent frequencies. P(snow) = 10% means that on 100 days just like this one, about 10 of them will have snow.

A Bayesian, on the other hand, views probability as degree of belief. P(snow) = 10% means that you believe there is a 10% chance it will snow today.

This subjective approach views reasoning as probability (degree of belief) spread over possibility. On this view, Bayes’ Theorem provides a complete theory of inference:

$P(H|E) = \frac{P(E|H) P(H)}{P(E)}$

From this equation, we see how information updates our belief probabilities. Bayesian updating describes this transition from prior to posterior, P(H) → P(H|E).

As evidence accumulates, one’s “belief distributions” tend to become sharply peaked. Here, we see degree of belief in a hockey goalie’s skill, as we observe him play. (Image credit Greater Than Plus Minus):

What does it mean for a distribution to be uncertain? We would like to say that our certainty grows as the distribution sharpens. Unfortunately, probability theory provides no language to quantify this intuition.

This is where information theory comes to the rescue. In 1948 Claude Shannon discovered a unique, unambiguous way to measure probabilistic uncertainty.

What is this function? And how did he discover it? Let’s find out.

Desiderata For An Uncertainty Measure

We desire some quantity H(p) which measures the uncertainty of a distribution.

To derive H, we must specify its desiderata, or what we want it to do. This task may feel daunting. But in fact, very simple conditions already determine H to within a constant factor.

We require H to meet the following conditions:

1. Continuous. H(p) is a continuous function.
2. Monotonic. H(p) for an equiprobable distribution (that is, A(n) = H(1/n, ..., 1/n)) is a monotonically increasing function of n.
3. Compositionally Invariant. If we reorganize X by bundling individual outcomes into single variables (b: X → W), H is unchanged, H(X) = H(W).

Let’s explore compositional invariance in more detail.

Deriving H

Let us consider some variable $X$ that can assume discrete values $(x_1, ..., x_n)$. Our partial understanding of the processes which determine $X$ are the probabilities $(p_1, ..., p_n)$. We would like to find some $H(p_1, ..., p_n)$, which measures the uncertainty of this distribution.

Suppose $X$ has three possible outcomes. We can derive $W$ by combining events $x_2$ and $x_3$.

The uncertainty of $X$ must be invariant to such bundling. So we have that:

The right tree has two distributions $p(W)$ and $p(X|W)$. The uncertainty of two distributions is the sum of each individual uncertainty. Thus we add H(⅔, ⅓). But this distribution is reached only ½ of the time, so we multiply by 0.5.
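
The bundling rule can be checked numerically. The sketch below assumes the three outcomes have probabilities (½, ⅓, ⅙), which is what the ½ weight and the H(⅔, ⅓) term above imply, and it uses the entropy formula derived later in this post (with $K = 1$ and base-2 logarithms) as the candidate $H$.

```python
import math

def entropy(probs):
    """Candidate uncertainty measure H in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Left tree: choose directly among the three outcomes.
direct = entropy([1/2, 1/3, 1/6])

# Right tree: first choose between x1 and the bundle {x2, x3},
# then (half of the time) choose within the bundle.
grouped = entropy([1/2, 1/2]) + 0.5 * entropy([2/3, 1/3])

print(direct, grouped)
print(math.isclose(direct, grouped))  # True
```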

How does composition affect equiprobable distributions $A(n)$? Consider a new $X$ with 12 possible outcomes, each equally likely to occur. The uncertainty $H(X) = A(12)$, by definition. Suppose we choose to bundle these branches by $(3,5,4)$. Then we have:

But suppose we choose a different bundling function $(4,4,4)$. This simplifies things:

For what function $A$ does $A(mn) = A(m) + A(n)$ hold? There is only one solution, as shown in Shannon’s paper:

$A(X) = K \log(X)$

$K$ varies with logarithmic base (bits, trits, nats, etc). With this solution we can derive a general formula for entropy $H$.

Recall,

$X = (x_1, ..., x_n), P(X) = (p_1, ..., p_n)$

$A(X) = K \log(X)$ ← Found by uniform bundling (eg., $4,4,4$)

$A(\sum{n_i}) = H(X) + \sum\limits_{i} \left( \frac{n_i}{\sum{n_i}} \right) A(n_i)$ ← Found by arbitrary bundling (eg., $3,5,4$), where $p_i = \frac{n_i}{\sum{n_i}}$

Hence,

$K \log(\sum{n_i}) = H(X) + K \sum{p_i \log(n_i)}$

$H(X) = K \left[ \sum{p_i \log(\sum{n_i})} - \sum{p_i \log(n_i)} \right]$

$H(X) = -K \sum{p_i \log\left(\frac{n_i}{\sum{n_i}} \right)}$

We have arrived at our definition of uncertainty, the entropy H(X):

$H(X) = -K \sum{p_i \log(p_i)}$

To illustrate, consider a coin with bias p.  Our uncertainty is maximized for a fair coin, p = 0.5, and smallest at p = 0.0 (certain tails) or 1.0 (certain heads).
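
The coin example is easy to tabulate. This is a minimal sketch with $K = 1$ and base-2 logarithms (so the unit is bits); the function name and the sampled bias values are assumptions.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin that lands heads with probability p."""
    # Outcomes with probability 0 are skipped, following the convention 0 * log(0) = 0.
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

for bias in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(bias, round(binary_entropy(bias), 3))

# The fair coin is the most uncertain; a certain outcome carries no uncertainty.
grid = [i / 100 for i in range(101)]
most_uncertain = max(grid, key=binary_entropy)
print(most_uncertain)  # 0.5
```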

Entropy vs Information

What is the relationship between uncertainty and information? To answer this, we must first understand information.

Consider the number of possible sentences in a book. Is this information? Two books contain exponentially more possible sentences than one book.

When we speak of information, we desire it to scale linearly with its length. Two books should contain approximately twice as much information.

If we take the logarithm of the possible messages $W$, we can preserve this intuition:

$I(X) = K \log(W) = -K \log P(X)$

Here each of the $W$ messages is taken to be equally likely, so $P(X) = 1/W$.

Recall that,

$H(X) = -K \sum{P_i(X) \log P_i(X)}$

From here, we can show that entropy is expected information:

$H(X) = \sum{P_i(X) \left[ -K \log P_i(X) \right]} = \sum{P_i(X) I_i(X)}$

$H = E\langle I \rangle$

What does this discovery mean, though?

Imagine a device that produces 3 symbols, A, B, or C. As we wait for the next symbol, we are uncertain which symbol comes next. Once a symbol appears our uncertainty decreases, because we have received more information. Information is a decrease in entropy.

If A, B, and C occur at the same frequency, we should not be surprised to see any one letter. But if P(A) approaches 0, then we will be very surprised to see it appear, and the formula says I(X) approaches ∞. For the receiver of a message, information represents surprisal.

On this interpretation, the above formula becomes clear. Uncertainty is anticipated surprise. If our knowledge is incomplete, we expect surprise. But confident knowledge is “surprised by surprise”.
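
The three-symbol device makes a compact example of entropy as expected surprise. In the sketch below, surprisal is measured in bits ($K = 1$, base 2), and the skewed probabilities are hypothetical values chosen so that A is rare.

```python
import math

def surprisal(p):
    """Information, in bits, received when an event of probability p occurs."""
    return -math.log2(p)

def entropy(probs):
    """Expected surprisal of a distribution."""
    return sum(p * surprisal(p) for p in probs if p > 0)

uniform = {"A": 1/3, "B": 1/3, "C": 1/3}
skewed = {"A": 0.01, "B": 0.495, "C": 0.495}  # hypothetical: A is rare

for name, dist in (("uniform", uniform), ("skewed", skewed)):
    per_symbol = {symbol: round(surprisal(p), 2) for symbol, p in dist.items()}
    print(name, per_symbol, round(entropy(dist.values()), 3))
```

Seeing the rare symbol A is far more surprising in the skewed device, yet its overall entropy is lower, because that surprise is seldom realized.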

Conclusions

The great contribution of information theory lies in a measure for probabilistic uncertainty.

We desire this measure to be continuous, monotonic, and compositionally invariant. There is only one such function, the entropy H:

$H(X) = -K \sum{p_i \log(p_i)}$

This explains why a broad distribution is more uncertain than one that is narrow.

Henceforth, we will view the words “entropy” and “uncertainty” as synonymous.

Related Works

• Shannon (1948). A Mathematical Theory of Communication
• Jaynes (1957). Information Theory and Statistical Mechanics
• Schneider (1995). Information theory primer

# Markov Decision Processes

Part Of: Reinforcement Learning sequence
Followup To: An Introduction To Markov Chains
Content Summary: 900 words, 9 min read

Motivations

Today, we turn our gaze to Markov Decision Processes (MDPs), a decision-making environment which supports our propensity to learn from good and bad outcomes. We represent outcome desirability with a single number, R. This value is used to refine action selection: given a particular situation, what action will maximize expected reward?

In biology, the primary work performed by an organism is maintaining homeostasis: keeping metabolic energy reserves, body temperature, etc. stable in a widely varying world.

Cybernetics provides a clear way of conceptualizing biological reward. In Neuroendocrine Integration, we discussed how brains must respond both to internal and external changes. This dichotomy expresses itself as two perception-action loops: a visceral body-oriented loop, and a cognitive world-centered one.

Rewards are computed by the visceral loop. To a first approximation, rewards encode progress towards homeostasis. Food is perceived as more rewarding when the body is hungry; this is known as alliesthesia. Reward information is delivered to the cognitive loop, which helps refine its decision making.

Extending Markov Chains

Recall that a Markov Chain contains a set of states S, and a transition model P. A Markov Decision Process (MDP) extends this device, by adding three new elements.

Specifically, an MDP is a 5-tuple (S, P, A, R, γ):

• A set of states s ∈ S
• A transition model $P_a(s' | s)$
• A set of actions a ∈ A
• A reward function R(s, s’)
• A discount factor γ

To illustrate, consider GridWorld. In this example, every location in this two-dimensional grid is a state, for example (1,0). State (3,0) is a desirable location: R(s(3,0)) = +1.0, but state (3,1) is undesirable, R(s(3,1)) = -1.0. All other states are neutral.

Gridworld supports four actions, or movements: up, down, left, and right. However, locomotion is imperfect: if Up is selected, the agent will only move up with 80% probability; the remaining 20% of the time it will slip left or right instead (10% each). Finally, attempting to move into a forbidden square will simply return the agent to its original location (“hitting the wall”).
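To make the noisy movement rule concrete, here is a minimal sketch of a GridWorld transition model. Several details are assumptions rather than facts from the figure: a 4x3 grid, a y-axis that grows downward (so Up decreases y), a single forbidden square at (1,1), and sideways slips that are always perpendicular to the chosen action. The names `transition` and `step` are hypothetical.

```python
# Assumed layout: 4 columns x 3 rows, y grows downward, one forbidden square.
WIDTH, HEIGHT = 4, 3
FORBIDDEN = {(1, 1)}  # hypothetical wall location

MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
# Assumption: the agent slips perpendicular to the direction it chose.
SIDEWAYS = {
    "up": ("left", "right"),
    "down": ("left", "right"),
    "left": ("up", "down"),
    "right": ("up", "down"),
}

def step(state, direction):
    """Move deterministically; bounce back when hitting a wall or the edge."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    in_bounds = 0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT
    return nxt if in_bounds and nxt not in FORBIDDEN else state

def transition(state, action):
    """Return {next_state: probability} for the 80/10/10 movement rule."""
    probs = {}
    slip_a, slip_b = SIDEWAYS[action]
    for direction, p in ((action, 0.8), (slip_a, 0.1), (slip_b, 0.1)):
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

dist = transition((2, 1), "up")
```

Under these assumptions, `dist` spreads its probability over three outcomes: moving up, bouncing off the forbidden square, and slipping right into (3,1).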

The core problem of MDPs is to find a policy (π), a function that specifies the agent’s response to all possible states. In general, policies should strive to maximize reward, e.g., something like this:

Why is the policy at (2,2) Left instead of Up? Because (2,1) is dangerous: despite selecting Up, there is a 10% chance that the agent will accidentally move Right, and be punished.

Let’s now consider an environment with only three states A, B, and C.  First, notice how different policies change the resultant Markov Chain:

This observation is important. Policy determines the transition model.
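One way to see this in code: if each action has its own transition model, then a policy simply picks one row per state. The sketch below is illustrative only. The action names `advance` and `wait` and most of the probabilities are hypothetical; only the 0.2 / 0.8 split out of state A echoes numbers used later in this post.

```python
STATES = ["A", "B", "C"]

# Hypothetical per-action transition models P_a(s' | s).
P = {
    "advance": {
        "A": {"A": 0.2, "B": 0.8},
        "B": {"B": 0.2, "C": 0.8},
        "C": {"C": 1.0},
    },
    "wait": {
        "A": {"A": 1.0},
        "B": {"B": 1.0},
        "C": {"C": 1.0},
    },
}

def induced_chain(policy):
    """Build the Markov Chain whose row for state s is P_{policy[s]}(. | s)."""
    return [[P[policy[s]][s].get(t, 0.0) for t in STATES] for s in STATES]

pi_1 = {"A": "advance", "B": "advance", "C": "wait"}
pi_2 = {"A": "wait", "B": "advance", "C": "wait"}

chain_1 = induced_chain(pi_1)
chain_2 = induced_chain(pi_2)
```

The two policies share the same environment, yet `chain_1` and `chain_2` differ in their first row: under `pi_2` the agent never leaves A.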

Towards Policy Valuation V(s)

An agent seeks to maximize reward. But what does that mean, exactly?

Imagine an agent selects 𝝅1. Given the resultant Markov Chain, we already know how to use matrix multiplication to predict future locations $S_t$. The predicted reward $P_t$ is simply the dot product of expected location and the reward function.

$P_t = S_t \cdot R$
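Here is a small sketch of that computation in plain Python. The chain itself is an assumption: the 0.2 / 0.8 split out of A comes from the worked example later in this post, while B's self-loop, the absorbing state C, and the reward vector (only C is rewarding) are inferred so that the resulting numbers line up with that example.

```python
# Assumed chain: rows are the current state (A, B, C), columns the next state.
P = [
    [0.2, 0.8, 0.0],
    [0.0, 0.2, 0.8],
    [0.0, 0.0, 1.0],
]
R = [0.0, 0.0, 1.0]  # assumed reward function: only state C pays out

def predicted_rewards(start, P, R, steps):
    """Return [P_0, ..., P_{steps-1}], where P_t = S_t . R."""
    n = len(start)
    S = list(start)
    rewards = []
    for _ in range(steps):
        rewards.append(sum(s * r for s, r in zip(S, R)))
        # Advance the location distribution one step: S_{t+1} = S_t P
        S = [sum(S[i] * P[i][j] for i in range(n)) for j in range(n)]
    return rewards

# Start in state A with certainty.
rewards = predicted_rewards([1.0, 0.0, 0.0], P, R, 4)
```

With this chain, the first two predicted rewards are zero, because C cannot be reached from A in fewer than two steps.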

We might be tempted to define the value function V(S) as the sum of all predicted future rewards:

$V_0(S) = P_0 + P_1 + P_2 + P_3 + \dots = \sum_{k=0}^{\infty} P_k$

However, this approach is flawed.  Animals value temporal proximity: all else equal, we prefer to obtain rewards quickly. This is temporal discounting: as rewards are further removed from the present, their value is discounted.

In reinforcement learning, we implement temporal discounting with the gamma parameter: rewards that are k timesteps away are multiplied by the exponential discount factor $\gamma^k$. The value function becomes:

$V_0(S) = P_0 + \gamma P_1 + \gamma^2 P_2 + \gamma^3 P_3 + \dots = \sum_{k=0}^{\infty} \gamma^k P_k$

Without temporal discounting, V(s) can approach infinity. But exponential discounting (with $\gamma < 1$ and bounded rewards) ensures V(s) equals a finite value. Finite valuations promote easier computation and comparison of state evaluations. For more on temporal discounting, and an alternative to the RL approach, see An Introduction to Hyperbolic Discounting.
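To see why the discounted sum stays finite, suppose every predicted reward is bounded in size by some constant $R_{max}$ (a symbol introduced here for illustration) and that $0 \le \gamma < 1$. Then the geometric series gives a quick bound:

$|V_0(S)| \le \sum_{k=0}^{\infty} \gamma^k |P_k| \le R_{max} \sum_{k=0}^{\infty} \gamma^k = \frac{R_{max}}{1 - \gamma}$

For instance, with $R_{max} = 1$ and $\gamma = 0.9$, no state can be worth more than 10.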

Intertemporal Consistency

In our example, at time zero our agent starts in state A. We have already used linear algebra to compute our $P_k$ predictions. To calculate value, we simply compute $\sum_k \gamma^k P_k$:

$V_0(A) = 0 + 0 + 0.64 \gamma^2 + 0.896 \gamma^3 + \dots$

Agents compute V(s) at every time step. At t=1, two valuations are relevant:

$V_1(A) = 0 + 0 + 0.64 \gamma^2 + \dots$

$V_1(B) = 0 + 0.8 \gamma + 0.96 \gamma^2 + \dots$

What is the relationship between the value functions at t=0 and t=1? To answer this, we need to multiply each term by $\gamma P(X|A)$, where $X$ is the state being considered at the next time step.

$W_1(A) \triangleq \gamma 0.2 V_1(A)$

$W_1(A) = 0 + 0 + (0.2)(0.64)\gamma^3 + \dots$

Similarly,

$W_1(B) \triangleq \gamma P(B|A)V_1(B) = \gamma 0.8 V_1(B)$

$W_1(B) = 0 + (0.8)(0.8) \gamma^2 + (0.8)(0.96) \gamma^3 + \dots$

Critically, consider the sum $X_0 = r_0(A) + W_1(A) + W_1(B)$:

$X_0 = 0 + 0 + 0.64 \gamma^2 + 0.896 \gamma^3 + \dots$

Does $X_0$ look familiar? That’s because it equals $V_0(A)$! In this way, we have a way of equating a valuation at t=0 and t=1. This property is known as intertemporal consistency.
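The bookkeeping above is easy to check by treating each valuation as a list of coefficients, where entry k multiplies $\gamma^k$. This is only a sketch: the lists are truncated to the terms quoted in the text, and the helper name `shift_and_scale` is made up for this example.

```python
def shift_and_scale(coeffs, prob):
    """Coefficients of gamma * prob * V, given V's coefficients in gamma."""
    # Multiplying by gamma shifts every coefficient one power higher.
    return [0.0] + [prob * c for c in coeffs]

v1_a = [0.0, 0.0, 0.64]  # V_1(A), up to gamma^2
v1_b = [0.0, 0.8, 0.96]  # V_1(B), up to gamma^2

w1_a = shift_and_scale(v1_a, 0.2)  # P(A|A) = 0.2
w1_b = shift_and_scale(v1_b, 0.8)  # P(B|A) = 0.8

r0_a = 0.0  # immediate reward in state A
x0 = [a + b for a, b in zip(w1_a, w1_b)]
x0[0] += r0_a

v0_a = [0.0, 0.0, 0.64, 0.896]  # V_0(A), up to gamma^3
matches = all(abs(x - v) < 1e-12 for x, v in zip(x0, v0_a))
```

Here `matches` confirms that the two sides agree term by term, up to $\gamma^3$.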

Bellman Equation

We have seen that $V_0(A) = X_0$. Let’s flesh out this equation, and generalize to time t.

$V_t(s) = r_t(s) + \gamma \sum_{s'} P(s'|s) V_{t+1}(s')$

This is the Bellman Equation, and it is a central fixture in control systems. At its heart, we define value in terms of both immediate reward and future predicted value. We thereby break up a complex problem into small subproblems, a key optimization technique that can be approached with dynamic programming.
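As a rough illustration of the dynamic programming idea, the sketch below repeatedly applies the Bellman Equation until the values stop changing. It makes several assumptions: the value function is treated as stationary (so $V_t$ and $V_{t+1}$ are the same function), the three-state chain and rewards are inferred from the numbers in this post, and $\gamma = 0.9$ is an arbitrary choice. The function name `evaluate_policy` is hypothetical.

```python
# Assumed chain under a fixed policy: rows are the current state (A, B, C).
P = [
    [0.2, 0.8, 0.0],
    [0.0, 0.2, 0.8],
    [0.0, 0.0, 1.0],
]
R = [0.0, 0.0, 1.0]  # assumed reward per state
GAMMA = 0.9          # arbitrary discount factor for this example

def evaluate_policy(P, R, gamma, tol=1e-12, max_iters=100_000):
    """Iterate V <- R + gamma * P V until successive values agree within tol."""
    n = len(R)
    V = [0.0] * n
    for _ in range(max_iters):
        new_V = [
            R[s] + gamma * sum(P[s][t] * V[t] for t in range(n))
            for s in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_V, V)) < tol:
            return new_V
        V = new_V
    return V

V = evaluate_policy(P, R, GAMMA)
```

Each sweep only looks one step ahead, yet the fixed point it converges to agrees with the infinite discounted sum. That is the subproblem decomposition at work.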

Next time, we will explore how reinforcement learning uses the Bellman Equation to learn strategies with which to engage its environment (the optimal policy 𝝅). See you then!

# Getting Real With Continued Fractions

Content Summary: 600 words, 6 min read

And now, an unprovoked foray into number theory!

Simple Continued Fractions (SCFs)

Have you run into simple continued fractions in your mathematical adventures? They look like this:

Let $A$ represent the coefficients $(a_0, a_1, a_2, a_3, ...)$ and $B = ( b_1, b_2, b_3, ...)$. If you fix $B = (1, 1, 1, ...)$ you can uniquely represent $n$ with $A(n)$. For example:

$n = \frac{415}{93} = 4+\frac{1}{2+\frac{1}{6+\frac{1}{7}}}$

$A(n) = (4,2,6,7)$

Let us call $A(n)$ the leading coefficients of n. Here we have represented the rational $\frac{415}{93}$ with four coefficients. It turns out that every rational number can be expressed with a finite number of leading coefficients.
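For rationals, the leading coefficients fall out of repeated division with remainder (the Euclidean algorithm). A minimal sketch using Python's standard `fractions` module follows. The function name `leading_coefficients` is my own, and the sketch assumes a non-negative input.

```python
from fractions import Fraction

def leading_coefficients(x):
    """Return the simple continued fraction coefficients of a rational x >= 0."""
    x = Fraction(x)
    coeffs = []
    while True:
        whole = x.numerator // x.denominator  # integer part
        coeffs.append(whole)
        x -= whole
        if x == 0:
            return coeffs
        x = 1 / x  # flip the remaining fractional part and repeat

coeffs = leading_coefficients(Fraction(415, 93))
```

The loop must terminate for any rational input, because each flip produces a fraction with a strictly smaller denominator.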

Irrational Numbers

Life gets interesting when you look at the leading coefficients of irrational numbers. Consider the following:

$A(\phi) = (1, 1, 1, 1, 1, 1, 1, ...)$

$A(\sqrt{19}) = (4, 2, 1, 3, 1, 2, 8, 2, 1, 3, 1, 2, 8, ...)$

$A(e) = (2, 1, 2, 1, 1, 4, 1, 1, 6, 1, 1, 8, ...)$

$A(\pi) = (3, 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, ...)$

First note that these irrational numbers have an infinite number of leading coefficients.

What do you notice about $A(\phi)$? It repeats, of course! What is the repeating sequence for $A(\sqrt{19})$? The sequence $213128$.
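The repeating block for a square root can be generated with integer arithmetic alone, using a standard recurrence for quadratic irrationals. The sketch below is one way to do it. The name `sqrt_coefficients` is hypothetical, and the function assumes `n` is a positive integer that is not a perfect square.

```python
from math import isqrt

def sqrt_coefficients(n, count):
    """First `count` leading coefficients of sqrt(n), n not a perfect square."""
    a0 = isqrt(n)
    if a0 * a0 == n:
        raise ValueError("n must not be a perfect square")
    # Invariant: the current remainder equals (sqrt(n) + m) / d.
    m, d, a = 0, 1, a0
    coeffs = [a0]
    while len(coeffs) < count:
        m = d * a - m
        d = (n - m * m) // d
        a = (a0 + m) // d
        coeffs.append(a)
    return coeffs

root_19 = sqrt_coefficients(19, 13)
```

Because the pair `(m, d)` can only take finitely many values, it must eventually revisit an earlier one, which is why the coefficients cycle.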

How about $A(e)$? Well, after the first two coefficients, we notice an interesting pattern: $211$, then $411$, then $611$. The leading value of each triplet never repeats, so the sequence is non-periodic, but it is easy enough to compute. The situation looks even more bleak when you consider $A(\pi)$.

Thus $\phi$ (golden ratio) and $\sqrt{19}$ feature repeating coefficients, but $\pi$ and $e$ (Euler’s number) do not. What differentiates these groups?

Of these numbers, only the transcendental numbers fail to exhibit a period. Can this pattern be generalized? Probably. 🙂 There exists an unproved conjecture in number theory, that all infinite, non-periodic leading coefficients with bounded terms are transcendental.

Real Approximation As Coefficient Trimming

Stare at the digits of $\pi$. Can you come up with a fraction that approximates it?

Perhaps you have picked up the trick that $\frac{22}{7}$ is surprisingly close:

$\pi \approx 3.14159265359$

$\dfrac{22}{7} \approx \textbf{3.14}285714286$

But could you come up with $\frac{22}{7}$ from first principles? More to the point, could you construct a fraction that comes yet closer to $\pi$ ‘s position on the number line?

Decomposing these numbers into continued fractions should betray the answer:

$A(\pi) = (3, 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, ...)$

$A\left(\dfrac{22}{7}\right) = (3, 7)$

We can approximate any irrational number by truncating its leading coefficients. Want a more accurate approximation of $\pi$? Keep more coefficients:

$(3, 7, 15, 1) = A(\dfrac{355}{113})$

$\dfrac{355}{113} = \textbf{3.141592}92035$
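Going from a truncated coefficient list back to a fraction is a matter of folding from the innermost term outward. Here is a short sketch with exact arithmetic. The function name `from_coefficients` is an invention for this example, and the coefficient list is assumed to be non-empty with positive entries after the first.

```python
from fractions import Fraction

def from_coefficients(coeffs):
    """Collapse simple continued fraction coefficients into one fraction."""
    value = Fraction(coeffs[-1])
    # Work from the innermost term outward: a + 1 / (rest).
    for a in reversed(coeffs[:-1]):
        value = a + 1 / value
    return value

pi_coeffs = [3, 7, 15, 1, 292]
approximations = [from_coefficients(pi_coeffs[:k]) for k in range(1, 5)]
```

Each extra coefficient yields a fraction that sits closer to $\pi$, at the cost of a larger denominator.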

I’ll note in passing that this style of approximation resembles how algorithms approximate the frequency of signals by discarding smaller eigenvalues.

Much ink has been spilled on the number $\pi$. For example, does it contain roughly equal frequencies of 3s and 7s? When you generalize this question to any base (not just base 10), the question becomes whether $\pi$ is a normal number. Most mathematicians suspect the answer is Yes, but this remains pure conjecture to-date.

Let’s return to the coefficients of $A( \pi )$. Here is a graph of the first two hundred:

Do you see a pattern? I don’t.

Let’s zoom out. This encyclopedia displays the first 20,000 coefficients of $A( \pi )$:

So $A(\pi)$ affords no obvious pattern. Is there another way to generate the digits of $\pi$ such that a pattern emerges?

Let a quadratic continued fraction represent a number $n$ expressed as:

Set $A = (1, 2, 2, 2, 2, ... )$. Here only $B = ( b_1, b_2, b_3, ...)$ is allowed to vary. Astonishingly, the following fact is true:

$B\left(\dfrac{4}{\pi}\right) = (1, 3, 5, 7, 9, 11, 13, 15, 17, \dots)$
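We can test this numerically. The sketch below assumes the quadratic continued fraction has the form $n = a_0 + \frac{b_1^2}{a_1 + \frac{b_2^2}{a_2 + \dots}}$, which matches the classical expansion $\frac{4}{\pi} = 1 + \frac{1^2}{2 + \frac{3^2}{2 + \frac{5^2}{2 + \dots}}}$. The function name `four_over_pi` and the truncation depth are choices made for this example.

```python
import math

def four_over_pi(depth):
    """Evaluate the quadratic continued fraction, truncated after `depth` levels."""
    tail = 2.0  # innermost coefficient a_depth
    for k in range(depth, 0, -1):
        b = 2 * k - 1                    # b_k runs over the odd numbers
        a_prev = 1.0 if k == 1 else 2.0  # a_0 = 1, every later a_k = 2
        tail = a_prev + b * b / tail
    return tail

pi_estimate = 4 / four_over_pi(1000)
error = abs(pi_estimate - math.pi)
```

The pattern is beautiful but the convergence is slow: even a thousand levels only pin down the first few decimals of $\pi$, so this is a conceptual win rather than a practical way to compute digits.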

Thus, continued fractions allow us to make sense out of important transcendental numbers like $\pi$.

I’ll close with a quote:

Continued fractions are, in some ways, more “mathematically natural” representations of a real number than other representations such as decimal representations.