[Sequence] Cognitive Architecture

Central Organizing Principles

Main Sequence

Version Iterations

  • Version 0.1: The mind as a collection of functional modules.
  • Version 1.0: Integrating anatomy and genetics.
  • Version 1.1: Integrating ecology, morality, frame memory, and social systems.
  • Version 1.2: Integrating theory of mind, social reasoning and culture.
  • Version 1.3: Integrating memory hierarchy, motivation, frame memory.
  • Version 1.4: Integrating working memory, attention, consciousness.
  • Version 1.5: Integrating cognitive control, affective neuroscience.

[Sequence] Statistics

Core Sequence

Bayesian Statistics

Singular Value Decomposition

Part Of: Algebra sequence
Followup To: Eigenvalues and Eigenvectors
Content Summary: 1300 words, 13 min read.

Limitations of Eigendecomposition

Last time, we learned how to locate eigenvalues and eigenvectors of a given matrix. Diagonalization is the process of factoring a square matrix into three terms: a matrix Q (with eigenvectors along the columns), a diagonal matrix \Lambda (with eigenvalues along the diagonal), and the inverse of Q. When A is symmetric, Q is orthogonal, so Q^{-1} = Q^T and we can write:

A = Q \Lambda Q^T

But as we saw with the spectral theorem, eigendecomposition only works well against square, symmetric matrices. If a matrix isn’t symmetric, it is easy to run into complex eigenvalues. And if a matrix isn’t square, you are out of luck entirely!

Eigendecomposition- Spectral Theorem (1)

Can we generalize eigendecomposition to apply to a wider family of matrices? Can we diagonalize matrices of any size, even if they don’t have “nice” properties?

Yes. Self-transposition is the key insight of our “eigendecomposition 2.0”. We define the self-transpositions of A as AA^{T} and A^{T}A.

Suppose A \in \mathbb{R}^{m \times n}. Then AA^T \in \mathbb{R}^{m \times m} and A^TA \in \mathbb{R}^{n \times n}. So these matrices are square. But they are also symmetric!

To illustrate, consider the following.

A = \begin{bmatrix}  4 & 4 \\ -3 & 3 \\ \end{bmatrix}

Since A is not symmetric, we have no guarantee that its eigenvalues are real. Indeed, its eigenvalues turn out to be complex:

\det(A - \lambda I) = \begin{vmatrix} 4 - \lambda & 4 \\ -3 & 3 - \lambda \\ \end{vmatrix} = 0

(12 - 7 \lambda + \lambda^2) + 12 = 0 \Rightarrow \lambda^2 -7 \lambda + 24 = 0

\lambda = \frac{7 \pm \sqrt{(-7)^2 - 4*1*24}}{2*1} = \frac{7}{2} \pm \frac{\sqrt{47}i}{2}

Eigendecomposition of A is thus rather unpleasant. Are its self-transposed matrices any better?

A^TA = \begin{bmatrix} 4 & -3 \\ 4 & 3 \\ \end{bmatrix} \begin{bmatrix} 4 & 4 \\ -3 & 3 \\ \end{bmatrix} = \begin{bmatrix} 25 & 7 \\ 7 & 25 \\ \end{bmatrix}

AA^T = \begin{bmatrix} 4 & 4 \\ -3 & 3 \\ \end{bmatrix} \begin{bmatrix} 4 & -3 \\ 4 & 3 \\ \end{bmatrix} = \begin{bmatrix} 32 & 0 \\ 0 & 18 \\ \end{bmatrix}

These matrices are symmetric! Thus, they are better candidates for eigendecomposition.

Towards Singular Value Decomposition

Singular Value Decomposition (SVD) is based on the principle that all matrices are eigendecomposable after self-transposition. It is essentially a bug fix:

SVD- Self-Transposition

An important way to picture SVD is with the idea of orthogonal bases. It is relatively easy to find any number of orthogonal bases for a given rowspace. Call the matrix of orthogonal vectors V.

We desire to find an orthogonal basis V such that AV produces an orthogonal basis in the column space. Orthogonal bases are not particularly hard to find. But most orthogonal bases, once passed through A, lose their orthogonality! We desire that particular orthogonal basis V whose image in the column space is also orthogonal; call that image U.

We won’t require vectors in U to be the same size as those in V. Instead, we will normalize V; its basis vectors will be orthonormal (orthogonal and unit length). The corresponding vectors in the column space will then differ from unit length by a scaling factor.

As we will soon see, these scaling factors are not eigenvalues, so we use sigmas rather than lambdas:

  • Scaling factors \sigma , analogous to eigenvalues \lambda.
  • Diagonal matrix \Sigma , analogous to diagonal matrix \Lambda

Our full picture then, looks like this:

SVD- Orthogonal Basis Transfer

Let us now translate this image into matrix language.

A \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ v_1 & v_2 & \dots & v_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix} = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ \sigma_1u_1 & \sigma_2u_2 & \dots & \sigma_nu_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix}

But we can easily factorize the right-hand side:

\begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ \sigma_1u_1 & \sigma_2u_2 & \dots & \sigma_nu_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix} = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ u_1 & u_2 & \dots & u_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix} * \begin{bmatrix} \sigma_1 & 0 & \dots & 0 \\ 0 & \sigma_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_n \\ \end{bmatrix}

So we have that:

AV = U \Sigma

Since both V and U are orthogonal, inversion of either is equivalent to transposition:

A = U \Sigma V^{-1} = U \Sigma V^T

This strongly resembles our diagonalization equation A = Q \Lambda Q^T. SVD distinguishes itself by using two orthogonal matrices, U and V, rather than just one (Q).

Recalling that A = U \Sigma V^T ,

A^TA = (V \Sigma^T U^T) (U \Sigma V^T)

But now the innermost term U^TU cancels to the identity. Since \Sigma is a square diagonal matrix, \Sigma^T \Sigma is simply \Sigma^{2}. So,

A^TA = V \Sigma^2 V^T

Since A^{T}A is a square, symmetric matrix, our diagonalization theorem applies!

A^TA = V \Sigma ^2 V^T = Q \Lambda Q^T

To find U, a similar trick works:

AA^T =  (U \Sigma V^T)(V \Sigma^T U^T) = U \Sigma^2 U^T = Q \Lambda Q^T

The relationships between SVD and eigendecomposition are as follows:

  • V is the eigenvectors of A^TA
  • U is the eigenvectors of AA^T
  • \Sigma is the square root of the eigenvalues matrix \Lambda

If any eigenvalue is negative, the corresponding sigma factor would be complex. But A^TA and AA^T are positive-semidefinite, which guarantees non-negative eigenvalues. This assures us that \Sigma contains only real values.

In contrast to eigendecomposition, every matrix has an SVD. Geometrically, V^T and U act as rotational transformations, and \Sigma acts as a scaling transformation. In other words, every linear transformation comprises a rotation, then a scaling, then another rotation.

SVD- Rotation Scaling Rotation (1)

A Worked Example

Let’s revisit A. Recall that:

A = \begin{bmatrix} 4 & 4 \\ -3 & 3 \\ \end{bmatrix}, A^TA = \begin{bmatrix} 25 & 7 \\ 7 & 25 \\ \end{bmatrix}, AA^T = \begin{bmatrix} 32 & 0 \\ 0 & 18 \\ \end{bmatrix}

Eigendecomposition against A is unpleasant because A is not symmetric. But A^{T}A is guaranteed to be positive semi-definite; that is, to have non-negative eigenvalues. Let’s see this in action.

det(A^TA - \lambda I) = \begin{vmatrix} 25 - \lambda & 7 \\ 7 & 25 - \lambda \\ \end{vmatrix} = 0

\lambda^2 - 50 \lambda + 576 = 0 \Rightarrow (\lambda - 32)(\lambda - 18) = 0

\lambda_1 = 32, \lambda_2 = 18

trace(A^TA) = 50 = \sum{\lambda_i}

det(A^TA) = 625-49 = 576 = 18 * 32 = \prod{\lambda_i}

These are positive, real eigenvalues. Perfect! Let’s now derive the corresponding (normalized) eigenvectors.

A^TA - 32I = \begin{bmatrix} -7 & 7 \\ 7 & -7 \\ \end{bmatrix}, A^TA - 18I = \begin{bmatrix} 7 & 7 \\ 7 & 7 \\ \end{bmatrix}

rref(A^TA - 32I) = \begin{bmatrix} 1 & -1 \\ 0 & 0 \\ \end{bmatrix}, rref(A^TA - 18I) = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ \end{bmatrix}

v_1 = \begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \\ \end{bmatrix}, v_2 = \begin{bmatrix} \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \\ \end{bmatrix}

SVD intends to decompose A into U \Sigma V^{T}. The above findings give us two of these ingredients.

V^{T} = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{-1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \end{bmatrix}

\Sigma =\begin{bmatrix} \sqrt{32} & 0 \\ 0 & \sqrt{18} \\ \end{bmatrix}

What’s missing? U! To find it, we perform eigendecomposition on AA^{T}. This is an especially easy task, because AA^{T} is already a diagonal matrix. One caution: the signs of the columns of U are not free. They must satisfy Av_i = \sigma_i u_i, which for our choice of v_1 and v_2 gives u_1 = (1, 0) and u_2 = (0, 1).

AA^T = \begin{bmatrix} 32 & 0 \\ 0 & 18 \\ \end{bmatrix}

U = \begin{bmatrix} u_1 & u_2 \\ \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix}

We have arrived at our first Singular Value Decomposition.

A = U \Sigma V^T = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix} \begin{bmatrix} \sqrt{32} & 0 \\ 0 & \sqrt{18} \\ \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{-1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \end{bmatrix}

Okay, so let’s check our work. 😛

A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix} \left( \begin{bmatrix} \sqrt{32} & 0 \\ 0 & \sqrt{18} \\ \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{-1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \end{bmatrix} \right) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix} \begin{bmatrix} 4 & 4 \\ -3 & 3 \\ \end{bmatrix} = \begin{bmatrix} 4 & 4 \\ -3 & 3 \\ \end{bmatrix}

These matrices are a viable factorization: multiplication successfully recovers A.
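If you want to cross-check this by machine, here is a minimal numpy sketch (not part of the original derivation; note that numpy orders singular values descending and may choose different signs for the columns of U and V):

    import numpy as np

    A = np.array([[4.0, 4.0],
                  [-3.0, 3.0]])

    # Library SVD: A = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(A)
    print(s)                                      # [5.657, 4.243] = [sqrt(32), sqrt(18)]
    print(np.allclose(U @ np.diag(s) @ Vt, A))    # True: the factorization recovers A

    # Relationship to eigendecomposition: the eigenvalues of A^T A are the sigma^2
    evals, evecs = np.linalg.eigh(A.T @ A)
    print(evals)                                  # [18., 32.]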

Takeaways

  • Eigendecomposition only works for a subclass of matrices; SVD decomposes all matrices.
  • SVD relies on self-transposition to convert any arbitrary matrix into one that works well against eigendecomposition (guarantees square m = n and symmetric A = A^{T}).
  • Another way to interpret SVD is by taking a special kind of orthogonal basis that, once passed through the linear transformation, preserves its orthogonality.
  • Every matrix A = U \Sigma V^{T}. That is, every linear transformation can be conceived as rotation + scaling + rotation.

Until next time.

Eigenvalues and Eigenvectors

Part Of: Algebra sequence
Followup To: An Introduction to Linear Algebra
Next Up: Singular Value Decomposition
Content Summary: 1300 words, 13 min read

Geometries of Eigenvectors

Matrices are functions that act on vectors, by mapping from row-vectors to column-vectors.  Consider two examples:

  1. Reflection matrices, which reflect vectors across some basis.
  2. Rotation matrices, which rotate vectors counterclockwise by \theta degrees.

eigenvectors-transformation-geometries-1

The set of eigenvectors of a matrix A is a special set of input vectors for which the matrix behaves as a scaling transformation. In other words, we desire the set of vectors \vec{x} whose output vectors A\vec{x} differ from \vec{x} by only a scaling factor.

Eigenvectors have a straightforward geometric interpretation:

  1. Reflection eigenvectors are orthogonal or parallel to the reflecting surface. In the left image above, that is the top two pairs of vectors.
  2. Rotation eigenvectors do not exist in \mathbb{R}^2: no nonzero real vector keeps its direction under a nontrivial rotation.

Algebra of Eigenvectors

We can express our “parallel output” property as:

A\vec{x} = \lambda \vec{x}

Thus \vec{x} and A\vec{x} point in the same direction, but differ by scaling factor \lambda.

Scaling factor \lambda is the eigenvalue. There can be many \left( x, \lambda \right) pairs that satisfy the above equality.

For an \mathbb{R}^{n \times n} matrix, there are n eigenvalues (counted with multiplicity). These eigenvalues can be difficult to find. However, two facts aid our search:

  • The sum of eigenvalues equals the trace (sum of values along the diagonal).
  • The product of eigenvalues equals the determinant.

To solve, subtract \lambda \vec{x} from both sides:

A\vec{x} = \lambda \vec{x}

(A - \lambda I)\vec{x} = 0

We would like to identify n unique eigenvectors. But if the new matrix (A - \lambda I) has a trivial nullspace (only the zero vector), it yields no eigenvectors. So we desire this new matrix to be singular.

How to accomplish this?  By finding eigenvalues that satisfy the characteristic equation \det(A - \lambda I) = 0. Matrices are singular iff their determinants equal zero.

Let’s work through an example! What is the eigendecomposition of matrix A:

A = \begin{bmatrix} 3 & 1 \\ 1 & 3 \\ \end{bmatrix}

We need to find eigenvalues that solve the characteristic equation.

\det(A - \lambda I) = \begin{vmatrix} 3-\lambda & 1 \\ 1 & 3-\lambda \\ \end{vmatrix} = 0

(3 - \lambda)^2 - 1^2 = \lambda^2 -6\lambda + 8 = (\lambda-2)(\lambda-4) = 0

\lambda_1 = 2, \lambda_2 = 4

Are these eigenvalues correct? Let’s check our work:

trace(A) = 6 = \sum{\lambda_i}

det(A) = 8 = \prod{\lambda_i}

How to find our eigenvectors? By solving the nullspace given each eigenvalue.

For \lambda_1=2 :

A - 2I = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ \end{bmatrix} \Rightarrow rref(A - 2I) = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ \end{bmatrix}

(\lambda_1, \vec{x}_1) = (2, \begin{bmatrix} 1 \\ -1 \\ \end{bmatrix})

For \lambda_2=4 :

A - 4I = \begin{bmatrix} -1 & 1 \\ 1 & -1 \\ \end{bmatrix} \Rightarrow rref(A - 4I) = \begin{bmatrix} 1 & -1 \\ 0 & 0 \\ \end{bmatrix}

(\lambda_2, \vec{x}_2) = (4, \begin{bmatrix} 1 \\ 1 \\ \end{bmatrix})
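For readers who want to verify these numbers by machine, here is a minimal numpy sketch (an add-on, not part of the derivation; numpy returns unit-length eigenvectors, so they will be scaled versions of (1, -1) and (1, 1)):

    import numpy as np

    A = np.array([[3.0, 1.0],
                  [1.0, 3.0]])

    evals, evecs = np.linalg.eig(A)
    print(evals)       # [4., 2.] (order is not guaranteed)
    print(evecs)       # columns are unit-length eigenvectors

    # Sanity checks: trace = sum of eigenvalues, det = product of eigenvalues
    print(np.isclose(np.trace(A), evals.sum()))          # True
    print(np.isclose(np.linalg.det(A), evals.prod()))    # True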

Desirable Matrix Properties

The above example was fairly straightforward. But eigendecomposition can “go awry”, as we shall see. Consider a rotation matrix, which in two dimensions has the following form:

R = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \\ \end{bmatrix}

What are the eigenvalues for rotation \theta = 90^{\circ} ?

R = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ \end{bmatrix}

\det(R - \lambda I) = \begin{vmatrix} - \lambda & -1 \\ 1 & - \lambda \\ \end{vmatrix} = 0

(- \lambda)^2 + 1 = 0 \Rightarrow \lambda^2 = -1

\lambda_1 = i, \lambda_2 = -i

We can check our work:

trace(R) = 0 = \sum{\lambda_i}

det(R) = 1 = \prod{\lambda_i}

We saw earlier that rotation matrices have no real eigenvectors to visualize. Here, we have shown algebraically that their eigenvalues are complex.

A = \left[ \begin{smallmatrix} 3 & 1 \\ 1 & 3 \\ \end{smallmatrix} \right] has real eigenvalues, but R = \left[ \begin{smallmatrix} 0 & -1 \\ 1 & 0 \\ \end{smallmatrix} \right] has less-desirable complex eigenvalues.

We can generalize the distinction between A and R as follows:

Spectral Theorem. Any matrix that is symmetric (A = A^T) is guaranteed to have real eigenvalues. The corresponding n eigenvectors are guaranteed to be orthogonal.

In other words, eigendecomposition works best against symmetric matrices.

Eigendecomposition- Spectral Theorem (1)

Diagonalization

Let us place each eigenvector in the column of a matrix S. What happens when you multiply the original matrix A by this new matrix? Since S contains eigenvectors, multiplication by A reduces to multiplication by the associated eigenvalues:

AS = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ \lambda_1x_1 & \lambda_2x_2 & \dots & \lambda_nx_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix}

We see the product contains a mixture of eigenvalues and eigenvectors. We can separate these by “pulling out” the eigenvalues into a diagonal matrix. Call this matrix \Lambda (“capital lambda”).

AS = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ x_1 & x_2 & \dots & x_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix} * \begin{bmatrix} \lambda_1 & 0 & \dots & 0 \\ 0 & \lambda_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \lambda_n \\ \end{bmatrix} = S \Lambda

Most matrices have the property that their eigenvectors are linearly independent. For such matrices, S is invertible. Given this fact, we can solve for A:

\Lambda = S^{-1} A S

A = S \Lambda S^{-1}

Matrices that can be factorized in this way are said to be diagonalizable. We can see that both elimination and eigendecomposition are performing the same type of work: factorizing matrices into their component parts.

If A is symmetric, then we know \Lambda is real, and the eigenvectors in S are orthogonal. Let us rename S to Q, to reflect this additional property. But orthogonal matrices have the property that transposition equals inversion: Q^T = Q^{-1}. Thus, if A is symmetric, we can simplify the diagonalization formula to:

A = Q \Lambda Q^{-1} = Q \Lambda Q^T

Asymptotic Interpretations

This diagonalization approach illustrates an important use case of eigenvectors: power matrices. What happens when A is applied arbitrarily many times? What does the output look like in the limit?

We can use the diagonalization equation to represent A^k:

A^k = \left( Q \Lambda Q^{-1} \right)^k = (Q \Lambda Q^{-1})(Q \Lambda Q^{-1})\dots(Q \Lambda Q^{-1})

We can simplify by canceling the inner terms Q^{-1}Q:

A^k = Q \Lambda^k Q^{-1}

This equation tells us that the eigenvectors are invariant to how many times A is applied. In contrast, the eigenvalue matrix \Lambda has important implications for ongoing processes:

  • If each eigenvalue has magnitude less than one, the output will trend towards zero.
  • If each eigenvalue has magnitude greater than one, the output will trend to infinity.
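Here is a small numpy sketch of the power formula (an illustrative check, reusing the symmetric matrix A from the worked example above): diagonalize A, then compare Q \Lambda^k Q^{-1} against repeated multiplication.

    import numpy as np

    A = np.array([[3.0, 1.0],
                  [1.0, 3.0]])
    k = 10

    evals, Q = np.linalg.eig(A)
    Lambda_k = np.diag(evals ** k)

    # A^k via diagonalization vs. direct repeated multiplication
    Ak_diag = Q @ Lambda_k @ np.linalg.inv(Q)
    Ak_direct = np.linalg.matrix_power(A, k)
    print(np.allclose(Ak_diag, Ak_direct))   # True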

Fibonacci Eigenvalues

The powers interpretation of eigenvalues sheds light on the behavior of linear processes. This includes number sequences such as the Fibonacci numbers, where each number is the sum of the previous two.

Recall the Fibonacci numbers are 0,1,1,2,3,5,8,13,... What is F_{100} ?

Eigenvalues can answer this question. We must first express the Fibonacci generator as a linear equation:

F(k+2) = 1F(k+1) + 1F(k)

In order to translate this into a meaningful matrix, we must add a “redundant” equation:

F(k+1) = 1F(k+1) + 0F(k)

With these equations, we can create a 2×2 Fibonacci matrix F.

F = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ \end{bmatrix}

This matrix uniquely generates Fibonacci numbers.

u_1 = Fu_0 = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ \end{bmatrix}

u_4 = F^4u_0 = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ \end{bmatrix}^4 \begin{bmatrix} 1 \\ 0 \\ \end{bmatrix} = \begin{bmatrix} 5 \\ 3 \\ \end{bmatrix}

To discover the rate at which Fibonacci numbers grow, we decompose F into its eigenvalues:

\det(F - \lambda I) = \begin{vmatrix} 1 - \lambda & 1 \\ 1 & 1- \lambda \\ \end{vmatrix} = 0

\lambda^2 - \lambda - 1 = 0

 \lambda_1 = \frac{1 + \sqrt{5}}{2}, \lambda_2 = \frac{1 - \sqrt{5}}{2}

 \lambda_1 = 1.61803, \lambda_2 = -0.61803

trace(F) = 1 = \sum{\lambda_i}

det(F) = -1 = \prod{\lambda_i}

We can go on to discover eigenvectors x_1 and x_2. We can then express the initial state u_0 as a combination of these eigenvectors, u_0 = c_1 x_1 + c_2 x_2. Powers of F then simply scale each component:

u_k = F^k u_0 = c_1 \lambda_1^k x_1 + c_2 \lambda_2^k x_2

As k goes to infinity, the second term goes to zero. Thus, the ratio is dominated by the larger eigenvalue, 1.61803.

Mathematicians in the audience will recognize this number as the golden ratio.

eigenvectors-fibonacci-and-golden-ratio

We have long known that the ratio of successive Fibonacci numbers converges to 1.61803. Eigenvalues provide a mechanism to derive this value analytically.
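A short numpy sketch (supplementary to the post) makes both claims concrete: the ratio of successive Fibonacci numbers approaches the dominant eigenvalue of F.

    import numpy as np

    F = np.array([[1.0, 1.0],
                  [1.0, 0.0]])
    u0 = np.array([1.0, 0.0])

    # F^k u0 stacks consecutive Fibonacci numbers; their ratio approaches 1.61803...
    for k in [5, 10, 20]:
        uk = np.linalg.matrix_power(F, k) @ u0
        print(k, uk, uk[0] / uk[1])

    evals, _ = np.linalg.eig(F)
    print(evals.max())                 # 1.618... -- the golden ratio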

Until next time.

[Sequence] Algebra

Logic

Set Theory

Abstract Algebra

Linear Algebra

Category Theory


An Introduction to Linear Algebra

Part Of: Algebra sequence
Content Summary: 1300 words, 13 min read.

Linear algebra is the math of vectors and matrices. Let me attempt to explain it as succinctly as possible.

Vector Operations

If n is a positive integer and ℝ is the set of real numbers, then ℝ^n is the set of all n-tuples of real numbers. A vector v ∈ ℝ^n is one such n-tuple. For example,

V = (v_1, v_2, v_3) \in (\mathbb{R}, \mathbb{R}, \mathbb{R}) \equiv \mathbb{R}^3

The six vector operations are: addition, subtraction, scaling, cross product, dot product, norm (vector length). For vectors u and v, these are defined as follows:

linear-algebra-vector-operations-1

The dot product and the cross product norm are related to the angle between two vectors. Specifically,

u \cdotp v = \|u\| \|v\| cos(\theta)

\|u \times v\| = \|u\| \|v\| sin(\theta)

Finally, note that the cross product is not commutative: u × v ≠ v × u (in fact, u × v = -(v × u)).
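If you want to experiment with these operations, here is a minimal numpy sketch (the vectors u and v are arbitrary examples, not from the text):

    import numpy as np

    u = np.array([1.0, 2.0, 2.0])
    v = np.array([2.0, 0.0, 1.0])

    dot = np.dot(u, v)
    cross = np.cross(u, v)
    theta = np.arccos(dot / (np.linalg.norm(u) * np.linalg.norm(v)))

    # The two angle identities
    print(np.isclose(dot, np.linalg.norm(u) * np.linalg.norm(v) * np.cos(theta)))                    # True
    print(np.isclose(np.linalg.norm(cross), np.linalg.norm(u) * np.linalg.norm(v) * np.sin(theta)))  # True
    # The cross product is anti-commutative
    print(np.allclose(cross, -np.cross(v, u)))                                                       # True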

Matrix Operations

A matrix A ∈ ℝ^{m × n} is a rectangular array of real numbers with m rows and n columns. For example,

A= \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix} \in \begin{bmatrix} \mathbb{R} & \mathbb{R} \\ \mathbb{R} & \mathbb{R} \\ \mathbb{R} & \mathbb{R} \\ \end{bmatrix} \equiv \mathbb{R}^{3 \times 2}

Important matrix operations are addition, subtraction, product, transpose, and determinant:

linear-algebra-matrix-operations-2

Of the above operations, matrix product is especially significant. This operation is only possible when A has the same number of columns as B has rows.

Matrix as Morphism

Importantly, matrices with one row or one column can be interpreted as vectors. Matrix multiplication with a vector can be interpreted in two ways: the row picture interprets the output as a set of dot products, while the column picture interprets the output as a linear combination of columns.

linear-algebra-row-vs-column-picture-4

Let’s take A to be a 2 by 2 matrix with row-vectors (2, -1) and (-1,2), and let b = (0,3). When we solve for x, we find there are two ways to do so:

linear-algebra-row-vs-column-geometry-1

In the above example, we set b = (0,3). But this linear combination could generate any vector b. In general, linear combinations can generate the entirety of a space (in this case, a two-dimensional plane). 

In both views, we say A transforms x → b. If vectors are groups of data, matrices are functions that operate on vectors. Specifically, we call multiplication by a matrix A ∈ ℝ^{m × n} a linear transformation T_A : ℝ^n → ℝ^m, which converts an n-tuple into an m-tuple.

The symmetry between functions and linear transformations runs deep. When we explore category theory, we will discover the structure underpinning this relationship. But in the meantime, consider the following equivalences:

linear-algebra-function-vs-linear-transformation-4

Especially important to matrix functions is the notion of inverse matrix. Just as f^{-1}(x) = ln(x) undoes the effects of f(x) = e^x, inverse matrix A^{-1} undoes the effects of A. The cumulative effect of applying A^{-1} after A is the identity matrix A^{-1}A = 𝟙. The identity matrix is analogous to multiplying by one, or adding zero: it is algebraically inert.

Only square matrices can be invertible, because reversal must work on both sides: AA^{-1} = A^{-1}A = 𝟙. But not all square matrices have inverses. The determinant of a matrix checks for invertibility: det(A) ≠ 0 iff A is invertible.

Matrix inversion is vitally important to solving systems of linear equations.

  • AX = B → multiply both sides by A^{-1} on the left → X = A^{-1}B
  • XA = B → multiply both sides by A^{-1} on the right → X = BA^{-1}
  • ABXC = E → put C^{-1} on the right and B^{-1}A^{-1} on the left → X = B^{-1}A^{-1}EC^{-1}
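As a concrete illustration of the first identity, here is a minimal numpy sketch (reusing the matrix A and vector b from the row/column picture example above; in practice one prefers solve() over explicitly forming the inverse):

    import numpy as np

    A = np.array([[2.0, -1.0],
                  [-1.0, 2.0]])
    b = np.array([[0.0],
                  [3.0]])

    # AX = B  =>  X = A^-1 B
    X = np.linalg.solve(A, b)
    print(X)                                        # [[1.], [2.]]
    print(np.allclose(np.linalg.inv(A) @ b, X))     # True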

Elimination As Inverse Solver

To solve Ax = b, we must find inverse matrix A^{-1}. This can be done via Gauss-Jordan Elimination. This method affords three row operations:

  1. Linear Combination: R_i += kR_j
  2. Scaling: R_i *= k
  3. Row Swap: R_i ⇆ R_j

After creating a matrix of the form Ax = b, we can solve for x by creating an augmented matrix of the form [ A | b ], and converting the left-hand side into the identity matrix:

linear-algebra-gauss-jordan-elimination-2

To solve for x algebraically, Ax = b → A^{-1}Ax = A^{-1}b → 𝟙x = A^{-1}b. So Gauss-Jordan facilitates the discovery of the inverse matrix A^{-1}. We can show this computation explicitly, by setting up an augmented matrix of the form [ A | 𝟙 ]:

linear-algebra-gauss-jordan-inverse-matrix-6

Row operations are functions that act on vectors. So are matrices. Thus, it isn’t very surprising that our three row operations each correspond to an elementary matrix. The elementary matrix is similar to the identity matrix; in the below picture, only the shaded region differs.

linear-algebra-elementary-matrices-7

We now see that the row operations of Gauss-Jordan elimination are a kind of factoring: we are finding the elementary matrices underlying A^{-1}:

linear-algebra-gauss-jordan-elementary-matrices-2

Thus E_4E_3E_2E_1 = A^{-1} and Ax = b → (E_4E_3E_2E_1)Ax = (E_4E_3E_2E_1)b → x = A^{-1}b
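Here is a minimal sketch of Gauss-Jordan inversion in Python (an illustration under the usual partial-pivoting convention; it performs the same three row operations, though not the exact sequence of elementary matrices shown in the figure):

    import numpy as np

    def gauss_jordan_inverse(A):
        """Invert A by row-reducing the augmented matrix [ A | I ]."""
        n = A.shape[0]
        M = np.hstack([A.astype(float), np.eye(n)])
        for col in range(n):
            # Row swap: move the largest pivot candidate into position
            pivot = col + np.argmax(np.abs(M[col:, col]))
            if np.isclose(M[pivot, col], 0.0):
                raise ValueError("matrix is singular")
            M[[col, pivot]] = M[[pivot, col]]
            # Scaling: make the pivot equal 1
            M[col] /= M[col, col]
            # Linear combination: clear every other entry in the pivot column
            for row in range(n):
                if row != col:
                    M[row] -= M[row, col] * M[col]
        return M[:, n:]                # the right half is now A^-1

    A = np.array([[2.0, -1.0],
                  [-1.0, 2.0]])
    print(np.allclose(gauss_jordan_inverse(A) @ A, np.eye(2)))   # True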

Fundamental Vector Spaces

A vector space consists of a set of vectors and all linear combinations of these vectors. The set of all linear combinations is called the span. For example, the vector space S = span{ v_1, v_2 } consists of all vectors of the form v = αv_1 + βv_2, where α and β are real numbers.

Vector spaces can be characterized by a basis: a set of vectors whose linear combinations create the vector space. The vectors of any basis must be linearly independent; they need not be orthogonal, but orthogonal bases are especially convenient. You can check whether two vectors are orthogonal by confirming their dot product u · v = 0. The dimension of a vector space is the cardinality of its basis (the number of basis vectors).

Recall that matrices are functions that project vectors from ℝ^n to ℝ^m. This corresponds to projecting from row space to column space, as follows:

linear-algebra-fundamental-space-interpretation-6

We will unpack the full implications of this graphic another time. For now, it suffices to understand that discovering the bases of these subspaces is a useful exercise.

For example, if we show that a matrix has a nontrivial kernel, this in itself is proof of non-invertibility. Why? Because if T_A sends a nonzero vector to the zero vector, there is no T_A^{-1} that can undo this operation.

We can say more. For an n × n matrix A, the following statements are equivalent:

  1. A is invertible
  2. The determinant of A is nonzero.
  3. The RREF of A is the n × n identity matrix
  4. The rank of the matrix is n
  5. The row space of A is ℝ^n
  6. The column space of A is ℝ^n
  7. The null space of A is trivial (only the zero vector, N(A) = {0})

Elimination as Basis Identifier

We have seen elimination discover (and factorize) the matrix inverse. But what happens when an inverse does not exist? Well, elimination has a natural, unique stopping point:

linear-algebra-rref-derivation-2

Matrices in reduced row echelon form (RREF) have the following properties:

  1. Rows with all zeroes are moved to the bottom.
  2. The first nonzero number from the left (pivot) is to the right of the pivot above it.
  3. Every pivot equals 1 and is the only nonzero entry in its column

Gauss-Jordan Elimination can replace every matrix with an equivalent one in rref form. Note that we have seen several examples where elimination creates the identity matrix 𝟙 when A is invertible. This is only possible because 𝟙 fits the rref criteria.

Once a matrix is in rref form, it becomes trivial to discover the basis for its three fundamental spaces:

  • R(A) basis: the nonzero rows of the RREF
  • C(A) basis: the columns of the original matrix A that correspond to pivot columns
  • N(A) basis: the solutions to Ax = 0 (one basis vector per free column)

linear-algebra-deriving-fundamental-spaces-2

We see that rank(A) + nullity(A) = 3 + 1 = 4, which indeed is the number of columns. Also, the column space of A equals the row space of A^T. This is not an accident, since the transpose simply swaps rows with columns.
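If you want to compute these bases by machine, sympy exposes them directly. A minimal sketch (the matrix below is an arbitrary example with rank 3 and nullity 1, not the matrix from the figure):

    from sympy import Matrix

    A = Matrix([[1, 0, 0, 1],
                [0, 1, 0, 1],
                [0, 0, 1, 1]])

    rref_form, pivot_cols = A.rref()
    print(rref_form)                   # the RREF of A
    print(pivot_cols)                  # pivot column indices: C(A) basis comes from these columns of A
    print(A.nullspace())               # basis vectors of N(A)
    print(A.rank(), len(A.nullspace()), A.cols)   # rank + nullity = number of columns (3 + 1 = 4)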

Takeaways

  • Vectors are groups of numbers; matrices are functions that act on vectors.
  • We can solve linear equations by conducting Gauss-Jordan elimination.
  • Gauss-Jordan elimination-based solutions rely on searching for the inverse matrix A^{-1}
  • The domain of a matrix is spanned by its row vectors; its codomain is spanned by its column vectors.
  • Even if a matrix is not invertible, elimination can find its most reduced form (RREF).
  • RREF matrices can be used to derive fundamental subspaces.

Until next time.


Algorithmic Dark Matter: Duality in Linear Programming

Part Of: Optimization sequence
Content Summary: 800 words, 8 min read.

Today, I introduce the concept of duality. Buckle your seat belts! 🙂

Max Flow algorithm

The problem of flow was originally studied in the context of the USSR railway system in the 1930s. The Russian mathematician A.N. Tolstoy published his Methods of finding the minimal total kilometrage in cargo-transportation planning in space, where he formalized the problem as follows.

We interpret transportation graphically. Vertices are interpreted as cities, and edges are railroads connecting two cities. The capacity of each edge is the amount of goods that particular railroad can transport in a given day. We assume the bottleneck lies solely in the capacities, not in production or consumption, and that no storage is available at the intermediate cities.

The flow allocation problem defines source and termination vertices, s and t. We desire to maximize volume of goods transported s → t. To do this, we label each edge with the amount of goods we intend to ship on that railroad. This quantity, which we will call flow, must respect the following properties:

  • Flow cannot exceed the capacity of the railroad.
  • Flow is conserved: stuff leaving a city must equal the amount of stuff arriving.

Here are two possible solutions to flow allocation:

duality-two-flow-solutions-1

Solution B improves on A by pushing volume onto the b → c railway. But are there better solutions?

To answer rigorously, we formalize max flow as a linear optimization problem:

duality-max-flow-as-linear-optimization-1

The solution to LP tells us that no, eight is the maximum possible flow.

Min Cut algorithm

Consider another, seemingly unrelated, problem we might wish to solve: separability. Let X ⊂ E be a set of edges whose removal eliminates every connection s → t; we seek the smallest such set. Here are two such solutions:

Duality- Two Cut Solutions (1)

Can we do better than B? The answer is no: { (b,t), (c,d) } is the minimum cut possible in the current graph. In linear programming terms, it is the optimal solution.

Note that the optimal value of minimum cut and the optimal value of max flow coincide: 8 = 8. This is not a coincidence. In fact, these problems are intimately related to one another. What the min cut algorithm is searching for is the bottleneck: the smallest section of the “pipe” from s → t. For complex graphs like this, it is not trivial to derive this answer visually; but the separability algorithm does the work for us.

The deep symmetry between max flow and min cut demonstrates an important mathematical fact. All algorithms come in pairs. For this example, we will call max flow and min cut the primal and dual problems, respectively. We will explore the ramifications of this another time. For now, let’s approach duality from an algebraic perspective.

Finding LP Upper Bound

Consider a linear program with the following objective function:

\max (2x_1 + x_2)

And these constraints

4x_1 + x_2 \leq 6

3x_1 + 2x_2 \leq 5

x_1, x_2 \geq 0

This program wants to find the largest objective value possible, given the constraints. Can we provide an upper bound on the solution?

Yes. We can immediately say that the solution is no greater than 6. Why? The objective function, 2x_1 + x_2, is never larger than 4x_1 + x_2, because we know all variables are non-negative. So we have an upper bound OPT ≤ 6. We can sharpen this upper bound by comparing the objective function to other linear combinations of the constraints.

duality-generalizing-search-for-upper-bounds-4

Different weights to our linear combinations produce different upper bounds:

  • (1,0) → 6
  • (0,1) → 5
  • (⅓, ⅓ ) → 3.67

Let us call these two weights (y_1, y_2). What values of these variables give us the smallest upper bound? Importantly, this is itself an objective function: \min (6y_1 + 5y_2).

But y_1 and y_2 are constrained: the resulting combination must dominate the objective 2x_1 + x_2 for all non-negative x_1, x_2. Thus,

y_1(a) + y_2(b) \geq 2x_1 + x_2

y_1 \left( 4x_1 + x_2 \right) + y_2 \left( 3x_1 + 2x_2 \right) \geq 2x_1 + x_2

\left(4y_1 + 3y_2 \right) x_1 + \left (y_1 + 2y_2 \right) x_2 \geq 2x_1 + x_2

(4y_1 + 3y_2) \geq 2 and (y_1 + 2y_2) \geq 1

This gives us our two constraints. Thus, by looking for the lowest upper bound on our primal LP, we have derived our dual LP:

duality-primal-vs-dual-example-4

Note the extraordinary symmetry between primal and dual LPs. The purple & orange values are mirror images of one another. Further, the constraint coefficient matrix has been transposed (the 3 has swapped across the diagonal). This symmetry is reflected in the above linear algebra formulae.
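We can also check this pair numerically. Here is a minimal sketch using scipy.optimize.linprog (which minimizes, so the primal objective is negated); both programs should report the same optimal value, roughly 3.2, illustrating strong duality:

    from scipy.optimize import linprog

    # Primal: max 2x1 + x2   s.t.  4x1 + x2 <= 6,  3x1 + 2x2 <= 5,  x >= 0
    primal = linprog(c=[-2, -1],                    # negate: linprog minimizes
                     A_ub=[[4, 1], [3, 2]],
                     b_ub=[6, 5],
                     bounds=[(0, None), (0, None)])

    # Dual:   min 6y1 + 5y2  s.t.  4y1 + 3y2 >= 2,  y1 + 2y2 >= 1,  y >= 0
    dual = linprog(c=[6, 5],
                   A_ub=[[-4, -3], [-1, -2]],       # flip signs to express >= constraints
                   b_ub=[-2, -1],
                   bounds=[(0, None), (0, None)])

    print(-primal.fun, dual.fun)       # both ~3.2: strong duality holds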

A Theory of Duality

Recall that linear programs have three possible outcomes: infeasible (no solution exists), unbounded (solution exists at +/-∞) or feasible/optimal. Since constraints are nothing more than geometric half-spaces, these possible outcomes reflect three kinds of polyhedra:

duality-three-possible-outcomes

The outcomes of primal and dual programs are predictably correlated. Of the nine potential pairings, only four can actually occur:

  1. Both P and D are infeasible
  2. P is unbounded and D is infeasible
  3. D is unbounded and P is infeasible
  4. Both are feasible, and there exist optimal solutions

Finally, in the above examples, we saw that the optimal primal and dual values coincide: p^* = d^* (8 = 8). But this is not always the case: in general, the two optima can be separated by a duality gap.

We can distinguish between two kinds of duality:

  • Strong duality, where p^* = d^*
  • Weak duality, where a nonzero duality gap separates p^* and d^*

Takeaways

Today, we have illustrated a deep mathematical fact: all problems come in pairs. Next time, we will explore the profound ramifications of this duality principle.

Related Resources: CMU Lecture 5: LP Duality

An Introduction to Probability Theory

Part Of: Statistics sequence
Related To: An Introduction to Set Theory
Content Summary: 400 words, 4 min read.

“Probability theory is nothing but common sense reduced to calculation.” – Laplace

Introducing Probability Theory

Probability theory, as formulated by Andrey Kolmogorov in 1933, has two ingredients:

  1. A space which define the mathematical objects (“the nouns”)
  2. Axioms which define the mathematical operations (“the verbs”)

A probability space is a 3-tuple (Ω,𝓕,P):

  1. Sample Space (Ω): A set of possible outcomes. Outcomes in Ω must be mutually exclusive and collectively exhaustive.
  2. σ-Algebra (𝓕): A collection of events (subsets of Ω). If Ω is countable, this can simply be the power set; otherwise a Borel algebra is often used.
  3. Probability Measure (P): A function P: 𝓕 → [0,1] which maps each event to a real number between 0 and 1.

The Kolmogorov axioms provide “rules of behavior” for the residents of probability space:

  1. Non-negativity: probabilities can never be negative, P(A) ≥ 0 for every event A.
  2. Unitarity: the probability of the entire sample space is 1.0 (“something has to happen”).
  3. Sigma Additivity: the probability of a countable union of mutually exclusive events equals the sum of their individual probabilities.

probability-structural-overview-3

Random Variables

A random variable is a real-valued function X: Ω → ℝ. A random variable is a function, but not a probability function. Rather, instantiating a random variable X = x picks out the subset of outcomes ⍵ ∈ Ω such that X(⍵) = x; that is, x picks out a preimage in Ω. Variable instantiation thus provides a language to select groups of outcomes from Ω.

Random variables with discrete outcomes (countable Ω) are known as discrete random variables. We can define probability mass functions (PMFs) such that

f_X(x) = P(X=x) = P( { \omega \in \Omega : X(\omega) = x } )

In contrast, continuous random variables have continuous outcomes (uncountable Ω). For this class of variable, the probability of any single exact outcome is zero. Instead, we must define probabilities against a particular interval. The probability of exactly 5.0000000… inches of snow is 0%; it is more meaningful to discuss the probability of 5 ± 0.5 inches of snowfall. Thus, we define probability density functions (PDFs) such that:

P[a \leq X \leq b] = \int_a^b f_X(x) dx

We can summarize discrete PMFs and continuous PDFs in the following graphic:

probability-pmf-vs-pdf-1

Marginal Probabilities

Consider two events, A, B ⊆ Ω. Several operators may act on these events, which parallel similar devices in Boolean algebra and set theory.

probability-operators

Suppose we want to know the probability of either A or B occurring. For this, we rely on the Set Combination Theorem:

probability-set-combination-4

Union involves subtracting the intersection; else the purple region is counted twice. In our post on set theory, we saw this same idea expressed as the inclusion-exclusion principle (Definition 13).
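A quick Monte Carlo sketch (a supplementary check, using one fair die as the sample space) confirms the theorem numerically:

    import numpy as np

    rng = np.random.default_rng(0)
    rolls = rng.integers(1, 7, size=100_000)   # sample space: one fair die

    A = (rolls % 2 == 0)    # event A: the roll is even
    B = (rolls >= 4)        # event B: the roll is at least 4

    p_union = np.mean(A | B)
    p_incl_excl = np.mean(A) + np.mean(B) - np.mean(A & B)
    print(p_union, p_incl_excl)        # both ~0.667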

Summary

This first post of a two-part series explored the first six concepts of probability theory. Next time, we will learn about concepts 7-12.

probability-theory-theorems-4

These definitions and theorems are the cornerstone upon which much reasoning is built. It pays to learn them well.


Error Correcting Codes

Part Of: Information Theory sequence
Followup To: An Introduction To Communication
Content Summary: 900 words, 9 min read

Motivations

Information theory was first conceived as a theory of communication. An employee of Bell Labs, Shannon was interested in the problem of noise. Physical imperfections on your data cable, for example, can distort your signal. 

communication-general-architecture

How to protect against noise? One answer is to purchase better equipment.

But there is another, less expensive answer. Errors can be reduced by introducing redundancy into the signal. But redundancy has a price: it also slows the rate of communication.

Shannon strove to understand this error vs rate trade-off quantitatively. Is error-free communication possible? And if so, how many extra bits would that require?

Before we can appreciate Shannon’s solution, it helps to understand the basics of error correcting codes (ECC). We turn first to a simple ECC: replication codes.

Replication Codes

Error correcting codes have two phases: encoding (where redundancy is injected) and decoding (where redundancy is used to correct errors).

Consider repetition code R3, with the following encoding scheme:

0 → 000
1 → 111

Decoding occurs via majority rule: if most bits in a block are zero, the original bit is decoded as 0, and vice versa. Explicitly:

000, 001, 010, 100 → 0
111, 110, 101, 011 → 1

Suppose I want to communicate the word “red” to you. To do this, I encode each letter into ASCII and send those bits over the Internet, to be consumed by your Internet browser. Using our simplified ASCII: ‘r’ → 10010. To protect from noise, we repeat each bit three times: 10010 → 111000000111000.

Let us imagine that 6 bits are flipped: 010000110011100. The last error caused the code 000 → 100, which the decoding algorithm successfully ignores. But the first two noise events caused 111 → 010, which causes a decoding error; so do the two flips that turned the third block 000 → 110. These decoding errors in turn change the ASCII interpretation. This pattern of noise will cause you to receive the word “fed”, even though I typed “red”.

ecc-replication-code-lifecycle
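Here is a minimal Python sketch of R3 (supplementary to the post; the flipped positions follow the example above):

    def encode_r3(bits):
        """Repeat each bit three times."""
        return "".join(b * 3 for b in bits)

    def decode_r3(bits):
        """Majority rule over each 3-bit block."""
        blocks = [bits[i:i+3] for i in range(0, len(bits), 3)]
        return "".join("1" if block.count("1") >= 2 else "0" for block in blocks)

    message = "10010"                  # 'r' in the simplified 5-bit code
    sent = encode_r3(message)          # '111000000111000'
    received = "010000110011100"       # the six bit flips from the example above
    print(decode_r3(received))         # '00110' -- two blocks decode incorrectly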

Analysis of Repetition Codes

Let us formalize the majority rule algorithm. Decoding involves counting the number of bit flips required to transform the received block into each valid codeword; this count is the Hamming distance. Hamming distance can be visualized graphically, with each bit corresponding to a dimension. On this approach, majority rule becomes a nearest neighbor search.

hamming_distance

What is the performance of R3? Let us call the noise probability p: the chance that any one bit is flipped. We can model the total number of bit flips with the binomial distribution:

ecc-binomial-distribution

In R3, a decoding error requires at least two bit flips within a 3-bit block. Thus, P(error) = P(r ≥ 2 | p, 3). We can then compute the frequency of decoding errors.

ecc-replication-error-rate-analysis

For p = 10% error rate, we can graph R3 vs R5 performance:

ecc-rate-vs-error-diagram-2
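These curves can be reproduced point by point with a short calculation. A minimal Python sketch (supplementary; an R_n block decodes incorrectly when more than half of its n bits flip):

    from math import comb

    def block_error_rate(p, n):
        """P(decoding error) = P(more than half of the n bits in a block flip)."""
        k_min = n // 2 + 1
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

    p = 0.10
    print(block_error_rate(p, 3))   # ~0.028 for R3
    print(block_error_rate(p, 5))   # ~0.0086 for R5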

Hamming Codes

In error correcting codes, we distinguish redundant bits versus message bits. Replication codes have a very straightforward mapping between these populations. However, we might envision schemes where each redundant bit safeguards multiple message bits.

Consider the Hamming code, invented in 1950. Richard Hamming shared an office with Claude Shannon for several years. He is also the father of Hamming distance metric, discussed above.

Hamming codes split messages into blocks, and add redundant bits that map to the entire block. We will be examining the H(7,4) code. This code takes 4-bit blocks, and adds three redundant bits at the end. Each redundant bit then “maps to” three of the message bits:

ecc-repetition-vs-hamming-5

We now define the encoding and decoding algorithms for Hamming Codes. Let message bits s ∈ S be organized into blocks. Let each redundant bit t ∈ T map to a particular subset of the block. For example, s(t7) → { s3, s4, s1 }.

For the H(7,4) code, each redundant bit maps to a set of three message bits. If there is an odd number of 1s in this set, the redundant bit is set to 1. Otherwise, it is set to zero. Binary (modulo-2) addition, e.g. 1+1=0, allows us to express this algebraically:

t_i = ∑ s(t_i)

A Venn diagram helps to visualize this encoding:

ecc-hamming-code-encoding-1

H(7,4) decoding is complicated by the fact that both message bits and redundant bits can be corrupted by noise. We begin by identifying whether there are any mismatches between each received t_i and the recomputed ∑ s(t_i). If there are, that means *something* has gone wrong.

For H(7,4), there are three redundant bits in every block; there are thus eight different syndromes (error signatures). The decoding algorithm must respond to each possible syndrome, as follows:

ecc-hamming-code-decoding-1
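For the curious, here is a minimal Python sketch of H(7,4) (supplementary; it uses one common parity assignment, which may differ from the exact mapping in the figures above):

    # One common H(7,4) layout (an assumption; the figure's mapping may differ):
    # codeword = [d1, d2, d3, d4, p1, p2, p3], where
    # p1 checks {d1, d2, d4}, p2 checks {d1, d3, d4}, p3 checks {d2, d3, d4}
    PARITY_CHECKS = [(0, 1, 3), (0, 2, 3), (1, 2, 3)]

    def encode(d):
        parity = [sum(d[i] for i in check) % 2 for check in PARITY_CHECKS]
        return list(d) + parity

    def decode(received):
        r = list(received)
        # Syndrome: compare each received parity bit against a recomputed one
        syndrome = tuple((r[4 + j] + sum(r[i] for i in check)) % 2
                         for j, check in enumerate(PARITY_CHECKS))
        # Each nonzero syndrome implicates exactly one bit position
        error_pos = {(1, 1, 0): 0, (1, 0, 1): 1, (0, 1, 1): 2, (1, 1, 1): 3,
                     (1, 0, 0): 4, (0, 1, 0): 5, (0, 0, 1): 6}
        if syndrome in error_pos:
            r[error_pos[syndrome]] ^= 1      # flip the implicated bit
        return r[:4]                         # recovered message bits

    msg = [1, 0, 1, 1]
    corrupted = encode(msg)
    corrupted[2] ^= 1                        # noise flips one message bit
    print(decode(corrupted) == msg)          # True: any single-bit error is corrected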

The error rate for H(7,4) is calculated as follows. 

ECC- Rate vs Error Repetition and Hamming (1)

Application To Information Theory

This example demonstrates that Hamming codes can outperform replication codes. By making the structure of the redundant bits more complex, it is possible to reduce the error rate while saving space.

Can error-correcting codes be improved infinitely? Or are there limits to code performance, besides human imagination? 

Shannon proved that there are limits to error-correcting codes. Specifically, he divides the error-rate space into attainable and unattainable regions:

ecc-rate-vs-error-unattainable-region

Another important application of information theory is the design of decoding algorithms. Decoding H(7,4) is perhaps the most complex algorithm presented above. For more sophisticated ECCs, these decoding tasks become even more difficult.

Information theory provides a principled way to automate the discovery of optimal decoding algorithms. By applying the principle of maximum entropy, we can radically simplify the process of designing error-correcting codes.

We will explore these ideas in more detail next time. Until then!