Logic Inference: Sequent Calculus

Part Of: Logic sequence
Followup To: Natural Deduction
Content Summary: 600 words, 6 min read

Motivating Sequent Calculus

Last time, we labelled propositions in the language of verification.

  • ↑ represents conjecture: propositions that require verification
  • ↓ represents assumption: propositions that can be used for verification.

Two of our connective rules (⊃I and ∨E) expanded our set of assumptions, which we could then use at any later time. Logical acumen lies in invoking the right assumption at the right time.

In contrast to natural deduction, sequent calculus explicitly tracks the set of assumptions as it varies across different branches of the proof tree.

We will use the turnstile to distinguish assumptions from conjecture: { assumptions } ⊢ { conjectures }

In natural deduction, progress is bidirectional: we are done when we have found a connection between assumptions and conjecture. In sequent calculus, progress is unidirectional: we start with no assumptions, and finish when we have no conjectures left to demonstrate.

Sequent Calculus- Different Schematics (1)

Both logical systems rely on two sets of five rules. They bear the following relationships:

  • R = I. Right rules are very similar to Introduction rules.
  • L = E⁻¹. Left rules must be turned “upside down”.

Right and Left rules

We here define capital gamma Γ to represent the context, or current set of assumptions.

Right rules simply preface Introduction rules with “Γ ⊢”. The exception, ⊃R, is instructive: there, A is added to the context, and our “target” conjecture shrinks to just B.
Sequent Calculus- Right vs Introduction (2)
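For concreteness, here is how ⊃R reads when written out in standard sequent notation (a sketch; the figure above presumably depicts the same rule, alongside the hyp rule we will use later):

\dfrac{\Gamma, A \vdash B}{\Gamma \vdash A \supset B}\;(\supset R) \qquad \dfrac{}{\Gamma, A \vdash A}\;(hyp)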

Left rules are less transparently related to Elimination. They are more easily understood by an English explanation:

Sequent Calculus- Left Rule English Interpretation (2)

The entire structure of sequent calculus, then, looks like this:

Sequent Calculus- Left and Right Rules (1)

Enough theory! Let’s use sequent calculus to prove stuff.

Example 1: Implication

Show that (A ⊃ (B ⊃ C)) ⊃ ((A ⊃ B) ⊃ (A ⊃ C)).

Here, ⊃R serves us well:

Sequent Calculus- Implication Step0 (1)

We have parsed the jungle of connectives, and arrived at a clear goal. We need to prove C. How?

Recall what ⊃L means: “if you have assumed A ⊃ B, you may also assume B (right branch), provided you can prove A with your current assumptions (left branch).”
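Written as an inference figure, this English reading corresponds to the following sketch of ⊃L (standard sequent notation; the left premise proves A, the right premise continues with B added to the context):

\dfrac{\Gamma, A \supset B \vdash A \qquad \Gamma, A \supset B, B \vdash C}{\Gamma, A \supset B \vdash C}\;(\supset L)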

Let’s apply ⊃L to the A ⊃ B proposition sitting in our context. To save space, let us define Γ as the set of three assumptions { A⊃(B⊃C), A⊃B, A }.

Sequent Calculus- Implication Step1.png

We can solve the left branch immediately. Since A ∈ Γ, we can invoke the hyp rule.

Unfortunately, assuming B is not enough to prove C. We must invoke ⊃L again, this time against our A⊃(B⊃C) assumption.

Sequent Calculus- Implication Step2

And again, on our newfound B⊃C assumption.

Sequent Calculus- Implication Step3.png

Wait! By now our context contains A, B, and C. Each leaf of the proof tree is provable by hyp.

Sequent Calculus- Implication Step4

QED. It is instructive to compare this sequent calculus proof with the analogous natural deduction proof (which we solved together last time).

Sequent Calculus- Implication Comparing Proofs

Example 2: Distributivity

Show that (A ∨ (B ∧ C)) ⊃ ((A ∨ B)  ∧ (A ∨ C)).

The first two steps here are straightforward. Simplify the conjecture string!

Sequent Calculus- Distributivity Step0 (1)

Note that Γ = { A ∨ (B ∧ C) }. Here, we use ∨L to split this assumption into two components:

Sequent Calculus- Distributivity Step1 (2)
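For reference, ∨L has the following shape (a sketch in standard sequent notation): to use an assumed disjunction, prove the goal once under each disjunct.

\dfrac{\Gamma, A \vee B, A \vdash C \qquad \Gamma, A \vee B, B \vdash C}{\Gamma, A \vee B \vdash C}\;(\vee L)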

We now have four conjectures to prove. Fortunately, each proof has become trivial:

Sequent Calculus- Distributivity StepF

QED.

Takeaways

In this post, we introduced sequent calculus (SC) as an alternative deductive calculus. Sequent calculus makes the notion of context (the assumption set) explicit, which tends to make its proofs bulkier but more linear than the natural deduction (ND) style. The two approaches share several symmetries: SC right rules correspond fairly rigidly to ND introduction rules, for example.

If you want to learn sequent calculus for yourself, I recommend solving the converse problems to the two examples above. Specifically,

  • Given (A ⊃ B) ⊃ (A ⊃ C), show that A ⊃ (B ⊃ C).
  • Given (A ∨ B) ∧ (A ∨ C), show that A ∨ (B ∧ C).

Until next time!

 

Logic Inference: Natural Deduction

Part Of: Logic sequence
Content Summary: 500 words, 5 min read

Introduction

Logical systems like IPL have the following ingredients:

  • A proposition is an atomic statement that can acquire a truth value.
  • A connective takes atomic propositions and melds them into a composite.

We can label propositions in the language of verification.

  • ↑ represents conjecture: propositions that require verification
  • ↓ represents knowledge: propositions that can be used for verification.

Introduction and elimination rules can be expressed in this language:

Logic Metalanguage- Original Rules (1)
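To make the ↑/↓ annotations concrete, here is a sketch of how a few of these rules read (the figure above presumably lists the full set):

\dfrac{A\uparrow \quad B\uparrow}{A \wedge B\uparrow}\;(\wedge I) \qquad \dfrac{A \wedge B\downarrow}{A\downarrow}\;(\wedge E_1) \qquad \dfrac{\begin{array}{c}\overline{A\downarrow}\\ \vdots\\ B\uparrow\end{array}}{A \supset B\uparrow}\;(\supset I)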

Elimination rules tend to “point down”; introduction rules point up. Roughly, deduction involves applying such rules until the paths meet:

IPL Inference- Schematic

Enough theory! Let’s see how this works in practice.

Exercise One: Implication Exploration

Given A ⊃ (B ⊃ C), show that (A ⊃ B) ⊃ (A ⊃ C).

We can visualize the challenge as follows. The red line indicates common knowledge.

IPL Inference- Implication Exploration Step0

First, let’s apply elimination on the premises:

IPL Inference- Implication Exploration Step1

Next, let’s apply introduction on the conclusion:

IPL Inference- Implication Exploration Step2.png

Are we done? No: we have not verified A↑ and B↑. If we had, they would have a red line over them.

To finish the proof, we need to invoke our introduction-rule assumptions.

IPL Inference- Implication Exploration Step3

Proving A↑ is trivial. Proving B↑ requires combining assumptions via elimination.

IPL Inference- Implication Exploration Step4 (2)

Done. 🙂 Good work!

Exercise for the Reader

Prove the converse is true. Given (A ⊃ B) ⊃ (A ⊃ C), show that A ⊃ (B ⊃ C).

Example 2: Distributivity

In arithmetic, distributivity refers to how addition and multiplication can interleave with one another. It requires that a * (b + c) = (a*b) + (a*c). For example:

  • 2 * (4+5) = 2 * 9 = 18
  • (2*4) + (2*5) = 8 + 10 = 18

Are logical conjunction and disjunction distributive? Specifically, given A ∨ (B ∧ C), can we show (A ∨ B) ∧ (A ∨ C)? Let’s find out!

IPL Inference- Distributivity Exploration Step0 (1)
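Before hunting for the IPL proof, a quick brute-force check over classical truth assignments (only supporting evidence, since a truth table is weaker than an intuitionistic proof) confirms that the implication never fails:

from itertools import product

# Check that A ∨ (B ∧ C) implies (A ∨ B) ∧ (A ∨ C) for every truth assignment.
for a, b, c in product([False, True], repeat=3):
    lhs = a or (b and c)
    rhs = (a or b) and (a or c)
    assert (not lhs) or rhs, (a, b, c)

print("A ∨ (B ∧ C) classically entails (A ∨ B) ∧ (A ∨ C)")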

First, let’s introduce conjunction on the conclusion.

IPL Inference- Distributivity Exploration Step1

Here we reach an impasse. We need to apply disjunction elimination on the premise. But what should we choose for C?

Let’s set C = A or B.

IPL Inference- Distributivity Exploration Step2.png

Filling in the gaps is straightforward. On the right, we eliminate conjunction and retain B. Then we introduce disjunction on both sides.

IPL Inference- Distributivity Exploration Step2 (1)

Here is where I originally got stuck. How can we use disjunction elimination?

The way forward becomes easier to grasp, when you remember:

  • We can use knowledge as many times as we like.
  • The symbols in the rule schematics are arbitrary.

Let’s set the arbitrary elimination symbol “C”  equal to A or C:

IPL Inference- Distributivity Exploration Step4 (1)

From here, the solution is straightforward.

IPL Inference- Distributivity Exploration Step5 (1)

Exercise for the Reader

Prove the converse is true. Given (A ∨ B) ∧ (A ∨ C), show that A ∨ (B ∧ C).

Takeaways

In this post, we saw worked examples of deduction. Specifically:

  • Given A ⊃ (B ⊃ C), show that (A ⊃ B) ⊃ (A ⊃ C).
  • Given A ∨ (B ∧ C), show that (A ∨ B) ∧ (A ∨ C).

The best way to learn is practice. For the interested reader, I recommend these exercises:

  • Given (A ⊃ B) ⊃ (A ⊃ C), show that A ⊃ (B ⊃ C).
  • Given (A ∨ B) ∧ (A ∨ C), show that A ∨ (B ∧ C).

In the latter exercise, you must also “get creative” with disjunction elimination. Instead of choosing an arbitrary C, you must set the conclusion of the elimination rule to a useful value.

… still stuck? Okay, see solution here. 🙂

Until next time.

 

Five Tribes of Machine Learning

Part Of: Machine Learning sequence
Content Summary: 900 words, 9 min read

ML is tribal, not monolithic

Research in artificial intelligence (AI) and machine learning (ML) has been going on for decades. Indeed, the textbook Artificial Intelligence: A Modern Approach reveals a dizzying variety of learning algorithms and inference schemes. How can we make sense of all the technologies on offer?

As argued in Domingos’ book The Master Algorithm, the discipline is not monolithic. Instead, five tribes have progressed relatively independently. What are these tribes?

  1. Symbolists use formal systems. They are influenced by computer science, linguistics, and analytic philosophy.
  2. Connectionists use neural networks. They are influenced by neuroscience.
  3. Bayesians use probabilistic inference. They are influenced by statistics.
  4. Evolutionaries are interested in evolving structure. They are influenced by biology.
  5. Analogizers are interested in mapping to new situations. They are influenced by psychology.

Expert readers may better recognize these tribes by their signature technologies:

  • Symbolists use decision trees, production rule systems, and inductive logic programming.
  • Connectionists rely on deep learning technologies, including RNN, CNN, and deep reinforcement learning.
  • Bayesians use Hidden Markov Models, graphical models, and causal inference.
  • Evolutionaries use genetic algorithms, evolutionary programming, and evolutionary game theory.
  • Analogizers use k-nearest neighbor, and support vector machines.

Five Tribes- Strengths and Technologies

In fact, my blog can be meaningfully organized under this research landscape.

History of Influence

Here are some historical highlights in the development of artificial intelligence.

Symbolist highlights:

  • 1950: Alan Turing proposes the Turing Test in Computing Machinery & Intelligence.
  • 1974-80: Frame problem & combinatorial explosion caused First AI Winter.
  • 1980: Expert systems & production rules re-animate the field. 
  • 1987-93: Expert systems too fragile & expensive, causing the Second AI Winter.
  • 1997: Deep Blue defeated reigning chess world champion Garry Kasparov.

Connectionist highlights:

  • 1957: Perceptron invented by Frank Rosenblatt.
  • 1969: Minsky and Papert publish the book Perceptrons, criticizing single-layer perceptrons. This puts the entire field to sleep, until…
  • 1986: Backpropagation invented, and connectionist research restarts.
  • 2006: Hinton et al publish A fast learning algorithm for deep belief nets, which rejuvenates interest in Deep Learning.
  • 2017: AlphaGo defeats reigning Go world champion, using DRL.

Bayesian highlights:

  • 1953: Markov chain Monte Carlo (MCMC) invented. Bayesian inference finally becomes tractable on real problems.
  • 1968: Hidden Markov Model (HMM) invented.
  • 1988: Judea Pearl authors Probabilistic Reasoning in Intelligent Systems, and creates the discipline of probabilistic graphical models (PGMs).
  • 2000: Judea Pearl authors Causality: Models, Reasoning, and Inference, and creates the discipline of causal inference on PGMs.

Evolutionary highlights

  • 1975: Holland invents genetic algorithms.

Analogizer highlights

  • 1968: k-nearest neighbor algorithm increases in popularity.
  • 1979: Douglas Hofstadter publishes Godel, Escher, Bach.
  • 1992: support vector machines (SVMs) invented.

We can summarize this information visually, by creating an AI version of the Histomap:

Five Tribes- Historical Size and Competition (2)

These data are my own impression of AI history. It would be interesting to replace it with real funding & paper volume data.

Efforts Towards Unification

Will there be more or fewer tribes, twenty years from now? And which sociological outcome is best for AI research overall?

Theory pluralism and cognitive diversity are underappreciated assets to the sciences. But scientific progress is often heralded by unification. Unification comes in two flavors:

  • Reduction: identifying isomorphisms between competing languages,
  • Generalization: creating a supertheory that yields antecedents as special cases.

Perhaps AI progress will mirror revolutions in physics, like when Maxwell unified theories of electricity and magnetism.

Symbolists, Connectionists, and Bayesians suffer from a lack of stability, generality, and creativity, respectively. But one tribe’s weakness is another tribe’s strength. This is a big reason why unification seems worthwhile.

What’s more, our tribes possess “killer apps” that other tribes would benefit from. For example, only Bayesians are able to do causal inference. Learning causal relations in logical structures, or in neural networks, is an important unsolved problem. Similarly, only Connectionists are able to explain modularity (function localization). Symbolist and Bayesian tribes are more normative than Connectionism, which makes their technologies tend towards (overly?) universal mechanisms.

Symbolic vs Subsymbolic

You’ve heard of the symbolic-subsymbolic debate? It’s about reconciling Symbolist and Connectionist interpretations of neuroscience. But some (e.g., [M01]) claim that both theories might be correct, but at different levels of abstraction. Marr [M82] once outlined a hierarchy of explanation, as follows:

  • Computational: what is the structure of the task, and viable solutions?
  • Algorithmic: what procedure should be carried out, in producing a solution?
  • Implementation: what biological mechanism in the brain performs the work?

One theory, supported by [FP88], is that Symbolist architectures (e.g., ACT-R) may be valid explanations, but somehow “carried out” by Connectionist algorithms & representations.

Five Tribes- Tribes vs Levels (2)

I have put forward my own theory, that Symbolist representations are properties of the Algorithmic Mind; whereas Connectionism is more relevant in the Autonomic Mind.

This distinction may help us make sense of why [D15] proposes Markov Logic Networks (MLN) as a bridge between Symbolist logic and Bayesian graphical models. He is seeking to generalize these technologies into a single construct, in the hopes that he can later find a reduction of MLN in the Connectionist paradigm. Time will tell.

Takeaways

Today we discussed five tribes within ML research: Symbolists, Connectionists, Bayesians, Evolutionaries, and Analogizers. Each tribe has different strengths, technologies, and developmental trajectories. These categories help to parse technical disputes, and locate promising research vectors.

The most significant problem facing ML research today is: how do we unify these tribes?

References

  • [D15] Domingos (2015). The Master Algorithm
  • [M01] Marcus (2001). The Algebraic Mind
  • [M82] Marr (1982). Vision
  • [FP88] Fodor & Pylyshyn (1988). Connectionism and cognitive architecture: A critical analysis

OLS Estimation via Projection

Part Of: Machine Learning sequence
Content Summary: 800 words, 8 min read

Projection as Geometric Approximation

If we have a vector b and a line determined by vector a, how do we find the point on the line that is closest to b?

OLS- Geometry of Projection

The closest point p is at the intersection formed by a line through b that is orthogonal to a. If we think of p as an approximation to b, then the length of e = b - p is the error of that approximation. Writing p = xa for some scalar x, orthogonality of the error requires:

a^T e = a^T (b - xa) = 0 \Rightarrow x = \frac{a^Tb}{a^Ta}

This formula captures projection onto a vector. But what if you want to project to a higher dimensional surface?

Imagine a plane whose basis vectors are a_1 and a_2. This plane can be described by a matrix whose columns are those basis vectors:

A = \begin{bmatrix} a_1 & a_2 \end{bmatrix}

Suppose we want to project vector b onto this plane. We can use the same orthogonality principle as before:

A^Te = A^T(b-Ax) = 0

A^TAx = A^Tb

Matrices like A^TA are self-transpositions. We have seen that such matrices are square, symmetric, and positive semidefinite, and thereby have real, non-negative eigenvalues.

We shall assume that the columns of A are independent, so that A^TA is invertible. The inverse then allows us to solve for x:

(A^TA)^{-1}(A^TA)x = (A^TA)^{-1}A^Tb

x = (A^TA)^{-1}A^Tb

Recall that,

p = Ax = A(A^TA)^{-1}A^Tb

Since matrices are linear transformations (functions that operate on vectors), it is natural to express the problem in terms of a projection matrix P that accepts a vector b and outputs the approximating vector p:

p = Pb

By combining these two formulas, we solve for P:

P = A(A^TA)^{-1}A^T

Thus, we have two perspectives on the same underlying formula:

OLS- Regression Functions via Matrices (1)
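A small numerical sketch (using numpy; the plane and vector below are made-up example values) shows the projection matrix in action, and confirms that P is idempotent, i.e. projecting twice changes nothing:

import numpy as np

# A plane in R^3 spanned by two basis vectors (arbitrary example values).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
b = np.array([3.0, 1.0, 2.0])

P = A @ np.linalg.inv(A.T @ A) @ A.T    # P = A (A^T A)^-1 A^T
p = P @ b                               # projection of b onto the column space
e = b - p                               # error vector

print(np.allclose(P @ P, P))            # True: projecting twice = projecting once
print(np.allclose(A.T @ e, 0))          # True: the error is orthogonal to the plane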

Linear Regression via Projection

We have previously noted that machine learning attempts to approximate the shape of the data. Prediction functions include classification (discrete output) and regression (continuous output).

Consider an example with three data points. Can we predict the price of the next item, given its size?

OLS- Regression Function Setup

For these data, a linear regression function will take the following form:

\psi : Size \rightarrow Price

\psi(Size) = \beta_0 + \beta_1 Size

We can thus interpret linear regression as an attempt to solve Ax=b:

OLS- Linear Algebra Regression Matrix

In this example, we have more data points than parameters (3 vs 2). This is an extremely common predicament in real-world problems: it yields matrices with many more equations than unknowns. This means that Ax=b has no exact solution (unless all data happen to fall on a straight line).

If exact solutions are impossible, we can still hope for an approximating solution. Perhaps we can find a vector p that best approximates b. More formally, we desire some p = A\bar{x} such that the error e = b-p is minimized.

Since projection is a form of approximation, we can use a projection matrix to construct our linear prediction function \psi : Size \rightarrow Price.

OLS- Least Squares Fundamental Spaces

A Worked Example

The solution is to make the error b-Ax as small as possible. Since Ax can never leave the column space, choose the closest point to b in that subspace. This point is the projection p. Then the error vector e = b-p has minimal length.

To repeat, the best combination p = Ax is the projection of b onto the column space. The error is perpendicular to that subspace. Therefore e = b-p is in the left nullspace:

Ax = b

A^TA = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \\ \end{bmatrix}

We can use Gauss-Jordan Elimination to compute the inverse:

(A^TA)^{-1} = \begin{bmatrix} 7/3 & -1 \\ -1 & 1/2 \\ \end{bmatrix}

A useful intermediate quantity is as follows:

(A^TA)^{-1}A^T = \begin{bmatrix} 7/3 & -1 \\ -1 & 1/2 \\ \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ \end{bmatrix} = \begin{bmatrix} 4/3 & 1/3 & -2/3 \\ -1/2 & 0 & 1/2 \\ \end{bmatrix}

We are now able to compute the parameters of our model, \bar{x}:

\bar{x} = \left[ (A^TA)^{-1}A^T \right] b = \begin{bmatrix} 4/3 & 1/3 & -2/3 \\ -1/2 & 0 & 1/2 \\ \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 2 \\ \end{bmatrix} = \begin{bmatrix} 2/3 \\ 1/2 \\ \end{bmatrix}

These parameters generate a predictive function with the following structure:

\psi : Size \rightarrow Price

\psi(Size) = \frac{2}{3} + \frac{1}{2}Size

These values correspond with the line that best fits our original data!
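We can replicate this worked example numerically (a sketch using numpy; the data matrix and price vector are the ones implied by the computation above):

import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])     # a column of ones, then the three sizes
b = np.array([1.0, 2.0, 2.0])  # the three observed prices

x_bar = np.linalg.inv(A.T @ A) @ A.T @ b
print(x_bar)                   # [0.6667 0.5], i.e. beta_0 = 2/3, beta_1 = 1/2

# numpy's least-squares solver agrees:
print(np.linalg.lstsq(A, b, rcond=None)[0])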

Wrapping Up

Takeaways:

  • In linear algebra, projection maps a vector to its closest approximation within a lower-dimensional subspace. The projection error can be measured.
  • In linear regression, we usually cannot solve Ax=b exactly, because there tends to be more data than parameters (b is not in the column space).
  • We can find the closest vector in the column space by projecting b onto that subspace, thereby minimizing the projection error.
  • Thus, the operation of projection can be used to perform parameter estimation, and produce a model that best approximates the training data.


Regression vs Classification

Part Of: Machine Learning sequence
Content Summary: 500 words, 5 min read

Motivations

Data scientists are in the business of answering questions with data. To do this, data is fed into prediction functions, which learn from the data, and use this knowledge to produce inferences.

Today we take an intuitive, non-mathematical look at two genres of prediction machine: regression and classification. Whereas these approaches may seem unrelated, we shall discover a deep symmetry lurking below the surface.

Introducing Regression

Consider a supermarket that has made five purchases of sugar from its supplier in the past. We therefore have access to five data points:

Regression Classification- Regression Data (3)

One of our competitors intends to buy 40kg of sugar. Can we predict the price they will pay?

This question can be interpreted visually as follows:

Regression Classification- Regression Prediction Visualization

But there is another, more systematic way to interpret this request. We can differentiate training data (the five observations where we know the answer) versus test data (where we are given a subset of the relevant information, and asked to generate the rest):

Regression Classification- Regression Prediction Schema

A regression prediction machine will, for any hypothetical x-value, predict the corresponding y-value. Sound familiar? This is just a function. There are in fact many possible regression functions, of varying complexity:

Regression Classification- Simple vs Complex Regression Outputs

Despite their simple appearance, each line represents a complete prediction machine. Each one can, for any order size, generate a corresponding prediction of the price of sugar.
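Since each line really is a complete prediction machine, it can be written as a one-line function. A minimal sketch (the intercept and slope below are made up for illustration, not fitted to the data above):

def predict_price(size_kg: float) -> float:
    # price = intercept + slope * size, with hypothetical coefficients
    intercept, slope = 5.0, 1.2
    return intercept + slope * size_kg

print(predict_price(40))   # predicted price for the competitor's 40 kg order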

Introducing Classification

To illustrate classification, consider another example.

Suppose we are an animal shelter, responsible for rescuing stray dogs and cats. We have saved two hundred animals; for each, we record their height, weight, and species:
Regression Classification- Classification Data (1)

Suppose we are left a note that reads as follows:

I will be dropping off a stray tomorrow that is 19 lbs and about a foot tall.

A classification question might be: is this animal more likely to be a dog or a cat?

Visually, we can interpret the challenge as follows:

Regression Classification- Classification Prediction Interpretation (2)

As before, we can understand this prediction problem as taking information gained from training data, to generate “missing” factors from test data:

Regression Classification- Classification Prediction Schema

To actually build a classification machine, we must specify a region-color map, such as the following:

Regression Classification- Simple Classification Output

Indeed, the above solution is complete: we can produce a color (species) label for any new observation, based on whether it lies above or below our line. 
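A region-color map can likewise be written as a small function: it labels any new (weight, height) pair by which side of the line it falls on. A minimal sketch (the boundary coefficients below are hypothetical):

def predict_species(weight_lbs: float, height_ft: float) -> str:
    # A made-up linear decision boundary separating dogs from cats.
    score = 0.8 * weight_lbs - 20.0 * height_ft
    return "dog" if score > 0 else "cat"

print(predict_species(19, 1.0))   # label for the stray described in the note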

But other solutions exist. Consider, for example, a rather different kind of map:

Regression Classification- Complex Classification Boundary

We could use either map to generate predictions. Which one is better? We will explore such questions next time.

Comparing Prediction Methods

Let’s compare our classification and regression models. In what sense are they the same?

Regresssion Classification- Comparing Models

If you’re like me, it is hard to identify similarities. But insight is obtained when you compare the underlying schemas:

Regresssion Classification- Comparing Schema

Here we see that our regression example was 2D, but our classification example was 3D. It would be easier to compare these models if we removed a dimension from the classification example.

Regresssion Classification- 1D Classification (1)

With this simplification, we can directly compare regression and classification:

Regresssion Classification- More Direct Comparison

Thus, the only real difference between regression and classification is whether the prediction (the dependent variable) is continuous or discrete.

Until next time.

Evolutionary Game Theory

Part Of: Game Theory sequence
Content Summary: 1300 words, 13 min read

Prisoner’s Dilemma Review

The classical Prisoner’s Dilemma has following setup:

Two prisoners, A and B, are interrogated separately and asked about one another.

  • If both prisoners betray the other, each of them serves 2 years in prison
  • If A betrays B but B remains silent, A will be set free and B will serve 3 years (and vice versa)
  • If both prisoners remain silent, they will each serve only 1 year in prison.

We can express the decision structure graphically:

IPD- Prisoner's Dilemma Overview

We can also represent the penalty structure. In what follows, arrows represent preference. CC → DC is true because, given that B cooperates, A would prefer the DC outcome (0 years in prison) more than CC (1 year).

IPD- Prisoner's Dilemma Regret

Our takeaways from our exploration of the Prisoner’s Dilemma:

  • Strategic dominance occurs when one choice outperforms all other choices, irrespective of competitor behavior. Here, defection dominates, so the game settles on DD.
  • A Pareto improvement improves at least one player’s outcome without harming any other player. Here, DD → CC represents such an improvement.
  • Pareto optimal outcomes are those outcomes which cannot be Pareto-improved.

The Prisoner’s Dilemma shows us that strategically-dominant outcomes need not be Pareto optimal.  Although each arrow points towards the origin for that color, the sum of all arrows points away from the origin.

It packages together the tragedy of the commons, a profound and uncomfortable fact of social living. A person can be incentivized towards an outcome that she, and everybody else, dislikes.

Towards Iterated Prisoner’s Dilemma (IPD)

In the one-off game, mutual defection is the only (economically) rational outcome. Yet when both players defect, each receives a bad result.

But consider what happens in a more social setting, where players compete for resources multiple times. An Iterated Prisoner’s Dilemma (IPD) has the following structure:

ipd

What strategy is best? Let’s consider two kinds of strategies we might adopt. We can imagine some vindictive prisoners always defecting (AD). Other prisoners might be more generous, adopting a Tit-for-Tat (TfT) strategy: they initially cooperate, and thereafter mirror their opponent’s previous move.

Let’s imagine that there are 200 “prisoners” playing this game, with each strategy adopted by half of the population. Which strategy should you adopt, in such a scenario?

The games look as follows:

  • AD vs AD: { DD, DD, DD, … }. After 10 rounds: A has 20 years, B has 20 years.
  • AD vs TfT: { DC, DD, DD, … }. After 10 rounds: A has 18 years, B has 21 years.
  • TfT vs TfT: { CC, CC, CC, … }. After 10 rounds: A has 10 years, B has 10 years.

These computations can be generalized to n rounds:

IPD- Always Defect vs TfT
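A minimal simulation sketch reproduces these totals (the payoffs are the jail terms from the matrix above, so lower is better; the function names are mine):

def years(move_a, move_b):
    # One round of the Prisoner's Dilemma. 'C' = stay silent, 'D' = betray.
    if move_a == 'C' and move_b == 'C': return (1, 1)
    if move_a == 'D' and move_b == 'D': return (2, 2)
    if move_a == 'D' and move_b == 'C': return (0, 3)
    return (3, 0)

def always_defect(my_history, their_history):
    return 'D'

def tit_for_tat(my_history, their_history):
    return their_history[-1] if their_history else 'C'

def play(strategy_a, strategy_b, rounds=10):
    hist_a, hist_b, total_a, total_b = [], [], 0, 0
    for _ in range(rounds):
        a = strategy_a(hist_a, hist_b)
        b = strategy_b(hist_b, hist_a)
        ya, yb = years(a, b)
        total_a += ya
        total_b += yb
        hist_a.append(a)
        hist_b.append(b)
    return total_a, total_b

print(play(always_defect, always_defect))   # (20, 20)
print(play(always_defect, tit_for_tat))     # (18, 21)
print(play(tit_for_tat, tit_for_tat))       # (10, 10)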

The Tit-for-Tat (TfT) strategy wins because TfT-TfT games are collaborative, and TfT players also aren’t effectively exploited by players who Always Defect (AD).

Which Kinds of Strategies Are Best?

There is a very large number of possible IPD strategies. Strategy design might include considerations such as:

  • Deterministic vs Mixed. Should we follow logical rules, or employ randomness?
  • Impersonal vs Personal. Do we remember the behavior of each opponent? Do we change strategies given what we know of other players?
  • Fixed vs Adaptive. Should we use our experiences to change the above on-the-fly?

Given this behavioral diversity, which kinds of strategy are most successful?

To answer this question, in 1980 Robert Axelrod conducted a famous experiment. He invited hundreds of scholars to enter an IPD tournament, submitting their agent’s decision algorithm digitally. In a computer simulation, every agent played every other agent 200 times. The agent with highest cumulative utility was declared the winner.

Many agent strategies were quite complex, using hundreds of lines of code. The surprising result was that simple strategies, including Tit-for-Tat, often proved to be superior. Axelrod described three properties shared among successful strategies:

IPD- Characteristics of Winning Strategy

We can call such strategies instances of reciprocal altruism.

Moral and Emotional Implications

The theory of evolution has shown us that biological systems are the product of an optimization process known as natural selection. Only genes that improve reproductive success win over evolutionary time.

From this context, it has long seemed unclear how human beings (and other animals) came to express altruistic behavior. W. D. Hamilton’s notion of inclusive fitness explains why we behave generously to relatives. As J. B. S. Haldane famously joked,

I would willingly die for two brothers or eight cousins.

Game theory explains our behavior towards non-relatives. Specifically,

IPD provides insight into moral cognition. It shows how our selfish genes might, purely for selfish reasons, come to promote behaviors that are (reciprocally) altruistic.

IPD similarly explains certain emotional processes. For example, I have posited elsewhere the existence of social intuition generators like Fairness. We can now explain why natural selection generated such “socially intelligent” mental modules.

Application: Vampire Bats

Instead of jail time, we can modify our outcome structure to be more relevant to biology.

IPD- Ecological Prisoner's Dilemma (1)

Thus, we can use game theory to interpret animals competing for resources. Consider, for example, the behavior of vampire bats.

Vampire bats feed on the blood of other mammals. Their energy budget is such that they can tolerate 2 days of food deprivation before starving to death.

On a given night, 8% of adult vampire bats will fail to find food. But when they do find food, it is often more than they need.

Of course, these animals have a genetic incentive to share blood within family. But you can also observe bats sharing their food with strangers.

How can selfish genes reward altruistic behavior? Because vampire bats are playing IPD:

  • CC (Reward). I get blood on my unlucky nights. I have to give blood on my lucky nights, which doesn’t cost me too much.
  • DC (Temptation). You save my life on my poor night. But I also don’t have to feed you on my good night.
  • CD (Sucker): I pay the cost of saving your life on my good night. But on my bad night I still may starve.
  • DD (Punishment) I don’t have to feed you on my good nights. But I run a real risk of starving on my poor nights.

Towards Evolutionary Game Theory

Can we show why altruistic bats are more successful? Yes: we need only invent evolutionary game theory (EGT). Recall how natural selection works:

Individuals with more biological fitness tend to leave more copies of their genes.

EGT simply adds this replicator logic to the Iterated Prisoner’s Dilemma (IPD). Players with higher final scores (most resources) leave more descendants in subsequent populations (image credit):

EGT

We saw previously that Tit-For-Tat players outperform those who Always Defect. In EGT, this fact establishes how a gene that promotes altruism successfully invaded the vampire bat gene pool:

IPD- EGT Stable Strategies (2)

Of course, iterated games don’t always have one winner. Consider the following food web (structurally similar to Rock-Paper-Scissors).

Snake beats Fox. Fox beats Hawk. Hawk beats Snake.

What if the size of the snake population starts out quite small? In that case, hawks and foxes predominate. Since hawks are prey to foxes, the size of the hawk population decreases. But this means the snakes have fewer natural predators, and the snake population recovers; the cycle then continues.

The above traces the implications of one possible starting point. However, we can use EGT maths to model the entire dynamical system, as follows (image credit):

IPD- Food Web Rock Paper Scissors (1)

With this image, we can see that any starting point will eventually (after many generations) lead to a (⅓, ⅓, ⅓) divide of snakes, foxes, and hawks. This point, the locus of the “whirlpool,” is also known as an attractor, or an evolutionarily stable state (ESS).
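We can sketch these dynamics numerically with the replicator equation. One assumption in the sketch below: the payoff matrix is a generalized rock-paper-scissors game in which wins pay slightly more than losses cost, which is what makes the interior point attracting (with a strictly zero-sum matrix, trajectories merely orbit it):

import numpy as np

# Payoff to the row strategy against the column strategy, ordered (snake, fox, hawk).
# Snake beats fox, fox beats hawk, hawk beats snake; win = +2, loss = -1, tie = 0.
A = np.array([[ 0.0,  2.0, -1.0],
              [-1.0,  0.0,  2.0],
              [ 2.0, -1.0,  0.0]])

x = np.array([0.10, 0.45, 0.45])        # snakes start out rare
dt = 0.01
for _ in range(200_000):
    fitness = A @ x                     # expected payoff of each strategy
    average = x @ fitness               # population-average payoff
    x += dt * x * (fitness - average)   # replicator equation, Euler step
    x /= x.sum()                        # guard against numerical drift off the simplex

print(x)   # approaches roughly [1/3, 1/3, 1/3], the evolutionarily stable state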

Takeaways

  • The Iterated Prisoner’s Dilemma (IPD) makes game theory more social, where many players compete for resources multiple times.
  • While one-off PD games favor selfish behavior, IPD can favor strategies that feature reciprocal altruism, such as Tit-for-Tat.
  • More generally, IPD strategies do best if they are nice, retaliating, and forgiving. This in turn explains how certain facets of our social and moral intuitions evolved.
  • Evolutionary Game Theory (EGT) extends IPD by adding replicator logic (more successful strategies are preferentially represented in future generations).
  • Evolutionarily Stable States (ESS) encode dynamical attractors, which populations asymptotically approach.

Until next time.

Singular Value Decomposition

Part Of: Algebra sequence
Followup To: Eigenvalues and Eigenvectors
Content Summary: 1300 words, 13 min read.

Limitations of Eigendecomposition

Last time, we learned how to locate the eigenvalues and eigenvectors of a given matrix. Diagonalization is the process whereby a square matrix is decomposed into factors: a matrix Q (with eigenvectors along the columns) and a matrix \Lambda (with eigenvalues along the diagonal).

A = Q \Lambda Q^T

But as we saw with the spectral theorem, eigendecomposition only works well against square, symmetric matrices. If a matrix isn’t symmetric, it is easy to run into complex eigenvalues. And if a matrix isn’t square, you are out of luck entirely!

Eigendecomposition- Spectral Theorem (1)

Can we generalize eigendecomposition to apply to a wider family of matrices? Can we diagonalize matrices of any size, even if they don’t have “nice” properties?

Yes. Self-transposition is the key insight of our “eigendecomposition 2.0”. We define the self-transpositions of A as AA^{T} and A^{T}A.

Suppose A \in \mathbb{R}^{m \times n}. Then AA^T \in \mathbb{R}^{m \times m} and A^TA \in \mathbb{R}^{n \times n}. So these matrices are square. But they are also symmetric!

To illustrate, consider the following.

A = \begin{bmatrix}  4 & 4 \\ -3 & 3 \\ \end{bmatrix}

Since A is not symmetric, we have no guarantee that its eigenvalues are real. Indeed, its eigenvalues turn out to be complex:

\det(A - \lambda I) = \begin{bmatrix} 4 - \lambda & 4 \\ -3 & 3 - \lambda \\ \end{bmatrix} = 0

(12 - 7 \lambda + \lambda^2) + 12 = 0 \Rightarrow \lambda^2 -7 \lambda + 24 = 0

\lambda = \frac{7 \pm \sqrt{(-7)^2 - 4*1*24}}{2*1} = \frac{7}{2} \pm \frac{\sqrt{47}i}{2}

Eigendecomposition on A sucks. Are the self-transposed matrices any better?

A^TA = \begin{bmatrix} 4 & -3 \\ 4 & 3 \\ \end{bmatrix} \begin{bmatrix} 4 & 4 \\ -3 & 3 \\ \end{bmatrix} = \begin{bmatrix} 25 & 7 \\ 7 & 25 \\ \end{bmatrix}

AA^T = \begin{bmatrix} 4 & 4 \\ -3 & 3 \\ \end{bmatrix} \begin{bmatrix} 4 & -3 \\ 4 & 3 \\ \end{bmatrix} = \begin{bmatrix} 32 & 0 \\ 0 & 18 \\ \end{bmatrix}

These matrices are symmetric! Thus, they are better candidates for eigendecomposition.

Towards Singular Value Decomposition

Singular Value Decomposition (SVD) is based on the principle that all matrices are eigendecomposable after self-transposition. It is essentially a bug fix:

SVD- Self-Transposition

An important way to picture SVD is with the idea of orthogonal bases. It is relatively easy to find any number of orthogonal bases for a given rowspace. Call the matrix of orthogonal vectors V.

We desire to find an orthogonal basis V such that AV produces an orthogonal basis in the column space. Orthogonal bases are not particularly hard to find. But most orthogonal bases, once mapped into the column space, will lose their orthogonality! We desire that particular orthogonal basis V whose image AV is also orthogonal; call the directions of that image basis U.

We won’t require the vectors of AV to be the same length as those in V. Instead, we will normalize both bases: the vectors of V and U will be orthonormal (orthogonal and of unit length). The length of each image vector Av_i is then captured by a scaling factor.

As we will soon see, these scaling factors are not eigenvalues, so we will use sigmas instead of lambdas:

  • Scaling factors \sigma , analogous to eigenvalues \lambda.
  • Diagonal matrix \Sigma , analogous to diagonal matrix \Lambda

Our full picture then, looks like this:

SVD- Orthogonal Basis Transfer

Let us now translate this image into matrix language.

A \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ v_1 & v_2 & \dots & v_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix} = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ \sigma_1u_1 & \sigma_2u_2 & \dots & \sigma_nu_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix}

But we can easily factorize the right-hand side:

\begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ \sigma_1u_1 & \sigma_2u_2 & \dots & \sigma_nu_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix} = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ u_1 & u_2 & \dots & u_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix} * \begin{bmatrix} \sigma_1 & 0 & \dots & 0 \\ 0 & \sigma_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_n \\ \end{bmatrix}

So we have that:

AV = U \Sigma

Since both V and U are orthogonal, inversion of either is equivalent to transposition:

A = U \Sigma V^{-1} = U \Sigma V^T

This strongly resembles our diagonalization equation A = Q \Lambda Q^T. SVD distinguishes itself by considering two orthogonal eigenmatrices U and V, not just one (Q).

Recalling that A = U \Sigma V^T ,

A^TA = (V \Sigma^T U^T) (U \Sigma V^T)

But now the innermost term cancels. Since \Sigma is a square diagonal matrix, its self-transposition is simply equal to \Sigma^{2}. So,

A^TA = V \Sigma^2 V^T

Since A^{T}A is a square, symmetric matrix, our diagonalization theorem applies!

A^TA = V \Sigma ^2 V^T = Q \Lambda Q^T

To find U, a similar trick works:

AA^T =  (U \Sigma V^T)(V \Sigma^T U^T) = U \Sigma^2 U^T = Q \Lambda Q^T

The relationships between SVD and eigendecomposition are as follows:

  • V is the eigenvectors of A^TA
  • U is the eigenvectors of AA^T
  • \Sigma is the square root of the eigenvalues matrix \Lambda

If any eigenvalue is negative, the corresponding sigma factor would be complex. But A^TA and AA^T are positive-semidefinite, which guarantees non-negative eigenvalues. This assures us that \Sigma contains only real values.

In contrast to eigendecomposition, every matrix has an SVD. Geometrically, V and U act as rotational transformations, and Sigma acts as a scaling transformation. In other words, every linear transformation comprises a rotation, then a scaling, then another rotation.

SVD- Rotation Scaling Rotation (1)

A Worked Example

Let’s revisit A. Recall that:

A = \begin{bmatrix} 4 & 4 \\ -3 & 3 \\ \end{bmatrix}, A^TA = \begin{bmatrix} 25 & 7 \\ 7 & 25 \\ \end{bmatrix}, AA^T = \begin{bmatrix} 32 & 0 \\ 0 & 18 \\ \end{bmatrix}

Eigendecomposition against A is unpleasant because A is not symmetric. But A^{T}A is guaranteed to be positive semi-definite; that is, to have non-negative eigenvalues. Let’s see this in action.

det(A^TA - \lambda I) = \begin{bmatrix} 25 - \lambda & 7 \\ 7 & 25 - \lambda \\ \end{bmatrix} = 0

\lambda^2 - 50 \lambda + 576 = 0 \Rightarrow (\lambda - 32)(\lambda - 18) = 0

\lambda_1 = 32, \lambda_2 = 18

trace(A^TA) = 50 = \sum{\lambda_i}

det(A^TA) = 625-49 = 576 = 18 * 32 = \prod{\lambda_i}

These are positive, real eigenvalues. Perfect! Let’s now derive the corresponding (normalized) eigenvectors.

A^TA - 32I = \begin{bmatrix} -7 & 7 \\ 7 & -7 \\ \end{bmatrix}, \quad A^TA - 18I = \begin{bmatrix} 7 & 7 \\ 7 & 7 \\ \end{bmatrix}

rref(A^TA - 32I) = \begin{bmatrix} 1 & -1 \\ 0 & 0 \\ \end{bmatrix}, \quad rref(A^TA - 18I) = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ \end{bmatrix}

v_1 = \begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \\ \end{bmatrix}, v_2 = \begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{-1}{\sqrt{2}} \\ \end{bmatrix}

SVD intends to decompose A into U \Sigma V^{T}. The above findings give us two of these ingredients.

V^{T} = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \end{bmatrix}

\Sigma =\begin{bmatrix} \sqrt{32} & 0 \\ 0 & \sqrt{18} \\ \end{bmatrix}

What’s missing? U! To find it, we perform eigendecomposition on AA^{T}. This is an especially easy task, because AA^{T} is already a diagonal matrix. One caveat: eigendecomposition only determines each u_i up to sign, so we must choose the signs such that Av_i = \sigma_i u_i. Here Av_1 = \sqrt{32}(1, 0) but Av_2 = \sqrt{18}(0, -1), so:

AA^T = \begin{bmatrix} 32 & 0 \\ 0 & 18 \\ \end{bmatrix}

U = \begin{bmatrix} u_1 & u_2 \\ \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & -1 \\ \end{bmatrix}

We have arrived at our first Singular Value Decomposition.

A = U \Sigma V^T = \begin{bmatrix} 1 & 0 \\ 0 & -1 \\ \end{bmatrix} \begin{bmatrix} \sqrt{32} & 0 \\ 0 & \sqrt{18} \\ \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & - \frac{1}{\sqrt{2}} \\ \end{bmatrix}

Okay, so let’s check our work. 😛

A = \begin{bmatrix} 1 & 0 \\ 0 & -1 \\ \end{bmatrix} \left( \begin{bmatrix} \sqrt{32} & 0 \\ 0 & \sqrt{18} \\ \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & - \frac{1}{\sqrt{2}} \\ \end{bmatrix} \right) = \begin{bmatrix} 1 & 0 \\ 0 & -1 \\ \end{bmatrix} \begin{bmatrix} 4 & 4 \\ 3 & -3 \\ \end{bmatrix} = \begin{bmatrix} 4 & 4 \\ -3 & 3 \\ \end{bmatrix}

These matrices are a viable factorization: multiplication successfully recovers A.
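We can double-check this factorization numerically (numpy may pick different sign conventions for U and V, but the singular values match and the product recovers A):

import numpy as np

A = np.array([[ 4.0, 4.0],
              [-3.0, 3.0]])

U, s, Vt = np.linalg.svd(A)
print(s)                                     # [5.6569 4.2426] = [sqrt(32), sqrt(18)]
print(np.allclose(U @ np.diag(s) @ Vt, A))   # True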

Takeaways

  • Eigendecomposition only works for a subclass of matrices; SVD decomposes all matrices.
  • SVD relies on self-transposition to convert any arbitrary matrix into one that works well against eigendecomposition (guarantees square m = n and symmetric A = A^{T}).
  • Another way to interpret SVD is by taking a special kind of orthogonal basis that, once passed through the linear transformation, preserves its orthogonality.
  • Every matrix A = U \Sigma V^{T}. That is, every linear transformation can be conceived as rotation + scaling + rotation.

Until next time.

Eigenvalues and Eigenvectors

Part Of: Algebra sequence
Followup To: An Introduction to Linear Algebra
Next Up: Singular Value Decomposition
Content Summary: 1300 words, 13 min read

Geometries of Eigenvectors

Matrices are functions that act on vectors, by mapping from row-vectors to column-vectors.  Consider two examples:

  1. Reflection matrices, which reflect vectors across some basis.
  2. Rotation matrices, which rotate vectors counterclockwise by \theta degrees.

eigenvectors-transformation-geometries-1

The set of eigenvectors of a matrix A is a special set of input vectors for which the matrix behaves as a scaling transformation. In other words, we desire the set of vectors \vec{x} whose output vectors A\vec{x} differ by a scaling factor.

Eigenvectors have a straightforward geometric interpretation:

  1. Reflection eigenvectors are orthogonal or parallel to the reflecting surface. In the left image above, that is the top two pairs of vectors.
  2. Rotation eigenvectors do not exist (more formally, cannot be visualized in \mathbb{R}^2).

Algebra of Eigenvectors

We can express our “parallel output” property as:

A\vec{x} = \lambda \vec{x}

Thus \vec{x} and A\vec{x} point in the same direction, but differ by scaling factor \lambda.

Scaling factor \lambda is the eigenvalue. There can be many \left( x, \lambda \right) pairs that satisfy the above equality.

For an \mathbb{R}^{n x n} matrix, there are n eigenvalues. These eigenvalues can be difficult to find. However, two facts aid our search:

  • The sum of eigenvalues equals the trace (sum of values along the diagonal).
  • The product of eigenvalues equals the determinant.

To solve, subtract \lambda \vec{x} from both sides:

A\vec{x} = \lambda \vec{x}

(A - \lambda I)\vec{x} = 0

We would like to identify n unique eigenvectors. But if the new matrix (A - \lambda I) has only the trivial nullspace, it admits no nonzero eigenvectors. So we desire this new matrix to be singular.

How to accomplish this?  By finding eigenvalues that satisfy the characteristic equation \det(A - \lambda I) = 0. Matrices are singular iff their determinants equal zero.

Let’s work through an example! What is the eigendecompositon for matrix A:

A = \begin{bmatrix} 3 & 1 \\ 1 & 3 \\ \end{bmatrix}

We need to find eigenvalues that solve the characteristic equation.

\det(A - \lambda I) = \begin{vmatrix} 3-\lambda & 1 \\ 1 & 3-\lambda \\ \end{vmatrix} = 0

(3 - \lambda)^2 - 1^2 = \lambda^2 - 6\lambda + 8 = (\lambda-2)(\lambda-4) = 0

\lambda_1 = 2, \lambda_2 = 4

Are these eigenvalues correct? Let’s check our work:

trace(A) = 6 = \sum{\lambda_i}

det(A) = 8 = \prod{\lambda_i}

How to find our eigenvectors? By solving the nullspace given each eigenvalue.

For \lambda_1=2 :

A - 2I = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ \end{bmatrix} \Rightarrow rref(A - 2I) = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ \end{bmatrix}

(\lambda_1, \vec{x}_1) = (2, \begin{bmatrix} 1 \\ -1 \\ \end{bmatrix})

For \lambda_2=4 :

A - 4I = \begin{bmatrix} -1 & 1 \\ 1 & -1 \\ \end{bmatrix} \Rightarrow rref(A - 4I) = \begin{bmatrix} 1 & -1 \\ 0 & 0 \\ \end{bmatrix}

(\lambda_2, \vec{x}_2) = (4, \begin{bmatrix} 1 \\ 1 \\ \end{bmatrix})
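A quick numerical check of this example (numpy returns unit-length eigenvectors, so they may differ from ours by a scale factor or sign):

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

vals, vecs = np.linalg.eig(A)
print(vals)   # [4. 2.] (order may vary)
print(vecs)   # columns are eigenvectors, proportional to (1, 1) and (1, -1)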

Desirable Matrix Properties

The above example was fairly straightforward. But eigendecomposition can “go awry”, as we shall see. Consider a rotation matrix, which in two dimensions has the following form:

R = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \\ \end{bmatrix}

What are the eigenvalues for rotation \theta = 90^{\circ} ?

R = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ \end{bmatrix}

\det(R - \lambda I) = \begin{bmatrix} - \lambda & -1 \\ 1 & - \lambda \\ \end{bmatrix} = 0

(- \lambda)^2 - (-1)(1) = 0 \Rightarrow \lambda^2 + 1 = 0 \Rightarrow \lambda^2 = -1

\lambda_1 = i, \lambda_2 = -i

We can check our work:

trace(R) = 0 = \sum{\lambda_i}

det(R) = 1 = \prod{\lambda_i}

We saw earlier that rotation eigenvectors have no geometric interpretation in \mathbb{R}^2. Here, we have algebraically shown why: the eigenvalues are complex.

A = \left[ \begin{smallmatrix} 3 & 1 \\ 1 & 3 \\ \end{smallmatrix} \right] has real eigenvalues, but R = \left[ \begin{smallmatrix} 0 & -1 \\ 1 & 0 \\ \end{smallmatrix} \right] has less-desirable complex eigenvalues.

We can generalize the distinction between A and R as follows:

Spectral Theorem. Any matrix that is symmetric (A = A^T) is guaranteed to have real eigenvalues. The corresponding n eigenvectors can be chosen to be orthogonal.

In other words, eigendecomposition works best against symmetric matrices.

Eigendecomposition- Spectral Theorem (1)

Diagonalization

Let us place each eigenvector in the column of a matrix S. What happens when you multiply the original matrix A by this new matrix? Since S contains eigenvectors, multiplication by A reduces to multiplication by the associated eigenvalues:

AS = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ \lambda_1x_1 & \lambda_2x_2 & \dots & \lambda_nx_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix}

We see the product contains a mixture of eigenvalues and eigenvectors. We can separate these by “pulling out” the eigenvalues into a diagonal matrix. Call this matrix \Lambda (“capital lambda”).

AS = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ x_1 & x_2 & \dots & x_n \\ \vdots & \vdots & \vdots & \vdots \\ \end{bmatrix} * \begin{bmatrix} \lambda_1 & 0 & \dots & 0 \\ 0 & \lambda_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \lambda_n \\ \end{bmatrix} = S \Lambda

Most matrices have the property that their eigenvectors are linearly independent. For such matrices, S is invertible. Given this fact, we can solve for A:

\Lambda = S^{-1} A S

A = S \Lambda S^{-1}

Matrices that can be factorized in this way are said to be diagonalizable. We can see that both elimination and eigendecomposition are performing the same type of work: factorizing matrices into their component parts.

If A is symmetric, then we know \Lambda is real, and its eigenvectors in S can be chosen orthonormal. Let us rename S to be Q, to reflect this additional property. Orthogonal matrices have the property that transposition equals inversion: Q^T = Q^{-1}. Thus, if A is symmetric, we can simplify the diagonalization formula to:

A = Q \Lambda Q^{-1} = Q \Lambda Q^T

Asymptotic Interpretations

This diagonalization approach illustrates an important use case of eigenvectors: power matrices. What happens when A is applied arbitrarily many times? What does the output look like in the limit?

We can use the diagonalization equation to represent A^k:

A^k = (Q \Lambda Q^{-1})^k = (Q \Lambda Q^{-1})(Q \Lambda Q^{-1})\dots(Q \Lambda Q^{-1})

We can simplify by canceling the inner terms Q^{-1}Q:

A^k = Q \Lambda^k Q^{-1}

This equation tells us that the eigenvectors are invariant to how many times A is applied. In contrast, the eigenvalue matrix \Lambda has important implications for ongoing processes:

  • If each eigenvalue has magnitude less than one, the output will trend towards zero.
  • If each eigenvalue has magnitude greater than one, the output will trend to infinity.

Fibonacci Eigenvalues

The powers interpretation of eigenvalues sheds light on the behavior of all linear processes. This includes number sequences such as the Fibonacci numbers, where each number is the sum of the previous two numbers.

Recall the Fibonacci numbers are 0,1,1,2,3,5,8,13,... What is F_{100} ?

Eigenvalues can answer this question. We must first express the Fibonacci generator as a linear equation:

F(k+2) = 1F(k+1) + 1F(k)

In order to translate this into a meaningful matrix, we must add a “redundant” equation

F(k+1) = 1F(k+1) + 0F(k)

With these equations, we can create a 2×2 Fibonacci matrix F.

F = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ \end{bmatrix}

This matrix uniquely generates Fibonacci numbers.

u_1 = Fu_0 = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ \end{bmatrix}

u_4 = F^4u_0 = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ \end{bmatrix}^4 \begin{bmatrix} 1 \\ 0 \\ \end{bmatrix} = \begin{bmatrix} 5 \\ 3 \\ \end{bmatrix}
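A small sketch (plain Python integers, so there is no overflow) confirms that repeatedly applying F generates Fibonacci numbers, and answers the F_{100} question posed above:

def apply_F(v):
    # One application of F = [[1, 1], [1, 0]] to the vector v = (F(k+1), F(k)).
    return (v[0] + v[1], v[0])

u = (1, 0)                 # u0 = (F_1, F_0)
for _ in range(4):
    u = apply_F(u)
print(u)                   # (5, 3), matching u_4 above

u = (1, 0)
for _ in range(100):
    u = apply_F(u)
print(u[1])                # F_100 = 354224848179261915075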

To discover the rate at which Fibonacci numbers grow, we decompose F into its eigenvalues:

\det(F - \lambda I) = \begin{vmatrix} 1 - \lambda & 1 \\ 1 & 1- \lambda \\ \end{vmatrix} = 0

\lambda^2 - \lambda - 1 = 0

 \lambda_1 = \frac{1 + \sqrt{5}}{2}, \lambda_2 = \frac{1 - \sqrt{5}}{2}

 \lambda_1 = 1.61803, \lambda_2 = -0.61803

trace(F) = 1 = \sum{\lambda_i}

det(F) = -1 = \prod{\lambda_i}

We can go on to discover eigenvectors x_1 and x_2, and express the initial vector u_0 as a combination u_0 = c_1 x_1 + c_2 x_2. Applying F k times then gives

F^k u_0 = c_1 \lambda_1^k x_1 + c_2 \lambda_2^k x_2

As k goes to infinity, the second term goes to zero (since |\lambda_2| < 1). Thus, growth is dominated by the larger eigenvalue, 1.61803.

Mathematicians in the audience will recognize this number as the golden ratio.

eigenvectors-fibonacci-and-golden-ratio

We have long known that the ratio of successive Fibonacci numbers converges to 1.61803. Eigenvalues provide a mechanism to derive this value analytically.

Until next time.

An Introduction to Linear Algebra

Part Of: Algebra sequence
Content Summary: 1300 words, 13 min read.

Linear algebra is the math of vectors and matrices. Let me attempt to explain it as succinctly as possible.

Vector Operations

If n is a positive integer and ℝ is the set of real numbers, then ℝn is the set of all n-tuples of real numbers.  A vector v ∈ ℝn is one such n-tuple. For example,

V = (v_1, v_2, v_3) \in (\mathbb{R}, \mathbb{R}, \mathbb{R}) \equiv \mathbb{R}^3

The six vector operations are: addition, subtraction, scaling, cross product, dot product, norm (vector length). For vectors u and v, these are defined as follows:

linear-algebra-vector-operations-1

The dot product and the cross product norm are related to the angle between two vectors. Specifically,

u \cdotp v = \|u\| \|v\| cos(\theta)

\|u \times v\| = \|u\| \|v\| sin(\theta)

Finally, note that the cross product is not commutative: u × v ≠ v × u.
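A quick numerical sketch of these identities (using numpy; the two vectors below are arbitrary examples):

import numpy as np

u = np.array([1.0, 0.0, 0.0])
v = np.array([1.0, 1.0, 0.0])

dot = u @ v
cross = np.cross(u, v)
theta = np.arccos(dot / (np.linalg.norm(u) * np.linalg.norm(v)))

print(np.degrees(theta))                 # 45.0 degrees between u and v
print(np.isclose(np.linalg.norm(cross),
                 np.linalg.norm(u) * np.linalg.norm(v) * np.sin(theta)))  # True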

Matrix Operations

A matrix A ∈ ℝm x n is a rectangular array of real numbers with m rows and n columns. For example,

A= \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix} \in \begin{bmatrix} \mathbb{R} & \mathbb{R} \\ \mathbb{R} & \mathbb{R} \\ \mathbb{R} & \mathbb{R} \\ \end{bmatrix} \equiv \mathbb{R}^{3x2}

Important matrix operations are addition, subtraction, product, transpose, and determinant:

linear-algebra-matrix-operations-2

Of the above operations, matrix product is especially significant. This operation is only possible when A has the same number of columns as B has rows.

Matrix as Morphism

Importantly, matrices with one row or one column can be interpreted as vectors. Matrix multiplication with a vector can be interpreted in two ways: the row picture interprets the output as a set of dot products; the column picture interprets the output as a linear combination of the columns.

linear-algebra-row-vs-column-picture-4

Let’s take A to be a 2 by 2 matrix with row-vectors (2, -1) and (-1,2), and let b = (0,3). When we solve for x, we find there are two ways to do so:

linear-algebra-row-vs-column-geometry-1

In the above example, we set b = (0,3). But this linear combination could generate any vector b. In general, linear combinations can generate the entirety of a space (in this case, a two-dimensional plane). 
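We can check the worked example numerically (a sketch using numpy; solving the row picture and reassembling b in the column picture give the same x):

import numpy as np

A = np.array([[ 2.0, -1.0],
              [-1.0,  2.0]])   # row vectors (2, -1) and (-1, 2)
b = np.array([0.0, 3.0])

x = np.linalg.solve(A, b)
print(x)                                  # [1. 2.]

# Column picture: b is the linear combination of A's columns weighted by x.
print(x[0] * A[:, 0] + x[1] * A[:, 1])    # [0. 3.]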

In both views, we say A transforms x → b. If vectors are groups of data, matrices are functions that operate on vectors. Specifically, we call multiplication by a matrix A ∈ ℝ^(m×n) a linear transformation T_A : ℝ^n → ℝ^m, which converts an n-tuple into an m-tuple.

The symmetry between functions and linear transformations runs deep. When we explore category theory, we will discover the structure underpinning this relationship. But in the meantime, consider the following equivalences:

linear-algebra-function-vs-linear-transformation-4

Especially important to matrix functions is the notion of the inverse matrix. Just as f^-1(x) = ln(x) undoes the effects of f(x) = e^x, the inverse matrix A^-1 undoes the effects of A. The cumulative effect of applying A^-1 after A is the identity matrix: A^-1 A = 1. The identity matrix is analogous to multiplying by one, or adding zero: it is algebraically inert.

Only square matrices can be invertible, because reversal is commutative: A A^-1 = A^-1 A = 1. But not all square matrices have inverses. The determinant of a matrix checks for invertibility: det(A) ≠ 0 iff A is invertible.

Matrix inversion is vitally important to solving systems of linear equations.

  • AX = B → multiply both sides by A^-1 on the left → X = A^-1 B
  • XA = B → multiply both sides by A^-1 on the right → X = B A^-1
  • ABXC = E → put C^-1 on the right and B^-1 A^-1 on the left → X = B^-1 A^-1 E C^-1

Elimination As Inverse Solver

To solve Ax = b, we must find the inverse matrix A^-1. This can be done via Gauss-Jordan Elimination. This method affords three row operations:

  1. Linear Combination: Ri += kRj
  2. Scaling: Ri *= k
  3. Row Swap: Ri ⇆ Rj

After creating a matrix of the form Ax = b, we can solve for x by creating an augmented matrix of the form [ A | b ], and converting the left-hand side into the identity matrix:

linear-algebra-gauss-jordan-elimination-2

To solve for x algebraically, Ax = b → A^-1 A x = A^-1 b → 𝟙x = A^-1 b. So Gauss-Jordan facilitates the discovery of the inverse matrix A^-1. We can show this computation explicitly, by setting up an augmented matrix of the form [ A | 1 ]

linear-algebra-gauss-jordan-inverse-matrix-6

Row operations are functions that act on vectors. So are matrices. Thus, it isn’t very surprising that our three row operations each correspond to an elementary matrix. The elementary matrix is similar to the identity matrix; in the below picture, only the shaded region diverges. 

linear-algebra-elementary-matrices-7

We now see that the row operations of Gauss-Jordan elimination are a kind of factoring: we are finding the elementary matrices underlying A^-1:

linear-algebra-gauss-jordan-elementary-matrices-2

Thus E_4 E_3 E_2 E_1 = A^-1, and Ax = b → (E_4 E_3 E_2 E_1)Ax = (E_4 E_3 E_2 E_1)b → x = A^-1 b

Fundamental Vector Spaces

A vector space consists of a set of vectors and all linear combinations of these vectors. The set of all linear combinations is called the span. For example, the vector space S = span{ v1, v2 } consists of all vectors of the form v = ɑv1 + βv2, where ɑ and β are real numbers.

Vector spaces can be characterized by a basis: the original set of vectors whose linear combinations create the vector space. The vectors of any basis must be linearly independent (they need not be orthogonal, though orthogonal bases are especially convenient). You can check whether two vectors are orthogonal by confirming that their dot product u · v = 0. The dimension of a vector space is the cardinality of a basis (the number of basis vectors).

Recall that matrices are functions that project vectors from ℝn to ℝm. This corresponds to projecting from row space to column space, as follows:

linear-algebra-fundamental-space-interpretation-6

We will unpack the full implications of this graphic another time. For now, it suffices to understand that discovering the bases of these subspaces is a useful exercise.

For example, if we show that a matrix has a nontrivial kernel, this in itself is proof of non-invertibility. Why? Because if T_A sends a nonzero vector to the zero vector, there is no T_A^-1 that can undo this operation.

We can say more. For an n x n matrix A, the following statements are equivalent:

  1. A is invertible
  2. The determinant of A is nonzero.
  3. The RREF of A is the n x n identity matrix
  4. The rank of the matrix is n
  5. The row space of A is ℝn
  6. The column space of A is ℝn
  7. A doesn’t have a null space (only the zero vector N(A) = {0} )

Elimination as Basis Identifier

We have seen elimination discover (and factorize) the matrix inverse. But what happens when an inverse does not exist? Well, elimination has a natural, unique stopping point:

linear-algebra-rref-derivation-2

Matrices in reduced row echelon form (RREF) have the following properties:

  1. Rows with all zeroes are moved to the bottom.
  2. The first nonzero number from the left (pivot) is to the right of the pivot above it.
  3. Every pivot equals 1 and is the only nonzero entry in its column

Gauss-Jordan Elimination can replace every matrix with an equivalent one in rref form. Note that we have seen several examples where elimination creates the identity matrix 1 when A is invertible. This is only possible because 1 fits the rref criteria.

Once a matrix is in rref form, it becomes trivial to discover the basis for its three fundamental spaces:

  • R(A) basis: all row vectors that are non-zero
  • C(A) basis: the columns of the original matrix corresponding to pivot columns.
  • N(A) basis: the solution to Ax = 0

linear-algebra-deriving-fundamental-spaces-2
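Tools like sympy automate this recipe. A minimal sketch (assuming sympy is installed; the matrix below is a made-up example, not the one in the figure above):

from sympy import Matrix

A = Matrix([[1, 2, 0, 1],
            [2, 4, 1, 3],
            [3, 6, 1, 4]])

rref_A, pivot_cols = A.rref()
print(rref_A)            # reduced row echelon form
print(pivot_cols)        # pivot column indices -> take these columns of A as a C(A) basis
print(A.nullspace())     # basis vectors of N(A), the solutions to Ax = 0
print(len(pivot_cols) + len(A.nullspace()) == A.cols)   # rank + nullity = number of columns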

We see that rank(A) + nullity(A) = 3 + 1 = 4, which indeed is the number of columns. Also, the column space of A equals the row space of A^T. This is not an accident, since the transpose simply swaps rows with columns.

Takeaways

  • Vectors are groups of numbers; matrices are functions that act on vectors.
  • We can solve linear equations by conducting Gauss-Jordan elimination.
  • Gauss-Jordan elimination-based solutions rely on searching for the inverse matrix A^-1.
  • The domain of matrices is its row vectors, its codomain is its column vectors.
  • Even if a matrix is not invertible, elimination can find its most reduced form (RREF).
  • RREF matrices can be used to derive fundamental subspaces.

Until next time.

Related Works