Markov Decision Processes

Part Of: Reinforcement Learning sequence
Followup To: An Introduction To Markov Chains
Content Summary: 900 words, 9 min read

Motivations

Today, we turn our gaze to Markov Decision Processes (MDPs), a decision-making environment which supports our propensity to learn from good and bad outcomes. We represent outcome desirability with a single number, R. This value is used to refine action selection: given a particular situation, what action will maximize expected reward?

In biology, the primary work performed by an organism can be described as maintaining homeostasis: keeping metabolic energy reserves, body temperature, etc. within acceptable bounds in a widely varying world.

Cybernetics provides a clear way of conceptualizing biological reward. In Neuroendocrine Integration, we discussed how brains must respond to both internal and external changes. This dichotomy expresses itself as two perception-action loops: a visceral, body-oriented loop and a cognitive, world-centered one.

Rewards are computed by the visceral loop. To a first approximation, rewards encode progress towards homeostasis. Food is perceived as more rewarding when the body is hungry, a phenomenon known as alliesthesia. Reward information is delivered to the cognitive loop, where it helps refine decision making.

[Figure: Reward as visceral efferent]

Extending Markov Chains

Recall that a Markov Chain contains a set of states S and a transition model P. A Markov Decision Process (MDP) extends this device by adding three new elements.

Specifically, an MDP is a 5-tuple (S, P, A, R, γ):

  • A set of states s ∈ S
  • A transition model P_a(s′ | s)
  • A set of actions a ∈ A
  • A reward function R(s, s′)
  • A discount factor γ

To illustrate, consider GridWorld. Every location in this two-dimensional grid is a state, for example (1,0). State (3,0) is a desirable location, R((3,0)) = +1.0, but state (3,1) is undesirable, R((3,1)) = -1.0. All other states are neutral.

Gridworld supports four actions, or movements: up, down, left, and right. However, locomotion is imperfect: if Up is selected, the agent will move up only with 80% probability; 10% of the time it will veer left instead, and 10% of the time right. Finally, attempting to move into a forbidden square simply returns the agent to its original location (“hitting the wall”).
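These dynamics can be sketched in a few lines of Python. The grid dimensions, helper names, and edge-bouncing rule below are my own assumptions based on the description above:

```python
# Minimal GridWorld transition model: 80% intended move, 10% each perpendicular.
# Blocked moves ("hitting the wall") leave the agent in place.
WIDTH, HEIGHT = 4, 3
STATES = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)]
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def step(state, move):
    """Deterministic move, bouncing off the grid edges."""
    x, y = state
    dx, dy = MOVES[move]
    nx, ny = x + dx, y + dy
    return (nx, ny) if (nx, ny) in STATES else state  # hit the wall

def transition_probs(state, action):
    """P_a(s' | s): 0.8 for the intended direction, 0.1 each perpendicular."""
    probs = {}
    for move, p in [(action, 0.8)] + [(m, 0.1) for m in PERPENDICULAR[action]]:
        s2 = step(state, move)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs
```

Note how probabilities merge when two moves land in the same square: selecting Up in the corner (0,0) sends the perpendicular "left" slip back into (0,0) itself.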

[Figure: Example MDP Gridworld]

The core problem of MDPs is to find a policy (π), a function that specifies the agent’s response to all possible states. In general, policies should strive to maximize reward, e.g., something like this:

[Figure: Example MDP policy]

Why is the policy at (2,2) Left instead of Up? Because (2,1) is dangerous: despite selecting Up, there is a 10% chance that the agent will accidentally move Right, and be punished.

Let’s now consider an environment with only three states: A, B, and C. First, notice how different policies change the resultant Markov Chain:

[Figure: Policy vs. resultant Markov chain]

This observation is important. Policy determines the transition model.
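A minimal sketch of this collapse: fixing a deterministic policy π selects one row of the action-conditioned transition model at every state, leaving a plain Markov chain. The action names and transition numbers here are my own illustrative choices for the three-state example:

```python
# P[s][a][s']: action-conditioned transition model (illustrative numbers).
P = {
    "A": {"stay": {"A": 1.0}, "go": {"A": 0.2, "B": 0.8}},
    "B": {"stay": {"B": 1.0}, "go": {"B": 0.2, "C": 0.8}},
    "C": {"stay": {"C": 1.0}, "go": {"A": 1.0}},
}

def induced_chain(P, policy):
    """Markov chain transition model induced by a deterministic policy."""
    return {s: P[s][policy[s]] for s in P}

chain = induced_chain(P, {"A": "go", "B": "go", "C": "stay"})
# chain["A"] is now simply {"A": 0.2, "B": 0.8}: the action is baked in.
```

A different policy yields a different chain: mapping every state to "stay" produces a chain where nothing ever moves.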

Towards Policy Valuation V(s)

An agent seeks to maximize reward. But what does that mean, exactly?

Imagine an agent selects π₁. Given the resultant Markov Chain, we already know how to use matrix multiplication to predict future state distributions S_t. The predicted reward P_t is simply the dot product of the expected state distribution and the reward function.

P_t = S_t \cdot R

[Figure: Expected value via linear algebra]
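Here is that computation in NumPy. The transition matrix is my reconstruction of the A/B/C chain from the P_k values appearing later in this section; treat it as an assumption:

```python
import numpy as np

# The state distribution S_t evolves by the chain's transition matrix T;
# the predicted reward P_t is the dot product S_t . R.
T = np.array([[0.2, 0.8, 0.0],   # A -> {A: 0.2, B: 0.8}
              [0.0, 0.2, 0.8],   # B -> {B: 0.2, C: 0.8}
              [0.0, 0.0, 1.0]])  # C is absorbing
R = np.array([0.0, 0.0, 1.0])    # only C is rewarding

S = np.array([1.0, 0.0, 0.0])    # start in A at t=0
predicted = []
for t in range(4):
    predicted.append(float(S @ R))  # P_t = S_t . R
    S = S @ T
# predicted is approximately [0.0, 0.0, 0.64, 0.896]
```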

We might be tempted to define the value function V(s) as the sum of all predicted future rewards:

V_0(s) = P_0 + P_1 + P_2 + P_3 + \dots = \sum_k{P_k}

However, this approach is flawed.  Animals value temporal proximity: all else equal, we prefer to obtain rewards quickly. This is temporal discounting: as rewards are further removed from the present, their value is discounted. 

In reinforcement learning, we implement temporal discounting with the gamma parameter: rewards that are k timesteps away are multiplied by the exponential discount factor \gamma^k. The value function becomes:

V_0(s) = P_0 + \gamma P_1 + \gamma^2 P_2 + \gamma^3 P_3 + \dots = \sum_k{\gamma^k P_k}
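A sketch of the discounted sum, using an illustrative γ = 0.9 and the P_k predictions from the three-state example in this section:

```python
# Discounted value: a reward k steps ahead is weighted by gamma**k.
gamma = 0.9                      # illustrative discount factor
P = [0.0, 0.0, 0.64, 0.896]      # predicted rewards P_0 .. P_3

value = sum(gamma**k * p_k for k, p_k in enumerate(P))
# equals 0.64*gamma**2 + 0.896*gamma**3
```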

Without temporal discounting, V(s) can approach infinity, but exponential discounting ensures V(s) remains finite. Finite valuations promote easier computation and comparison of state valuations. For more on temporal discounting, and an alternative to the RL approach, see An Introduction to Hyperbolic Discounting.

Intertemporal Consistency

In our example, at time zero our agent starts in state A. We have already used linear algebra to compute our P_k predictions. To calculate value, we simply compute \sum_k{\gamma^k P_k}:

V_0(A) = 0 + 0 + 0.64 \gamma^2 + 0.896 \gamma^3

Agents compute V(s) at every time step. At t=1, two valuations are relevant:

V_1(A) = 0 + 0 + 0.64 \gamma^2 + \dots

V_1(B) = 0 + 0.8 \gamma + 0.96 \gamma^2 + \dots

[Figure: Value function time-slice]

What is the relationship between the value functions at t=0 and t=1? To answer this, we need to multiply each term by \gamma P(X|A), where X is the state being considered at the next time step.

W_1(A) \triangleq \gamma P(A|A)V_1(A) = 0.2 \gamma V_1(A)

W_1(A) = 0 + 0 + (0.2)(0.64)\gamma^3 + \dots

Similarly,

W_1(B) \triangleq \gamma P(B|A)V_1(B) = \gamma 0.8 V_1(B)

W_1(B) = 0 + (0.8)(0.8) \gamma^2 + (0.8)(0.96) \gamma^3 + \dots

Critically, consider the sum X_0 = r_0(A) + W_1(A) + W_1(B):

X_0 = 0 + 0 + 0.64 \gamma^2 + 0.896 \gamma^3 + \dots

[Figure: Intertemporal consistency]

Does X_0 look familiar? That’s because it equals V_0(A)! We thus have a way of relating a valuation at t=0 to valuations at t=1. This property is known as intertemporal consistency.
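The identity is easy to check numerically. A sketch, using an illustrative γ = 0.9 and the truncated series from the text:

```python
# Check: V_0(A) = r_0(A) + gamma * (P(A|A)*V_1(A) + P(B|A)*V_1(B)),
# with the series truncated as in the text above.
gamma = 0.9                                  # illustrative discount factor

V0_A = 0.64 * gamma**2 + 0.896 * gamma**3    # V_0(A)
V1_A = 0.64 * gamma**2                       # V_1(A)
V1_B = 0.8 * gamma + 0.96 * gamma**2         # V_1(B)

X0 = 0.0 + gamma * (0.2 * V1_A + 0.8 * V1_B)  # r_0(A) is zero
# X0 matches V0_A: the valuation at t=0 decomposes into t=1 valuations.
```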

Bellman Equation

We have seen that V_0(A) = X_0. Let’s flesh out this equation, and generalize to time t.

V_t(s) = r_t(s) + \gamma \sum_{s'}{P(s'|s)V_{t+1}(s')}

This is the Bellman Equation, and it is a central fixture in control systems. At its heart, we define value in terms of both immediate reward and future predicted value. We thereby break up a complex problem into small subproblems, a key optimization technique that can be approached with dynamic programming.
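One standard way to exploit this recursion is iterative policy evaluation: apply the Bellman equation as an update rule until the values stabilize. A sketch, using my illustrative A/B/C chain, rewards, and γ (none of which are specified in the text):

```python
# Iterative policy evaluation: sweep the Bellman update until convergence.
chain = {"A": {"A": 0.2, "B": 0.8},
         "B": {"B": 0.2, "C": 0.8},
         "C": {"C": 1.0}}
r = {"A": 0.0, "B": 0.0, "C": 1.0}   # reward for occupying each state
gamma = 0.9

V = {s: 0.0 for s in chain}
for _ in range(1000):
    V_new = {s: r[s] + gamma * sum(p * V[s2] for s2, p in chain[s].items())
             for s in chain}
    delta = max(abs(V_new[s] - V[s]) for s in chain)
    V = V_new
    if delta < 1e-10:
        break
# The absorbing, rewarding state C converges to r(C) / (1 - gamma) = 10.
```

Because the Bellman update is a contraction (for γ < 1), this iteration converges regardless of the initial guess for V.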

Next time, we will explore how reinforcement learning uses the Bellman Equation to learn strategies with which to engage its environment (the optimal policy π). See you then!

Policy Proposal: Metrication

Table Of Contents

  • Back To Basics
  • Meet The English System
  • A Cognition-Friendly Design
  • Global Trends
  • Policy Proposal
  • What Use Are Policy Proposals?
  • Bonus Proposal!

Hm. So, I enjoy discussing this topic. Maybe if I write about it, my Will To Rant will weaken! (Family & friends will be thanking me in no time. 😉 )

Back To Basics

Do you remember how long one meter is? Extend your arms to approximate its length. Now say “meter” about eighteen times, until you achieve semantic satiation. Okay good, I’ve confused you. Your familiarity high was stunting your ability to learn.

Why must a meter be that long? What forbids it from being defined differently?

Nothing. All measurement conventions are arbitrary. Thus, it is possible for every person to use different measurement rules.

But that isn’t how society operates. Why? How do we explain measurement convergence?

Standardized measurement is a cultural technology: it moves attention away from the communicative vehicle and toward its content.

Does the above remind you of anything? It should. If I swap out the nouns, I’d be talking about language. The analogy strength is considerable. (Have you yet figured out the mechanism that underwrites analogy strength?)

The funny thing about language is that globalization is murdering it. Of the 6500 languages alive today, fewer than half will survive to 2100 CE. If you combine this fact with our analogy, you are mentally equipped to forge a prediction:

  • We expect the number of measurement systems to be decreasing.

Meet The English System

In fact, only two comprehensive measurement systems remain. Here is a snapshot of one of them, the English system:

[Figure: The English system]

Chances are that you live in the US, and chances are you’ve wrestled with the question “how many ounces in a quart?” at least once in your life.

Let’s be explicit about why we don’t like the above:

  • There is no discernible pattern between the equivalency values (e.g., 2, 1760, 2240, 43,560…) or words (e.g., “cup”, “pint”, “quart”, “gallon”)

Do you agree? Is this the reason why you winced at the above table?

Even if we agree, we aren’t done. We still need to explain where our complaint comes from. And that explanation is, of course, cognitive:

  • Patterns facilitate memorization, improving performance of long-term memory.
  • Patterns allow for compression, reducing the load on working memory.

A Cognition-Friendly Design

If you were to design a solution to the above problems from scratch, how would you do it?

I doubt I would have been able to invent this technology independently: it is intimidatingly brilliant. Time to meet the quantitative prefix. The basic idea is: why don’t we link equivalency values to the grammar, and infuse mathematical meaning into our prefixes?

The metric prefix is a kind of quantitative prefix. It encodes scale in increments of 10^3 (i.e., factors of 1000), as follows:

[Figure: Table of metric prefixes]

You can allow your sense of familiarity back in the room. You have, of course, used quantitative prefixes all your life. Do you recognize the words “milli-meter”, “kilo-gram”, “giga-byte”? Well, now you have another tool in your belt: you can precisely understand words you’ve become accustomed to, and rapidly absorb the meaning of new combinations. Two examples:

  1. If someone were to ask you “what does a micro-gram mean?” you could answer “a millionth of a gram!”
  2. If someone were to ask you “how many bytes in 4 gigabytes?” you could answer “4,000,000,000”! *

(* Unless the person who said gigabyte ACTUALLY meant 4 gibibytes, which is NOT the same thing, and a totally separate rant. 🙂 )
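The arithmetic these prefixes encode is trivial to mechanize. A toy sketch (the prefix table is abbreviated, and the function name is my own):

```python
# Metric prefixes as powers of ten: the grammar carries the arithmetic.
PREFIX_EXPONENT = {"kilo": 3, "mega": 6, "giga": 9, "tera": 12,
                   "milli": -3, "micro": -6, "nano": -9}

def to_base_units(value, prefix):
    """Convert e.g. 4 'giga' -> 4_000_000_000 base units."""
    return value * 10 ** PREFIX_EXPONENT[prefix]

to_base_units(4, "giga")   # the bytes in 4 gigabytes
to_base_units(1, "micro")  # a microgram, in grams
```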

[Figure: The metric system]

Notice that, with this technology, we have the same root word, and only need to modify the prefix to expand our vocabulary. More pleasant, no?

Global Trends

Recall our prediction, that the number of measurement systems would decrease over time. And it has. All countries marked in green use the Metric system:

[Figure: Global metrication status]

Notice any outliers? 🙂

It’s not like the United States hasn’t tried. In 1975, Congress passed the Metric Conversion Act… but its efforts were largely disbanded in 1982. You can read more here if you like.

Policy Proposal

  • Proposal: The United States should pursue metrication.

Some drawbacks: Such legislation will cost money, and be inconvenient in the short term.

Some benefits: Improved international relations, promotion of less fuzzy thinking, working memory generally freed up for other tasks.

To me, the more worrying possibility is systemic failure: perhaps any political action that incurs short-term cost in exchange for long-term gain is generally considered hazardous. Perhaps, for example, we could introduce legislation timers, so that the fallout from “eat your vegetables” bills doesn’t fall on their signatories.

Yes, I’m aware the above example is completely broken. But it is meant to signal the kind of thinking we need: infrastructure refactoring.

What Use Are Policy Proposals?

A large amount of ink has been spilled on the metric system. Many of these contributions dive to a depth greater than mine. I do not expect my career to involve the comprehensive analysis of policy ramifications, the meticulous construction of actionable proposals. I am a voice in the wind. Why do I bother?

I will be collecting policy proposals on this blog for several reasons. Beyond my philosophy of politics, I write because it may bring value to the world, and because it helps organize my mental life. I would also like to ultimately find collaborators: like-minded individuals interested in researching with me. But I also write because I hope my unconventional emphases will someday unlock relatively novel ideas of good quality. Here’s an example of an idea that may come from my cognitive emphasis above (no promises on quality, though :P):

The quantitative prefix was ultimately a marriage of mathematical reasoning and grammatical systems. I am unable to technically specify the full cognitive algorithm for why this combination works (yet, darn it!). But it opens the door to brainstorming: how else could we leverage language to crystallize and augment our rational capacities? And then you start casting around for ideas.

Bonus Proposal!

A stream-of-consciousness illustration of the kind of transhumanist creativity I am encouraging.

I recall reading speculations that perhaps one reason Chinese kids tend to score highly in math is that the digits are easier to pronounce. I then search for “chinese digits pronunciation” and find this paper. An excerpt:

These data offer support for the hypothesis that differences in digit memory between Chinese and English speakers are derived, in part, from differences in the time required to pronounce number words in the two languages.

I then wonder if a numeric system could be engineered to supplant our “one”, “two”, “three”, etc with a system more like Chinese, to enhance students’ cognitive capacities. But not exactly Chinese numerals – that phonetic system carries other disadvantages. I envision a new numerical phonetics that, engineered with state-of-the-art computational models of working memory, brings empirically-demonstrable cognitive advantages over its “natural” competitors.

See you next time.