value function | Fewer Lacunae

Part Of: [Neuroeconomics] sequence
Content Summary: 1500 words, 15 min reading time

Preliminaries

Decisions are bridges between perception and action. Not all decisions are cognitive. Instead, they occur at all levels of the abstraction hierarchy, and include things like reflexes.

Theories of decision tend to constrain themselves to cognitive phenomena. They come in two flavors: descriptive (“how does it happen”) and normative (“how should it happen”).

Decision making often occurs in the context of imperfect knowledge. We may use probability theory as a language to reason about uncertainty.

Let risk denote variance in the probability distribution of possible outcomes. Risk can exist regardless of whether a potential loss is involved. For example, a prospect that offers a 50-50 chance of paying $100 or nothing is more risky than a prospect that offers $50 for sure – even though the risky prospect entails no possibility of losing money.

Today, we will explore the history of decision theory, and the emergence of prospect theory. As the cornerstone of behavioral economics, prospect theory provides an important theoretical surface to the emerging discipline of neuroeconomics.

Maximizing Profit with Expected Value

Decision theories date back to the 17th century, and a correspondence between Pascal and Fermat. There, consumers were expected to maximize expected value (EV), which is defined as probability p multiplied by outcome value x.

$EV = px$

To illustrate, consider the following lottery tickets:

[Figure: lottery tickets A, B, and C with interchangeable expected value]

Suppose each ticket costs 50 cents, and you have one million dollars to spend. Crucially, it doesn’t matter which ticket you buy! Each of these tickets has the same expected value: $1. Thus, in expectation, it doesn’t matter if you spend the million dollars on A, B, or C: each leads to the same expected profit.

The above tickets have equal expected value, but they do not have equal risk. We call people who prefer choice A risk averse; whereas someone who prefers C is risk seeking.
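To make the distinction between expected value and risk concrete, here is a minimal sketch. The three tickets below are hypothetical stand-ins (their probabilities and payouts are assumptions, chosen only so that each has an expected value of $1), and risk is measured as variance, as defined in the preliminaries.

```python
# Hypothetical tickets: (probability of winning, payout in dollars).
# These numbers are illustrative assumptions, not the tickets from the text.
tickets = {
    "A": (1.0, 1.0),     # certain $1
    "B": (0.5, 2.0),     # coin flip for $2
    "C": (0.01, 100.0),  # long shot for $100
}

def expected_value(p, x):
    """EV = p * x for a ticket that pays x with probability p, else nothing."""
    return p * x

def variance(p, x):
    """Variance of a payout that is x with probability p and 0 otherwise."""
    ev = expected_value(p, x)
    return p * (x - ev) ** 2 + (1 - p) * (0 - ev) ** 2

for name, (p, x) in tickets.items():
    print(name, "EV =", expected_value(p, x), "variance =", round(variance(p, x), 4))
```

Under these assumed numbers the expected values coincide while the variance grows from A to C, which is the sense in which A is the safe choice and C the risky one.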

Introducing Expected Utility

Economic transactions can be difficult to evaluate. When trading an apple for an orange, which is more valuable? That depends on a person’s unique tastes. In other words, value is subjective.

Let utility represent subjective value. We can treat utility as a function u() that operates on objective outcome x. Expected utility, then, is highly analogous to expected value:

$EU = pu(x)$

Most economists treat utility functions as abstractions: people act as if motivated by a utility function. Neuroeconomic research, however, suggests that utility functions are physically constructed by the brain.

Every person’s utility function may be different. If a person’s utility curve is linear, then expected utility reduces to expected value:

$EU \rightarrow EV \mid u(x) = x$

Recall, in the above lottery, the behavioral distinction between risk-averse (preferring ticket A) and risk-seeking (preferring C). Well, in practice most people prefer A. Why?

We can explain this behavior by appealing to the shape of the utility curve! Utility concavity produces risk aversion:

In the above, we see the first $50 (first vertical line) produces more utility (first horizontal line) than the second $50.

Intuitively, the first $50 is needed more than the second $50. The larger your wealth, the less your need. This phenomenon is known as diminishing marginal utility.
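A small sketch can show how concavity alone produces risk aversion. It reuses the earlier example of $50 for sure versus a 50-50 chance of $100 or nothing. The square-root utility is an assumption picked only because it is concave, and `expected_utility` is a hypothetical helper that sums $pu(x)$ over every outcome of a lottery.

```python
import math

def expected_utility(lottery, u):
    """Sum of p * u(x) over the (probability, outcome) pairs of a lottery."""
    return sum(p * u(x) for p, x in lottery)

sure_thing = [(1.0, 50.0)]            # $50 for sure
gamble = [(0.5, 100.0), (0.5, 0.0)]   # 50-50 chance of $100 or nothing

linear = lambda x: x      # u(x) = x, so EU equals EV
concave = math.sqrt       # assumed concave utility, for illustration only

eu_linear = (expected_utility(sure_thing, linear), expected_utility(gamble, linear))
eu_concave = (expected_utility(sure_thing, concave), expected_utility(gamble, concave))
print("linear u: ", eu_linear)
print("concave u:", eu_concave)
```

With the linear curve the two prospects tie, while the concave curve ranks the sure $50 above the gamble.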

Neoclassical Economics

In 1944, von Neumann and Morgenstern formulated a set of axioms that are both necessary and sufficient for representing a decision-maker’s choices by the maximization of expected utility.

Specifically, if you assume an agent’s preference set accommodates these axioms…

1. Completeness. People have preferences over all lotteries.

$\forall L_1, L_2 \in L$ either $L_1 \leq L_2$ or $L_1 \geq L_2$ or $L_1 = L_2$

2. Transitivity. Preferences are expressed consistently.

$\forall L_1, L_2, L_3 \in L$ if $L_1 \leq L_2$ and $L_2 \leq L_3$ then $L_1 \leq L_3$

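3. Continuity. Preferences vary smoothly with probabilities: a middling lottery can be bracketed by probability mixtures of a better and a worse one.

$\forall L_1, L_2, L_3 \in L$ , if $L_1 \geq L_2 \geq L_3$ then $\exists \alpha, \beta$ s.t. $\alpha L_1 + (1-\alpha)L_3 \geq L_2 \geq \beta L_1 + (1 - \beta)L_3$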

4. Substitution. If you prefer (or are indifferent to) lottery $L_1$ over $L_2$ , mixing both with the same third lottery $L_3$ in the same proportion α must not reverse that preference—adding identical “padding” is irrelevant to the choice.

$\forall\,L_1,L_2,L_3\in\mathcal{L},\;\forall\,\alpha\in(0,1):\; L_1\succeq L_2 \;\Rightarrow\; \alpha L_1 + (1-\alpha)L_3 \succeq \alpha L_2 + (1-\alpha)L_3$

The above axioms constitute expected utility theory, and form the cornerstone for neoclassical economics. Expected utility theory bills itself as both a normative and descriptive theory: that we understand human decision making, and have a language to explain why it is correct.

Challenges To Substitution Axiom

In the 1970s, expected utility theory came under heavy fire for failing to predict human behavior. The emerging school of behavioral economics gathered empirical evidence that von Neumann-Morgenstern axioms were routinely violated in practice, especially the substitution axiom.

For example, the Allais paradox asks our preferences for the following choices:

Most people prefer A (“certain win”) and D (“bigger number”). But these preferences are inconsistent, because C = 0.01A and D = 0.01B. The substitution axiom instead predicts that A ≽ B if and only if C ≽ D.
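The inconsistency can be checked numerically. The sketch below uses hypothetical payoffs (the dollar amounts, the 0.8 probability, and the three utility curves are all assumptions) and builds C and D by scaling the winning probabilities of A and B by 0.01. Because every curve satisfies $u(0) = 0$, scaling multiplies both expected utilities by the same factor, so the ranking cannot flip.

```python
import math

def expected_utility(lottery, u):
    """Sum of p * u(x) over the (probability, outcome) pairs of a lottery."""
    return sum(p * u(x) for p, x in lottery)

def scale(lottery, k):
    """Multiply every probability by k; the leftover probability mass pays $0."""
    scaled = [(k * p, x) for p, x in lottery]
    return scaled + [(1.0 - sum(p for p, _ in scaled), 0.0)]

# Hypothetical payoffs, chosen only for illustration.
A = [(1.0, 30.0)]                 # certain win
B = [(0.8, 45.0), (0.2, 0.0)]     # bigger number, less certain
C, D = scale(A, 0.01), scale(B, 0.01)

# Three assumed utility curves, each with u(0) = 0.
utilities = {"linear": lambda x: x, "sqrt": math.sqrt, "log1p": math.log1p}

results = {}
for name, u in utilities.items():
    prefers_A = expected_utility(A, u) >= expected_utility(B, u)
    prefers_C = expected_utility(C, u) >= expected_utility(D, u)
    results[name] = (prefers_A, prefers_C)
    print(name, "prefers A over B:", prefers_A, "| prefers C over D:", prefers_C)
```

Whatever curve is chosen, the two answers agree, so no expected-utility maximizer of this kind can prefer both A and D.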

The Decoy effect contradicts the Independence of Irrelevant Alternatives (IIA). I find it to be best illustrated with popcorn:

Towards a Value Function

Concurrently to these criticisms of the substitution axiom, the heuristics and biases literature (led by Kahneman and Tversky) began to discover new behaviors that demanded explanation:

Risk Aversion. In most decisions, people tend to prefer smaller variance in outcomes.
Loss Aversion. Everyone prefers gains over losses, of course. Loss aversion reflects that losses are felt more intensely than gains of equal magnitude.
The Endowment Effect. Things you own are intrinsically valued more highly. Framing decisions as gains or as losses affects choice behavior.

[Figure: Prospect Theory, behavioral effects and economic biases]

Each of these behavioral findings violates the substitution axiom, and cumulatively they demanded a new theory. In 1979, Kahneman and Tversky put forward prospect theory to explain all of the above effects.

Their biggest innovation was to rethink the utility function. Do you recall how neoclassical economics appealed to $u(x)$ concavity to explain risk aversion? Prospect theory takes this approach yet further, and seeks to explain all of the above behaviors using a more complex shape of the utility function.

Let value function $\textbf{v(x)}$ represent our updated notion of utility. We can define the expected prospect $\textbf{EP}$ as probability multiplied by the value function:

$EP = pv(x)$

Terminology aside, the theories differ only in the shape of their outcome functions.

[Figure: Prospect Theory, evolution of the utility function]

Let us now look closer at the shape of $v(x)$ :

This shape allows us to explain the above behaviors:

The endowment effect captures the fact that we value things we own more highly. The reference point in $v(x)$ , where $x = 0$ , captures the status quo. Thus, the reference point allows us to differentiate gains and losses, thereby producing the endowment effect.

Loss aversion captures the fact that losses are felt more strongly than gains. For outcomes of equal size, the magnitude of $v(x)$ is larger on the loss side: $|v(-x)| > v(x)$ for $x > 0$ . This asymmetry explains loss aversion.

We have already explained risk aversion by concavity of the utility function $u(x)$ . $v(x)$ retains concavity for material gains. Thus, we have retained our ability to explain risk aversion in situations of possible gains. For losses, $v(x)$ convexity predicts risk seeking.
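The shape described above can be sketched in a few lines. The text does not commit to a formula, so the power-law form and the parameter values below (an exponent of 0.88 and a loss multiplier of 2.25, commonly cited estimates from Tversky and Kahneman's later work) are assumptions used purely for illustration.

```python
def value(x, alpha=0.88, lam=2.25):
    """Assumed power-law value function, relative to a reference point at x = 0.

    alpha < 1 bends the curve (concave for gains, convex for losses);
    lam > 1 makes losses loom larger than gains.
    """
    if x >= 0:
        return x ** alpha
    return -lam * (-x) ** alpha

# Loss aversion: a $100 loss has larger magnitude than a $100 gain.
print(value(100), value(-100))

# Concavity for gains: the first $50 adds more value than the second $50.
print(value(50) - value(0), value(100) - value(50))

# Convexity for losses: a 50-50 shot at losing $100 beats a sure loss of $50.
print(0.5 * value(-100), value(-50))
```

Each printed pair corresponds to one of the three properties: asymmetry around the reference point, risk aversion for gains, and risk seeking for losses.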

Towards a Weight Function

Another behavioral discovery, however, immediately put prospect theory in doubt:

The Fourfold Pattern. For situations that involve very high or very low probabilities, participants often switch their approaches to risk.

To be specific, here are the four situations and their resultant behaviors:

Fear of Disappointment. With a 95% chance to win $100, most people are risk averse.
Hope To Avoid Loss. With a 95% chance to lose $100, most people are risk seeking.
Hope Of Large Gain. With a 5% chance to win $100, most people are risk seeking.
Fear of Large Loss. With a 5% chance to lose $100, most people are risk averse.

Crucially, $v(x)$ fails to predict this behavior. As we saw in the previous section, it predicts risk aversion for gains, and risk seeking for losses:

[Figure: Prospect Theory, fourfold pattern, actual vs. expected]

Failed predictions are not a death knell to a theory. Under certain conditions, they can inspire a theory to become stronger!

Prospect theory was improved by incorporating a probability-weighting function.

$EP = pv(x) \rightarrow EP = w(p)v(x)$

Where $w(p)$ has the following shape:

There are in fact two probability-weighting functions (Hertwig & Erev 2009):

Explicit weights represent probabilities learned through language; e.g., when reading the sentence “there is a 5% chance of reward”.
Implicit weights represent probabilities learned through experience, e.g., when the last 5 out of 100 trials yielded a reward.

This change adds some mathematical muscle to the ancient proverb:

Humans don’t handle extreme probabilities well.

And indeed, the explicit probability-weighting function successfully recovers the fourfold pattern:
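As a rough check, the sketch below combines an assumed inverse-S weighting function with an assumed power-law value function and classifies each of the four situations. The functional forms and the parameters (0.61, 0.88, 2.25) are illustrative assumptions, not values given in the text. Each gamble is compared against receiving its expected value for sure.

```python
def value(x, alpha=0.88, lam=2.25):
    """Assumed power-law value function with reference point x = 0."""
    return x ** alpha if x >= 0 else -lam * (-x) ** alpha

def weight(p, gamma=0.61):
    """Assumed inverse-S weighting: overweights small p, underweights large p."""
    return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)

def attitude(p, x):
    """Compare a gamble (x with probability p) against its expected value for sure."""
    gamble = weight(p) * value(x)
    sure_thing = value(p * x)
    return "risk seeking" if gamble > sure_thing else "risk averse"

for p, x in [(0.95, 100), (0.95, -100), (0.05, 100), (0.05, -100)]:
    print(f"{p:.0%} chance of {x:+}: {attitude(p, x)}")
```

Under these assumptions the four cases come out as risk averse, risk seeking, risk seeking, and risk averse, matching the order in which the fourfold pattern is listed above.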

Takeaways

Today we have reviewed theories of expected value, expected utility (neoclassical economics), and prospect theory. Each theory corresponds to a particular set of conceptual commitments, as well as a particular formula:

$EV = px$

$EU = pu(x)$

$EP = w(p)v(x)$

However, we can unify these into a single value formula V:

$V = w(p)v(x)$

In this light, EV and EU have the same structure as prospect theory: EV sets $w(p) = p$ and $v(x) = x$ , while EU sets $w(p) = p$ and $v(x) = u(x)$ . Prospect theory distinguishes itself by using empirically motivated shapes:

[Figure: Prospect Theory, evolution of both functions]

With these tools, prospect theory successfully recovers a wide swathe of economic behaviors.

Until next time.

Works Cited

Hertwig & Erev (2009). The description–experience gap in risky choice

Part Of: Reinforcement Learning sequence
Followup To: An Introduction To Markov Chains
Content Summary: 900 words, 9 min read

Motivations

Today, we turn our gaze to Markov Decision Processes (MDPs), a decision-making environment which supports our propensity to learn from good and bad outcomes. We represent outcome desirability with a single number, R. This value is used to refine action selection: given a particular situation, what action will maximize expected reward?

In biology, we can describe the primary work performed by an organism as maintaining homeostasis: preserving metabolic energy reserves, body temperature, etc. in a widely varying world.

Cybernetics provides a clear way of conceptualizing biological reward. In Neuroendocrine Integration, we discussed how brains must respond both to internal and external changes. This dichotomy expresses itself as two perception-action loops: a visceral body-oriented loop, and a cognitive world-centered one.

Rewards are computed by the visceral loop. To a first approximation, rewards encode progress towards homeostasis. Food is perceived as more rewarding when the body is hungry; this is known as alliesthesia. Reward information is delivered to the cognitive loop, which uses it to refine its decision making.

[Figure: Reinforcement Learning, reward as visceral efferent]

Extending Markov Chains

Recall that a Markov Chain contains a set of states S, and a transition model P. A Markov Decision Process (MDP) extends this device, by adding three new elements.

Specifically, an MDP is a 5-tuple (S, P, A, R, γ):

A set of states s ∈ S
A transition model P_a(s’ | s)
A set of actions a ∈ A
A reward function R(s, s’)
A discount factor γ

To illustrate, consider GridWorld. In this example, every location in this two-dimensional grid is a state, for example (1,0). State (3,0) is a desirable location: R(s(3,0)) = +1.0, but state (3,1) is undesirable, R(s(3,1)) = -1.0. All other states are neutral.

Gridworld supports four actions, or movements: up, down, left, and right. However, locomotion is imperfect: if Up is selected, the agent will only move up with 80% probability; the remaining 20% of the time it will slip left or right instead (10% each). Finally, attempting to move into a forbidden square will simply return the agent to its original location (“hitting the wall”).
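The noisy movement rule can be sketched as a transition function. The grid dimensions (4 by 3), the single forbidden square at (1,1), the convention that y = 0 is the top row, and the even 10%/10% split between the two sideways slips are assumptions made for this illustration. The helper names `step` and `transition` are hypothetical.

```python
# Assumed layout: a 4x3 grid with (x, y) coordinates, y = 0 on the top row,
# and one forbidden square. These details are illustrative assumptions.
WIDTH, HEIGHT = 4, 3
WALLS = {(1, 1)}
MOVES = {"Up": (0, -1), "Down": (0, 1), "Left": (-1, 0), "Right": (1, 0)}
# The two directions the agent can slip into for each intended action.
SLIPS = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
         "Left": ("Up", "Down"), "Right": ("Up", "Down")}

def step(state, direction):
    """Deterministic move; bumping into a wall or the edge leaves the agent in place."""
    x, y = state
    dx, dy = MOVES[direction]
    target = (x + dx, y + dy)
    inside = 0 <= target[0] < WIDTH and 0 <= target[1] < HEIGHT
    return target if inside and target not in WALLS else state

def transition(state, action):
    """Return {next_state: probability}: 80% intended, 10% for each sideways slip."""
    probs = {}
    outcomes = [(action, 0.8)] + [(slip, 0.1) for slip in SLIPS[action]]
    for direction, p in outcomes:
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

print(transition((2, 1), "Up"))
```

Under this layout, choosing Up at (2,1) reaches (2,0) with probability 0.8, bounces off the forbidden square and stays put with probability 0.1, and slips into the punishing state (3,1) with probability 0.1.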

[Figure: Reinforcement Learning, example MDP Gridworld]

The core problem of MDPs is to find a policy (π), a function that specifies the agent’s response to all possible states. In general, policies should strive to maximize reward, e.g., something like this:

[Figure: Reinforcement Learning, example MDP policy]

Why is the policy at (2,2) Left instead of Up? Because Up leads to (2,1), which is dangerous: from there, even when selecting Up, there is a 10% chance that the agent will accidentally move Right into (3,1), and be punished.

Let’s now consider an environment with only three states A, B, and C. First, notice how different policies change the resultant Markov Chain:

[Figure: Reinforcement Learning, policy vs. Markov chain]

This observation is important. Policy determines the transition model.
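A minimal sketch of this idea: fixing a policy picks out one row of the action-specific transition model $P_a(s'|s)$ per state, and stacking those rows gives an ordinary Markov Chain. The action names (`advance`, `stay`) and all probabilities below are hypothetical, chosen only to make the mechanics visible.

```python
import numpy as np

# Hypothetical transition model P_a(s' | s) for states A, B, C
# (rows: current state s, columns: next state s'). Numbers are assumptions.
P = {
    "advance": np.array([[0.2, 0.8, 0.0],
                         [0.0, 0.2, 0.8],
                         [0.0, 0.0, 1.0]]),
    "stay":    np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]]),
}
states = ["A", "B", "C"]

def induced_chain(policy):
    """Build the Markov chain whose row for state s is the row of P[policy[s]]."""
    return np.array([P[policy[s]][i] for i, s in enumerate(states)])

pi_1 = {"A": "advance", "B": "advance", "C": "stay"}
pi_2 = {"A": "stay", "B": "advance", "C": "stay"}
print(induced_chain(pi_1))
print(induced_chain(pi_2))
```

The two policies share the same environment yet produce different transition matrices, so every downstream prediction depends on which policy the agent follows.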

Towards Policy Valuation V(s)

An agent seeks to maximize reward. But what does that mean, exactly?

Imagine an agent selects 𝝅₁. Given the resultant Markov Chain, we already know how to use matrix multiplication to predict future locations S_t. The predicted reward P_t is simply the dot product of expected location and the reward function.

$P_t = S_t \cdot R$

[Figure: Markov chains, linear algebra and expected value]

We might be tempted to define the value function V(S) as the sum of all predicted future rewards:

$V_0(s) = P_0 + P_1 + P_2 + P_3 + \dots = \sum{P_k}$

However, this approach is flawed. Animals value temporal proximity: all else equal, we prefer to obtain rewards quickly. This is temporal discounting: as rewards are further removed from the present, their value is discounted.

In reinforcement learning, we implement temporal discounting with the gamma parameter: rewards that are k timesteps away are multiplied by the exponential discount factor $\gamma^k$ . The value function becomes:

$V_0(s) = P_0 + \gamma P_1 + \gamma^2 P_2 + \gamma^3 P_3 + \dots = \sum{\gamma^k P_k}$
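The discounted sum can be computed directly by stepping a state distribution through the chain. The three-state chain and reward vector below are assumptions; they were chosen to reproduce the predicted rewards (0.64 and 0.896) that appear in the worked example of the next section. The choices $\gamma = 0.9$ and a four-step horizon are arbitrary.

```python
import numpy as np

# Assumed three-state chain (A, B, C) and reward vector, chosen for illustration.
T = np.array([[0.2, 0.8, 0.0],
              [0.0, 0.2, 0.8],
              [0.0, 0.0, 1.0]])
R = np.array([0.0, 0.0, 1.0])

def discounted_value(start, gamma, horizon):
    """Sum gamma^k * P_k for k = 0..horizon-1, where P_k = S_k . R."""
    S = start.astype(float)
    total = 0.0
    for k in range(horizon):
        total += gamma ** k * (S @ R)   # predicted reward at step k
        S = S @ T                       # advance the state distribution
    return total

start_in_A = np.array([1.0, 0.0, 0.0])
print(discounted_value(start_in_A, gamma=0.9, horizon=4))
```

Setting `gamma=1.0` and increasing `horizon` makes the total grow without bound in this chain, since state C pays out on every step; any `gamma` below 1 keeps the sum bounded.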

Without temporal discounting, V(s) can approach infinity. But with $\gamma < 1$ and bounded rewards, exponential discounting ensures V(s) converges to a finite value. Finite valuations promote easier computation and comparison of state evaluations. For more on temporal discounting, and an alternative to the RL approach, see An Introduction to Hyperbolic Discounting.

Intertemporal Consistency

In our example, at time zero our agent starts in state A. We have already used linear algebra to compute our P_k predictions. To calculate value, we simply compute $\sum{\gamma^k P_k}$ :

$V_0(A) = 0 + 0 + 0.64 \gamma^2 + 0.896 \gamma^3 + \dots$

Agents compute V(s) at every time step. At t=1, two valuations are relevant:

$V_1(A) = 0 + 0 + 0.64 \gamma^2 + \dots$

$V_1(B) = 0 + 0.8 \gamma + 0.96 \gamma^2 + \dots$

[Figure: MDP value function, time slice]

What is the relationship between the value functions at t=0 and t=1? To answer this, we need to multiply each term by $\gamma P(s'|A)$ , where $s'$ is the state being considered at the next time step.

$W_1(A) \triangleq \gamma P(A|A)V_1(A) = \gamma 0.2 V_1(A)$

$W_1(A) = 0 + 0 + (0.2)(0.64)\gamma^3 + \dots$

Similarly,

$W_1(B) \triangleq \gamma P(B|A)V_1(B) = \gamma 0.8 V_1(B)$

$W_1(B) = 0 + (0.8)(0.8) \gamma^2 + (0.8)(0.96) \gamma^3 + \dots$

Critically, consider the sum $X_0 = r_0(A) + W_1(A) + W_1(B)$ :

$X_0 = 0 + 0 + 0.64 \gamma^2 + 0.896 \gamma^3 + \dots$

[Figure: MDP, intertemporal consistency]

Does $X_0$ look familiar? That’s because it equals $V_0(A)$ ! In this way, we have a way of equating a valuation at t=0 and t=1. This property is known as intertemporal consistency.

Bellman Equation

We have seen that $V_0(A) = X_0$ . Let’s flesh out this equation, and generalize to time t.

$V_t(s) = r_t(s) + \gamma \sum_{s'}{P(s'|s)V_{t+1}(s')}$

This is the Bellman Equation, and it is a central fixture in control systems. At its heart, we define value in terms of both immediate reward and future predicted value. We thereby break up a complex problem into small subproblems, a key optimization technique that can be approached with dynamic programming.
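The recursion can be checked numerically. The sketch below makes several assumptions: a fixed policy, a three-state chain with reward only in state C, $\gamma = 0.9$, and an infinite horizon, so that the value function no longer depends on t. Under those assumptions the Bellman Equation becomes the linear system $V = R + \gamma TV$, which NumPy can solve directly.

```python
import numpy as np

# Assumed chain over states A, B, C under a fixed policy, with reward only in C.
T = np.array([[0.2, 0.8, 0.0],
              [0.0, 0.2, 0.8],
              [0.0, 0.0, 1.0]])
R = np.array([0.0, 0.0, 1.0])
gamma = 0.9

# Solve V = R + gamma * T V, i.e. (I - gamma * T) V = R.
V = np.linalg.solve(np.eye(3) - gamma * T, R)

# Intertemporal consistency: immediate reward plus discounted
# expected next-step value reproduces V.
lookahead = R + gamma * T @ V
print(V)
print(np.allclose(V, lookahead))
```

Solving the system in one shot only works when the policy, and hence the chain, is already fixed; finding the best policy is a separate problem.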

Next time, we will explore how reinforcement learning uses the Bellman Equation to learn strategies with which to engage its environment (the optimal policy 𝝅). See you then!


An Introduction to Prospect Theory

Markov Decision Processes