# An Introduction to Prospect Theory

Part Of: [Neuroeconomics] sequence
Content Summary: 1500 words, 15 min reading time

Preliminaries

Decisions are bridges between perception and action. Not all decisions are cognitive. Instead, they occur at all levels of the abstraction hierarchy, and include things like reflexes.

Theories of decision tend to constrain themselves to cognitive phenomena. They come in two flavors: descriptive (“how does it happen”) and normative (“how should it happen”).

Decision making often occurs in the context of imperfect knowledge. We may use probability theory as a language to reason about uncertainty.

Let risk denote variance in the probability distribution of possible outcomes. Risk can exist regardless of whether a potential loss is involved. For example, a prospect that offers a 50-50 chance of paying $100 or nothing is more risky than a prospect that offers $50 for sure – even though the risky prospect entails no possibility of losing money.
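To make the definition concrete, here is a minimal Python sketch that computes the mean and variance of the two prospects above. The `mean` and `variance` helpers and the list-of-pairs representation are illustrative choices made for this example, not anything standard.

```python
# A prospect is a list of (probability, outcome) pairs.
def mean(prospect):
    return sum(p * x for p, x in prospect)

def variance(prospect):
    m = mean(prospect)
    return sum(p * (x - m) ** 2 for p, x in prospect)

risky = [(0.5, 100.0), (0.5, 0.0)]   # 50-50 chance of $100 or nothing
sure = [(1.0, 50.0)]                 # $50 for certain

print(mean(risky), variance(risky))  # 50.0 2500.0
print(mean(sure), variance(sure))    # 50.0 0.0
```

Both prospects pay the same on average, yet only one of them has any variance. That difference is what we mean by risk.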

Today, we will explore the history of decision theory, and the emergence of prospect theory. As the cornerstone of behavioral economics, prospect theory provides an important theoretical surface to the emerging discipline of neuroeconomics.

Maximizing Profit with Expected Value

Decision theories date back to the 17th century, and a correspondence between Pascal and Fermat. There, consumers were expected to maximize expected value (EV), which is defined as probability p multiplied by outcome value x.

$EV = px$

To illustrate, consider the following lottery tickets:

Suppose each ticket costs 50 cents, and you have one million dollars to spend. Crucially, it doesn’t matter which ticket you buy! Each of these tickets has the same expected value: $1. Thus, it doesn’t matter if you spend the million dollars on A, B, or C – each leads to the same expected profit.

The above tickets have equal expected value, but they do not have equal risk. We call people who prefer choice A risk averse, whereas someone who prefers C is risk seeking.

Introducing Expected Utility

Economic transactions can be difficult to evaluate. When trading an apple for an orange, which is more valuable? That depends on a person’s unique tastes. In other words, value is subjective.

Let utility represent subjective value. We can treat utility as a function u() that operates on objective outcome x. Expected utility, then, is highly analogous to expected value:

$EU = pu(x)$

Most economists treat utility functions as abstractions: people act as if motivated by a utility function. Neuroeconomic research, however, suggests that utility functions are physically constructed by the brain.

Every person’s utility function may be different. If a person’s utility curve is linear, then expected utility converges onto expected value:

$EU \rightarrow EV \mid u(x) = x$

Recall, in the above lottery, the behavioral distinction between risk-averse (preferring ticket A) and risk-seeking (preferring C). Well, in practice most people prefer A. Why?

We can explain this behavior by appealing to the shape of the utility curve! Utility concavity produces risk aversion:

In the above, we see the first $50 (first vertical line) produces more utility (first horizontal line) than the second $50. Intuitively, the first $50 is needed more than the second $50. The larger your wealth, the less your need. This phenomenon is known as diminishing marginal utility.
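A small Python sketch can show how curvature alone generates risk aversion. It assumes a square-root utility curve, which is just one convenient concave function and not something this post commits to. It reuses the sure-$50 versus 50-50-$100 prospects from the Preliminaries, since they have equal expected value.

```python
import math

def expected_value(prospect):
    return sum(p * x for p, x in prospect)

def expected_utility(prospect, u):
    return sum(p * u(x) for p, x in prospect)

def u_concave(x):
    # Assumed utility curve: the square root is concave for x >= 0.
    return math.sqrt(x)

def u_linear(x):
    return x

sure = [(1.0, 50.0)]
risky = [(0.5, 100.0), (0.5, 0.0)]

# Same expected value...
print(expected_value(sure) == expected_value(risky))  # True
# ...but an agent with concave utility prefers the sure thing.
print(expected_utility(sure, u_concave) > expected_utility(risky, u_concave))  # True
# With linear utility, EU equals EV and the agent is indifferent.
print(expected_utility(sure, u_linear) == expected_utility(risky, u_linear))   # True
```

The second $50 adds less utility than the first, so a coin flip between $0 and $100 is worth less than its average.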
Neoclassical Economics

In 1947, von Neumann and Morgenstern formulated a set of axioms that are both necessary and sufficient for representing a decision-maker’s choices by the maximization of expected utility.

Specifically, if you assume an agent’s preference set accommodates these axioms…

1. Completeness. People have preferences over all lotteries.
$\forall L_1, L_2 \in L$ either $L_1 \leq L_2$ or $L_1 \geq L_2$ or $L_1 = L_2$
2. Transitivity. Preferences are expressed consistently.
$\forall L_1, L_2, L_3 \in L$ if $L_1 \leq L_2$ and $L_2 \leq L_3$ then $L_1 \leq L_3$
3. Continuity. Preferences are expressed as probabilities.
$\forall L_1, L_2, L_3 \in L$ with $L_1 \geq L_2 \geq L_3$, $\exists \alpha, \beta \in (0, 1)$ s.t. $\alpha L_1 + (1-\alpha)L_3 \geq L_2 \geq \beta L_1 + (1 - \beta)L_3$
4. Independence of Irrelevant Alternatives (IIA). Binary preferences don’t change by injecting a third lottery.

… then those preferences always maximize expected utility.

$L_1 \geq L_2$ iff $\sum{p_1u(x_1)} \geq \sum{p_2u(x_2)}$

The above axioms constitute expected utility theory, and form the cornerstone for neoclassical economics. Expected utility theory bills itself as both a normative and descriptive theory: that we understand human decision making, and have a language to explain why it is correct.

Challenges To Independence Axiom

In the 1970s, expected utility theory came under heavy fire for failing to predict human behavior. The emerging school of behavioral economics gathered empirical evidence that the von Neumann-Morgenstern axioms were routinely violated in practice, especially the Independence Axiom (IIA).

For example, the Allais paradox asks our preferences for the following choices:

Most people prefer A (“certain win”) and D (“bigger number”). But these preferences are inconsistent, because C = 0.01A and D = 0.01B. The independence axiom instead predicts that A ≽ B if and only if C ≽ D.
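To see why expected utility cannot accommodate the A-and-D pattern, here is a hedged Python sketch. The payoffs for A and B are hypothetical, chosen only for illustration, and "C = 0.01A" is read as "play A with probability 0.01, otherwise receive nothing"; both are assumptions rather than details taken from the original choice table. The `scale` helper is likewise made up for this example.

```python
import math

def expected_utility(lottery, u):
    return sum(p * u(x) for p, x in lottery)

def scale(lottery, q):
    """Play `lottery` with probability q, otherwise receive nothing."""
    return [(q * p, x) for p, x in lottery] + [(1.0 - q, 0.0)]

# Hypothetical payoffs, chosen only for illustration.
A = [(1.0, 30.0)]
B = [(0.8, 45.0), (0.2, 0.0)]
C = scale(A, 0.01)
D = scale(B, 0.01)

for u in (lambda x: x, math.sqrt, math.log1p):
    gap_ab = expected_utility(A, u) - expected_utility(B, u)
    gap_cd = expected_utility(C, u) - expected_utility(D, u)
    # The shared 0.99 * u(0) term cancels, so gap_cd == 0.01 * gap_ab.
    assert math.isclose(gap_cd, 0.01 * gap_ab, abs_tol=1e-12)
    print(gap_ab > 0, gap_cd > 0)   # the two signs always agree
```

Whatever utility curve we plug in, the gap between C and D is just a scaled-down copy of the gap between A and B. No choice of u() can prefer A over B while also preferring D over C.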
The Decoy effect is best illustrated with popcorn:

Towards a Value Function

Concurrently to these criticisms of the independence axiom, the heuristics and biases literature (led by Kahneman and Tversky) began to discover new behaviors that demanded explanation:

• Risk Aversion. In most decisions, people tend to prefer smaller variance in outcomes.
• Loss Aversion. Everyone prefers gains over losses, of course. Loss aversion reflects that losses are felt more intensely than gains of equal magnitude.
• The Endowment Effect. Things you own are intrinsically valued more highly. Framing decisions as gains or as losses affects choice behavior.

Each of these behavioral findings violates the Independence Axiom (IIA), and together they demanded a new theory. And in 1979, Kahneman and Tversky put forward prospect theory to explain all of the above effects.

Their biggest innovation was to rethink the utility function. Do you recall how neoclassical economics appealed to $u(x)$ concavity to explain risk aversion? Prospect theory takes this approach yet further, and seeks to explain all of the above behaviors using a more complex shape of the utility function.

Let value function $\textbf{v(x)}$ represent our updated notion of utility. We can define expected prospect $\textbf{EP}$ of a function as probability multiplied by the value function:

$EP = pv(x)$

Terminology aside, each theory only differs in the shape of its outcome function.

Let us now look closer at the shape of $v(x)$:

This shape allows us to explain the above behaviors:

The endowment effect captures the fact that we value things we own more highly. The reference point in $v(x)$, where $x = 0$, captures the status quo. Thus, the reference point allows us to differentiate gains and losses, thereby producing the endowment effect.

Loss aversion captures the fact that losses are felt more strongly than gains. The magnitude of $v(x)$ is larger in the losses dimension. This asymmetry explains loss aversion.
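The reference point and the gain/loss asymmetry can be sketched in a few lines of Python. The power-function form and the parameter values (0.88 for curvature, 2.25 for loss aversion) are the median estimates Tversky and Kahneman reported in 1992. They are assumptions brought in for illustration and do not come from this post.

```python
def v(x, alpha=0.88, beta=0.88, lam=2.25):
    """Value of outcome x, measured relative to the reference point x = 0.

    alpha, beta: curvature for gains and losses (assumed values).
    lam: loss-aversion coefficient; lam > 1 makes losses loom larger.
    """
    if x >= 0:
        return x ** alpha
    return -lam * ((-x) ** beta)

gain, loss = v(100), v(-100)
print(gain > 0 > loss)     # True: the sign depends on which side of 0 we are
print(abs(loss) > gain)    # True: losing $100 hurts more than winning $100 pleases
```

With equal curvature on both sides, the ratio of the two magnitudes is exactly the loss-aversion coefficient.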
We have already explained risk aversion by concavity of the utility function $u(x)$. $v(x)$ retains concavity for material gains. Thus, we have retained our ability to explain risk aversion in situations of possible gains. For losses, $v(x)$ convexity predicts risk seeking.

Towards a Weight Function

Another behavioral discovery, however, immediately put prospect theory in doubt:

• The Fourfold Pattern. For situations that involve very high or very low probabilities, participants often switch their approaches to risk.

To be specific, here are the four situations and their resultant behaviors:

1. Fear of Disappointment. With a 95% chance to win $100, most people are risk averse.
2. Hope To Avoid Loss. With a 95% chance to lose $100, most people are risk seeking.
3. Hope Of Large Gain. With a 5% chance to win $100, most people are risk seeking.
4. Fear of Large Loss. With a 5% chance to lose $100, most people are risk averse.

Crucially, $v(x)$ fails to predict this behavior. As we saw in the previous section, it predicts risk aversion for gains, and risk seeking for losses:

Failed predictions are not a death knell to a theory. Under certain conditions, they can inspire a theory to become stronger!

Prospect theory was improved by incorporating a more flexible weight function.

$EP = pv(x) \rightarrow EP = w(p)v(x)$

Where $w(p)$ has the following shape:

These are in fact two weight functions:

1. Explicit weights represent probabilities learned through language; e.g., when reading the sentence “there is a 5% chance of reward”.
2. Implicit weights represent probabilities learned through experience; e.g., when the last 5 out of 100 trials yielded a reward.

This change adds some mathematical muscle to the ancient proverb:

Humans don’t handle extreme probabilities well.

And indeed, the explicit weight function successfully recovers the fourfold pattern:

Takeaways

Today we have reviewed theories of expected value, expected utility (neoclassical economics), and prospect theory. Each theory corresponds to a particular set of conceptual commitments, as well as a particular formula:

$EV = px$

$EU = pu(x)$

$EP = w(p)v(x)$

However, we can unify these into a single value formula V:

$V = w(p)v(x)$

In this light, EV and EU have the same structure as prospect theory. Prospect theory distinguishes itself by using empirically motivated shapes:

With these tools, prospect theory successfully recovers a wide swathe of economic behaviors.

Until next time.

# Markov Decision Processes

Part Of: Reinforcement Learning sequence
Followup To: An Introduction To Markov Chains
Content Summary: 900 words, 9 min read

Motivations

Today, we turn our gaze to Markov Decision Processes (MDPs), a decision-making environment which supports our propensity to learn from good and bad outcomes. We represent outcome desirability with a single number, R.
This value is used to refine action selection: given a particular situation, what action will maximize expected reward?

In biology, we can describe the primary work performed by an organism as maintaining homeostasis: keeping metabolic energy reserves, body temperature, etc. stable in a widely varying world.

Cybernetics provides a clear way of conceptualizing biological reward. In Neuroendocrine Integration, we discussed how brains must respond both to internal and external changes. This dichotomy expresses itself as two perception-action loops: a visceral body-oriented loop, and a cognitive world-centered one.

Rewards are computed by the visceral loop. To a first approximation, rewards encode progress towards homeostasis. Food is perceived as more rewarding when the body is hungry; this is known as alliesthesia. Reward information is delivered to the cognitive loop, which helps refine its decision making.

Extending Markov Chains

Recall that a Markov Chain contains a set of states S, and a transition model P. A Markov Decision Process (MDP) extends this device, by adding three new elements.

Specifically, an MDP is a 5-tuple (S, P, A, R, ɣ):

• A set of states s ∈ S
• A transition model $P_a(s' \mid s)$
• A set of actions a ∈ A
• A reward function R(s, s’)
• A discount factor ɣ

To illustrate, consider GridWorld. In this example, every location in this two-dimensional grid is a state, for example (1,0). State (3,0) is a desirable location: R(s(3,0)) = +1.0, but state (3,1) is undesirable, R(s(3,1)) = -1.0. All other states are neutral.

GridWorld supports four actions, or movements: up, down, left, and right. However, locomotion is imperfect: if Up is selected, the agent will only move up with 80% probability; 20% of the time it will go left or right instead. Finally, attempting to move into a forbidden square will simply return the agent to its original location (“hitting the wall”).
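The GridWorld rules above translate fairly directly into a transition model. The Python sketch below rests on several assumptions that the text does not pin down: a grid 4 squares wide and 3 tall, a y coordinate that grows downward, a single forbidden square at (1,1), and the 20% slip split evenly between the two perpendicular directions. The names `step`, `transition`, and `reward` are made up for this example.

```python
WIDTH, HEIGHT = 4, 3          # assumed grid size
WALLS = {(1, 1)}              # assumed forbidden square
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
# Perpendicular directions the agent may slip toward.
SLIPS = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}

def step(state, direction):
    """Deterministic move; bumping into a wall or edge leaves the agent in place."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    inside = 0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT
    return nxt if inside and nxt not in WALLS else state

def transition(state, action):
    """Return {next_state: probability}, i.e. the model P_a(s' | s)."""
    probs = {}
    outcomes = [(action, 0.8)] + [(d, 0.1) for d in SLIPS[action]]
    for direction, p in outcomes:
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

def reward(state):
    return {(3, 0): 1.0, (3, 1): -1.0}.get(state, 0.0)

print(transition((2, 1), "up"))
# {(2, 0): 0.8, (2, 1): 0.1, (3, 1): 0.1}
```

Under these assumptions, choosing Up from (2,1) usually works, sometimes bounces off the wall, and occasionally slides the agent into the punishing square at (3,1).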
The core problem of MDPs is to find a policy (π), a function that specifies the agent’s response to all possible states. In general, policies should strive to maximize reward, e.g., something like this:

Why is the policy at (2,2) Left instead of Up? Because (2,1) is dangerous: despite selecting Up, there is a 10% chance that the agent will accidentally move Right, and be punished.

Let’s now consider an environment with only three states A, B, and C. First, notice how different policies change the resultant Markov Chain:

This observation is important. Policy determines the transition model.

Towards Policy Valuation V(s)

An agent seeks to maximize reward. But what does that mean, exactly?

Imagine an agent selects 𝝅1. Given the resultant Markov Chain, we already know how to use matrix multiplication to predict future locations $S_t$. The predicted reward $P_t$ is simply the dot product of expected location and the reward function.

$P_t = S_t \cdot R$

We might be tempted to define the value function V(S) as the sum of all predicted future rewards:

$V_0(S) = P_0 + P_1 + P_2 + P_3 + \dots = \sum{P_k}$

However, this approach is flawed. Animals value temporal proximity: all else equal, we prefer to obtain rewards quickly. This is temporal discounting: as rewards are further removed from the present, their value is discounted.

In reinforcement learning, we implement temporal discounting with the gamma parameter: rewards that are k timesteps away are multiplied by the exponential discount factor $\gamma^k$. The value function becomes:

$V_0(S) = P_0 + \gamma P_1 + \gamma^2 P_2 + \gamma^3 P_3 + \dots = \sum{\gamma^k P_k}$

Without temporal discounting, V(s) can approach infinity. But exponential discounting (with $\gamma < 1$) ensures V(s) equals a finite value. Finite valuations promote easier computation and comparison of state evaluations. For more on temporal discounting, and an alternative to the RL approach, see An Introduction to Hyperbolic Discounting.
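A quick numerical sketch shows why discounting keeps valuations finite. It assumes a reward of 1 at every timestep and a discount factor of 0.9; both numbers are picked purely for illustration.

```python
def discounted_sum(rewards, gamma):
    """Sum of gamma**k * rewards[k] over the given horizon."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

gamma = 0.9                      # assumed discount factor
for horizon in (10, 100, 1000):
    rewards = [1.0] * horizon    # a reward of 1 at every timestep
    undiscounted = discounted_sum(rewards, 1.0)
    discounted = discounted_sum(rewards, gamma)
    print(horizon, undiscounted, round(discounted, 4))

# The discounted total never exceeds the geometric-series limit 1 / (1 - gamma).
limit = 1.0 / (1.0 - gamma)
```

The undiscounted total grows with the horizon, while the discounted total levels off near the limit no matter how far ahead we look.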
Intertemporal Consistency

In our example, at time zero our agent starts in state A. We have already used linear algebra to compute our $P_k$ predictions. To calculate value, we simply compute $\sum{\gamma^k P_k}$:

$V_0(A) = 0 + 0 + 0.64 \gamma^2 + 0.896 \gamma^3 + \dots$
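The coefficients in this series can be reproduced with a few lines of plain Python. The transition matrix `T` below is an assumption: it is reconstructed so that it matches the numbers quoted in this section, with A moving to B with probability 0.8, B moving to C with probability 0.8, C absorbing, and reward earned only in C.

```python
# States are ordered A, B, C. Rows are "from", columns are "to".
# Assumed chain, chosen to reproduce the 0.64 and 0.896 terms in the text.
T = [
    [0.2, 0.8, 0.0],   # from A
    [0.0, 0.2, 0.8],   # from B
    [0.0, 0.0, 1.0],   # from C (absorbing)
]
R = [0.0, 0.0, 1.0]    # reward is earned only in state C

def advance(dist, T):
    """One step of the chain: new[j] = sum_i dist[i] * T[i][j]."""
    n = len(T)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]

def predicted_rewards(start, T, R, horizon):
    """Return [P_0, P_1, ..., P_horizon], where P_t = S_t . R."""
    dist, out = list(start), []
    for _ in range(horizon + 1):
        out.append(sum(s * r for s, r in zip(dist, R)))
        dist = advance(dist, T)
    return out

start_in_A = [1.0, 0.0, 0.0]
print([round(p, 3) for p in predicted_rewards(start_in_A, T, R, 3)])
# [0.0, 0.0, 0.64, 0.896]
```

Each entry is the probability of sitting in C after that many steps, which is why the sequence climbs toward 1.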

Agents compute V(s) at every time step. At t=1, two valuations are relevant:

$V_1(A) = 0 + 0 + 0.64 \gamma^2 + \dots$

$V_1(B) = 0 + 0.8 \gamma + 0.96 \gamma^2 + \dots$

What is the relationship between the value functions at t=0 and t=1? To answer this, we need to multiply each t=1 valuation by $\gamma P(X|A)$, where $X$ is the state being considered at the next time step.

$W_1(A) \triangleq \gamma P(A|A)V_1(A) = \gamma 0.2 V_1(A)$

$W_1(A) = 0 + 0 + (0.2)(0.64)\gamma^3 + \dots$

Similarly,

$W_1(B) \triangleq \gamma P(B|A)V_1(B) = \gamma 0.8 V_1(B)$

$W_1(B) = 0 + (0.8)(0.8) \gamma^2 + (0.8)(0.96) \gamma^3 + \dots$

Critically, consider the sum $X_0 = r_0(A) + W_1(A) + W_1(B)$:

$X_0 = 0 + 0 + 0.64 \gamma^2 + 0.896 \gamma^3 + \dots$

Does $X_0$ look familiar? That’s because it equals $V_0(A)$! We now have a way of relating valuations at t=0 and t=1. This property is known as intertemporal consistency.
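We can also check this equality numerically rather than term by term. The sketch below assumes a discount factor of 0.9 and the same three-state chain reconstructed from the numbers above: A stays put with probability 0.2 and otherwise moves to B, B behaves likewise toward C, and C is absorbing and rewarding. Neither assumption is stated explicitly in the text.

```python
GAMMA = 0.9                      # assumed discount factor
R = {"A": 0.0, "B": 0.0, "C": 1.0}
# Assumed transition model P(s' | s), reconstructed from the numbers in the text.
T = {
    "A": {"A": 0.2, "B": 0.8},
    "B": {"B": 0.2, "C": 0.8},
    "C": {"C": 1.0},
}

def value(state, horizon):
    """Discounted sum of predicted rewards, truncated after `horizon` steps."""
    dist, total = {state: 1.0}, 0.0
    for k in range(horizon + 1):
        total += (GAMMA ** k) * sum(p * R[s] for s, p in dist.items())
        nxt = {}
        for s, p in dist.items():
            for s2, q in T[s].items():
                nxt[s2] = nxt.get(s2, 0.0) + p * q
        dist = nxt
    return total

H = 200
lhs = value("A", H)                                   # valuation made at t=0
rhs = R["A"] + GAMMA * sum(q * value(s2, H - 1)       # immediate reward plus
                           for s2, q in T["A"].items())  # weighted t=1 valuations
print(abs(lhs - rhs) < 1e-9)   # True
```

The right-hand side looks one step shorter into the future because one timestep has already been spent getting to the next state.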

Bellman Equation

We have seen that $V_0(A) = X_0$. Let’s flesh out this equation, and generalize to time t.

$V_t(s) = r_t(s) + \gamma \sum{P(s'|s)V_{t+1}(s')}$

This is the Bellman Equation, and it is a central fixture in control systems. At its heart, we define value in terms of both immediate reward and future predicted value. We thereby break up a complex problem into small subproblems, a key optimization technique that can be approached with dynamic programming.
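As a rough illustration of the dynamic programming idea, the Python sketch below repeatedly applies the Bellman equation as an update rule until the values stop changing. It makes three assumptions that are not in the text: a discount factor of 0.9, the three-state chain reconstructed from the earlier numbers, and a stationary value function, so the time index is dropped. The `evaluate` function is a made-up name for this example.

```python
GAMMA = 0.9                      # assumed discount factor
R = {"A": 0.0, "B": 0.0, "C": 1.0}
# Assumed transition model P(s' | s) under a fixed policy.
T = {
    "A": {"A": 0.2, "B": 0.8},
    "B": {"B": 0.2, "C": 0.8},
    "C": {"C": 1.0},
}

def evaluate(T, R, gamma, tol=1e-10):
    """Apply V(s) = R(s) + gamma * sum_s' P(s'|s) V(s') until it reaches a fixed point."""
    V = {s: 0.0 for s in T}
    while True:
        new_V = {
            s: R[s] + gamma * sum(q * V[s2] for s2, q in T[s].items())
            for s in T
        }
        if max(abs(new_V[s] - V[s]) for s in T) < tol:
            return new_V
        V = new_V

V = evaluate(T, R, GAMMA)
print({s: round(val, 3) for s, val in V.items()})
```

Each sweep only ever looks one step ahead, yet the values it settles on account for the entire discounted future. States closer to the reward end up with higher value.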

Next time, we will explore how reinforcement learning uses the Bellman Equation to learn strategies with which to engage its environment (the optimal policy 𝝅). See you then!