Markov Decision Processes

It’s time to expand beyond two-player games and present a more general model of identifying optimal behavior in an environment.

A common starting example is what’s called “Grid World”, which is just a small grid of squares (4×3 in our example) where you start in one square and move from square to square (north/east/west/south) to reach some goal square. The grid in our lecture video looks sort of like this:

$$
\begin{array}{|c|c|c|c|}
\hline
\quad & \quad & \quad & +1 \\ \hline
\quad & B & \quad & -1 \\ \hline
\text{start} & \quad & \quad & \quad \\ \hline
\end{array}
$$

Your goal is to reach the +1 and avoid the pit of doom at -1.

Markov Decision Processes (MDPs) are made up of:

- A set of states $s$
- A set of actions $a$
- A transition function $T(s, a, s')$
- A reward function $R(s)$

The transition function $T(s, a, s')$ returns the probability of ending up in state $s'$ when taking action $a$ from state $s$. So if you choose $a = \text{West}$, you might end up going North with some probability. Our running example in the video is that you move in your intended direction with probability 0.8, and with probability 0.1 each you slip to the left or to the right of that direction.

Finally, the Reward function $R(s)$ gives you the reward for being in a state. Many states have a reward of zero.
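To make this concrete, here is a minimal Python sketch of what the transition and reward functions for this Grid World might look like. The coordinate scheme ((column, row) with (0, 0) at the bottom-left start square) and all of the names are assumptions for illustration, and wall/blocked-square handling is left out.

```python
# A sketch of Grid World's T(s, a, s') and R(s), assuming (column, row)
# coordinates with (0, 0) at the bottom-left "start" square.

ACTIONS = {"North": (0, 1), "South": (0, -1), "East": (1, 0), "West": (-1, 0)}

# 90-degree neighbors of each intended direction, used for the "slip" outcomes.
LEFT = {"North": "West", "West": "South", "South": "East", "East": "North"}
RIGHT = {"North": "East", "East": "South", "South": "West", "West": "North"}


def transition(state, action):
    """Return a list of (next_state, probability) pairs, i.e. T(s, a, .).

    The agent moves in the intended direction with probability 0.8, and
    slips 90 degrees left or right with probability 0.1 each. A complete
    implementation would also keep the agent in place when it would walk
    into a wall or the blocked square B.
    """
    x, y = state
    outcomes = []
    for direction, prob in [(action, 0.8), (LEFT[action], 0.1), (RIGHT[action], 0.1)]:
        dx, dy = ACTIONS[direction]
        outcomes.append(((x + dx, y + dy), prob))
    return outcomes


def reward(state):
    """R(s): +1 at the goal, -1 at the pit of doom, 0 everywhere else."""
    if state == (3, 2):
        return 1.0
    if state == (3, 1):
        return -1.0
    return 0.0
```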

Why Markov?

These are called Markov Decision Processes because they use the Markov Assumption, which simplifies the world by saying that we only need the current state of the world to make our prediction about the next state. You don’t need any more history, just the current state.

Why use an MDP?

The goal of an MDP is to come up with an optimal Policy: a mapping that identifies the single best action to take in any given state. In our Grid World, this is just labeling each square in the grid with the best action.

A policy is represented by $\pi$ and the optimal policy is $\pi^*$.
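Concretely, a policy for Grid World is just a lookup table from squares to actions. A tiny illustrative sketch (the coordinates and action choices here are made up, not the optimal policy):

```python
# A policy maps each state (grid square) to a single action.
policy = {
    (0, 0): "North",   # start square
    (0, 1): "North",
    (0, 2): "East",
    (1, 2): "East",
    (2, 2): "East",
    # ... one entry for every non-terminal square
}
```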

You should be thinking of Expectimax when considering Grid World. MDPs are a generalization of that model, but they provide a richer framework and a dynamic programming algorithm that can solve these kinds of problems more efficiently.

See the lecture videos or textbook for more details and motivation.

The MDP Horizon

When defining Markov Decision Processes (MDPs) and trying to identify the optimal policy, a critical question arises regarding the time horizon: How far into the future are we looking?

If we assume an infinite horizon where the agent operates forever, the accumulated utility $U(r_1, r_2, r_3, \dots)$ could conceivably grow unbounded. This poses a mathematical challenge for comparing policies, as many policies might result in infinite utility.

Fortunately, we can resolve this by introducing a discount factor, $\gamma$, typically chosen in the range $0 \leq \gamma < 1$. With discounting, the Utility function is guaranteed to be bounded. This convergence follows from the geometric series summation formula, which holds for any infinite sum of this form:

$$
t b^0 + t b^1 + t b^2 + \cdots = \frac{t}{1-b} \quad \text{if } 0 \leq b < 1
$$

By substituting $\gamma$ for $b$, we see that the total Utility must be bounded by a constant, provided the rewards are finite.
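For instance, if every reward has magnitude at most some constant $R_{\max}$ (a name introduced here just for this bound), substituting $t = R_{\max}$ and $b = \gamma$ gives:

$$
\left| \sum_{k=0}^{\infty} \gamma^k r_k \right| \;\le\; \sum_{k=0}^{\infty} \gamma^k R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}
$$

So with $R_{\max} = 1$ and $\gamma = 0.9$, the total discounted utility can never exceed 10 in magnitude.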

Solving MDPs: The Expectimax Analogy

Now that we have formally defined MDPs, how do we “solve” them?

We can conceptualize the solution process using a framework similar to expectimax. In this search tree analogy:

- States correspond to max nodes: the agent picks the action with the highest expected value.
- Taking an action leads to a chance node, representing the state-action pair before the outcome is known.

From the chance nodes, the system transitions to new max nodes (states) based on specific probabilities. These probabilities are defined by the Transition Function, $T(s, a, s')$.

Value Functions

To formalize the quality of states and actions, we introduce the Value of a state.

The Value of a State: $V(s)$

We define the value of a state, represented as $V(s)$, as the immediate reward received from that state plus the sum of all future rewards obtained after moving on from that state.

Predicting the future value creates a dependency on behavior: the value depends heavily on how the agent acts in the future. To address this, we assume the agent acts optimally. We denote by $V^*(s)$ the value of the state, i.e., the sum of its reward and all future rewards (discounted by $\gamma$), if the agent is acting optimally.

Since $V$ is a function of states, the various $V$ values in the search tree are associated with the max nodes.

The Quality of a State-Action Pair: $Q(s,a)$

We can also estimate the utility of the chance nodes that represent the result of taking a particular action. This is a function of a pair: a state and an action. We call this $Q$ (standing for Quality).

$Q(s,a)$ represents the estimate of the discounted sum of future rewards obtained by taking action $a$ in state $s$. Similar to state Values, the Q-value depends on how the agent behaves in the future. Therefore, we define $Q^*(s,a)$ to be the Q-value if the agent acts optimally after performing the initial action $a$.

Policies and Optimal Values

Recall that a policy, denoted as $\pi$, defines the agent’s behavior. The optimal policy is denoted as $\pi^*$.

When we calculate $V$ or $Q$ assuming the agent follows a specific policy $\pi$, we denote this as $V^{\pi}$. Consequently, the optimal value $V^*$ is mathematically equivalent to $V^{\pi^*}$.

The Bellman Equations

How do we actually calculate $V^*(s)$? We use a recursive definition known as the Bellman Equation.

First, the value of a state is simply the maximum Q-value available from that state:

$$
V^*(s) = \max_a Q^*(s,a)
$$

Second, how do we find $Q$? The Q-value is the probability-weighted average of the outcomes of the action:

$$
Q^*(s,a) = \sum_{s'} T(s,a,s')\,[R(s') + \gamma V^*(s')]
$$

Often, these two steps are combined into a single recursive equation:

$$
V^*(s) = \max_a \sum_{s'} T(s,a,s')\,[R(s') + \gamma V^*(s')]
$$
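To see how the equation maps to code, here is a minimal Python sketch of one Bellman backup for a single state, assuming the transition model for that state is given as a dictionary from actions to lists of (next_state, probability) pairs. The names and data layout are assumptions for illustration, not the course’s code.

```python
def bellman_backup(T_s, R, V, gamma):
    """Compute max_a sum_{s'} T(s, a, s') * (R(s') + gamma * V(s')).

    T_s:   {action: [(next_state, probability), ...]} for one state s
    R:     function mapping a state to its reward
    V:     dict of current value estimates for all states
    gamma: discount factor
    """
    return max(
        sum(prob * (R(s2) + gamma * V[s2]) for s2, prob in outcomes)
        for outcomes in T_s.values()
    )


# Example: one action ("East") with the 0.8 / 0.1 / 0.1 outcome distribution.
T_s = {"East": [((1, 0), 0.8), ((0, 1), 0.1), ((0, 0), 0.1)]}
V = {(1, 0): 0.5, (0, 1): 0.2, (0, 0): 0.0}
print(bellman_backup(T_s, lambda s: 0.0, V, gamma=0.9))  # approximately 0.378
```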

Solution Algorithms

We could attempt to solve these equations using expectimax search. However, standard expectimax performs redundant work in spaces where a small number of states are highly interconnected (cyclic graphs).

Instead, we utilize Dynamic Programming (specifically an algorithm called Value Iteration). This approach effectively starts at the “leaves” of the conceptual tree and works its way up, layer by layer, refining the value estimates until they converge.
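Here is a minimal sketch of Value Iteration under the same assumed representation (an iterable of states, T[s] as a dictionary from actions to (next_state, probability) lists, and R as a function of states). It illustrates the idea rather than being a definitive implementation:

```python
def value_iteration(states, T, R, gamma=0.9, tolerance=1e-6):
    """Repeatedly apply the Bellman equation to every state until the
    value estimates stop changing (within a small tolerance)."""
    V = {s: 0.0 for s in states}           # start with all values at zero
    while True:
        new_V = {}
        delta = 0.0                         # largest change in this sweep
        for s in states:
            if T[s]:                        # states with at least one action
                new_V[s] = max(
                    sum(p * (R(s2) + gamma * V[s2]) for s2, p in outcomes)
                    for outcomes in T[s].values()
                )
            else:                           # terminal state: no future reward
                new_V[s] = 0.0
            delta = max(delta, abs(new_V[s] - V[s]))
        V = new_V
        if delta < tolerance:
            return V
```

Once the values converge, the optimal policy can be read off by choosing, in each state, the action whose expected value (its Q-value) achieves the maximum.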

Examples