Markov Decision Processes

It’s time to expand beyond two-player games and present a more general model of identifying optimal behavior in an environment.

A common starting example is what’s called “Grid World”, which is just a small grid of squares (4×3 in our example) where you start in one square and move from square to square (north/east/west/south) to reach some goal square. The grid in our lecture video looks sort of like this:

$$
\begin{array}{|c|c|c|c|}
\hline
\quad & \quad & \quad & +1 \\ \hline
\quad & B & \quad & -1 \\ \hline
\text{start} & \quad & \quad & \quad \\ \hline
\end{array}
$$

Your goal is to reach the +1 and avoid the pit of doom at -1.

Markov Decision Processes (MDPs) are made up of:

- A set of states $s$
- A set of actions $a$
- A transition function $T(s, a, s')$
- A reward function $R(s)$

The transition function $T(s, a, s')$ returns the probability of ending up in state $s'$ when taking action $a$ from state $s$. So if you choose $a = \text{West}$, you might end up going North with some probability. Our running example in the video is that you move in your intended direction with probability 0.8, and with probability 0.1 each you slip to the left or to the right of that direction.

Finally, the Reward function $R(s)$ gives you the reward for being in a state. Many states have a reward of zero.
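To make this concrete, here is a minimal Python sketch of what the transition and reward functions for this Grid World might look like. The coordinate scheme ((column, row) with (0, 0) at the bottom-left start square) and all of the names are assumptions for illustration, and wall/blocked-square handling is left out.

```python
# A sketch of Grid World's T(s, a, s') and R(s), assuming (column, row)
# coordinates with (0, 0) at the bottom-left "start" square.

ACTIONS = {"North": (0, 1), "South": (0, -1), "East": (1, 0), "West": (-1, 0)}

# 90-degree neighbors of each intended direction, used for the "slip" outcomes.
LEFT = {"North": "West", "West": "South", "South": "East", "East": "North"}
RIGHT = {"North": "East", "East": "South", "South": "West", "West": "North"}


def transition(state, action):
    """Return a list of (next_state, probability) pairs, i.e. T(s, a, .).

    The agent moves in the intended direction with probability 0.8, and
    slips 90 degrees left or right with probability 0.1 each. A complete
    implementation would also keep the agent in place when it would walk
    into a wall or the blocked square B.
    """
    x, y = state
    outcomes = []
    for direction, prob in [(action, 0.8), (LEFT[action], 0.1), (RIGHT[action], 0.1)]:
        dx, dy = ACTIONS[direction]
        outcomes.append(((x + dx, y + dy), prob))
    return outcomes


def reward(state):
    """R(s): +1 at the goal, -1 at the pit of doom, 0 everywhere else."""
    if state == (3, 2):
        return 1.0
    if state == (3, 1):
        return -1.0
    return 0.0
```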

Why Markov?

These are called Markov Decision Processes because they use the Markov Assumption, which simplifies the world by saying that we only need the current state of the world to make our prediction about the next state. You don’t need any more history, just the current state.

Why use an MDP?

The goal of an MDP is to come up with an optimal Policy: a mapping that identifies the single best action to take in any given state. In our Grid World, this is just labeling each square in the grid with the best action.

A policy is represented by $\pi$ and the optimal policy is $\pi^*$.
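Concretely, a policy for Grid World is just a lookup table from squares to actions. A tiny illustrative sketch (the coordinates and action choices here are made up, not the optimal policy):

```python
# A policy maps each state (grid square) to a single action.
policy = {
    (0, 0): "North",   # start square
    (0, 1): "North",
    (0, 2): "East",
    (1, 2): "East",
    (2, 2): "East",
    # ... one entry for every non-terminal square
}
```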

You should be thinking of Expectimax when considering Grid World. MDPs are a generalization of that model, but they provide a richer framework and a dynamic programming algorithm that can solve these kinds of problems more efficiently.

See the lecture videos or textbook for more details and motivation.

The MDP Horizon

When defining Markov Decision Processes (MDPs) and trying to identify the optimal policy, a critical question arises regarding the time horizon: How far into the future are we looking?

If we assume an infinite horizon where the agent operates forever, the accumulated utility $U(r_1, r_2, r_3, \dots)$ could conceivably grow unbounded. This poses a mathematical challenge for comparing policies, as many policies might result in infinite utility.

Fortunately, we can resolve this by introducing a discount factor, $\gamma$, typically chosen in the range $0 \leq \gamma < 1$. With discounting, the Utility function is guaranteed to be bounded. This convergence follows from the geometric series summation formula, which holds for any infinite sum of this form:

$$
t b^0 + t b^1 + t b^2 + \cdots = \frac{t}{1-b} \quad \text{if } 0 \leq b < 1
$$

By substituting $\gamma$ for $b$, we see that the total Utility must be bounded by a constant, provided the rewards are finite.
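For instance, if every reward has magnitude at most some constant $R_{\max}$ (a name introduced here just for this bound), substituting $t = R_{\max}$ and $b = \gamma$ gives:

$$
\left| \sum_{k=0}^{\infty} \gamma^k r_k \right| \;\le\; \sum_{k=0}^{\infty} \gamma^k R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}
$$

So with $R_{\max} = 1$ and $\gamma = 0.9$, the total discounted utility can never exceed 10 in magnitude.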

Solving MDPs: The Expectimax Analogy

Now that we have formally defined MDPs, how do we “solve” them?

We can conceptualize the solution process using a framework similar to expectimax. In this search tree analogy:

- States correspond to max nodes: the agent picks the action with the highest expected value.
- Taking an action leads to a chance node, representing the state-action pair before the outcome is known.

From the chance nodes, the system transitions to new max nodes (states) based on specific probabilities. These probabilities are defined by the Transition Function, $T(s, a, s')$.

Value Functions

To formalize the quality of states and actions, we introduce the Value of a state.

The Value of a State: $V(s)$

We define the value of a state, represented as $V(s)$, as the immediate reward received from that state plus the sum of all future rewards obtained after moving on from that state.

Predicting the future value creates a dependency on behavior: the value depends heavily on how the agent acts in the future. To address this, we assume the agent acts optimally. We denote by $V^*(s)$ the value of the state, i.e., the sum of its reward and all future rewards (discounted by $\gamma$), if the agent is acting optimally.

Since $V$ is a function of states, the various $V$ values in the search tree are associated with the max nodes.

The Quality of a State-Action Pair: $Q(s,a)$

We can also estimate the utility of the chance nodes that represent the result of taking a particular action. This is a function of a pair: a state and an action. We call this $Q$ (standing for Quality).

$Q(s,a)$ represents the estimate of the discounted sum of future rewards obtained by taking action $a$ in state $s$. Similar to state Values, the Q-value depends on how the agent behaves in the future. Therefore, we define $Q^*(s,a)$ to be the Q-value if the agent acts optimally after performing the initial action $a$.

Policies and Optimal Values

Recall that a policy, denoted as $\pi$, defines the agent’s behavior. The optimal policy is denoted as $\pi^*$.

When we calculate $V$ or $Q$ assuming the agent follows a specific policy $\pi$, we denote this as $V^{\pi}$. Consequently, the optimal value $V^*$ is mathematically equivalent to $V^{\pi^*}$.

The Bellman Equations

How do we actually calculate $V^*(s)$? We use a recursive definition known as the Bellman Equation.

First, the value of a state is simply the maximum Q-value available from that state:

$$
V^*(s) = \max_a Q^*(s,a)
$$

Second, how do we find $Q$? The Q-value is the probability-weighted average of the outcomes of the action:

$$
Q^*(s,a) = \sum_{s'} T(s,a,s')\,[R(s') + \gamma V^*(s')]
$$

Often, these two steps are combined into a single recursive equation:

$$
V^*(s) = \max_a \sum_{s'} T(s,a,s')\,[R(s') + \gamma V^*(s')]
$$
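To see how the equation maps to code, here is a minimal Python sketch of one Bellman backup for a single state, assuming the transition model for that state is given as a dictionary from actions to lists of (next_state, probability) pairs. The names and data layout are assumptions for illustration, not the course’s code.

```python
def bellman_backup(T_s, R, V, gamma):
    """Compute max_a sum_{s'} T(s, a, s') * (R(s') + gamma * V(s')).

    T_s:   {action: [(next_state, probability), ...]} for one state s
    R:     function mapping a state to its reward
    V:     dict of current value estimates for all states
    gamma: discount factor
    """
    return max(
        sum(prob * (R(s2) + gamma * V[s2]) for s2, prob in outcomes)
        for outcomes in T_s.values()
    )


# Example: one action ("East") with the 0.8 / 0.1 / 0.1 outcome distribution.
T_s = {"East": [((1, 0), 0.8), ((0, 1), 0.1), ((0, 0), 0.1)]}
V = {(1, 0): 0.5, (0, 1): 0.2, (0, 0): 0.0}
print(bellman_backup(T_s, lambda s: 0.0, V, gamma=0.9))  # approximately 0.378
```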

Solution Algorithms

We could attempt to solve these equations using expectimax search. However, standard expectimax performs redundant work in spaces where a small number of states are highly interconnected (cyclic graphs).

Instead, we utilize Dynamic Programming (specifically an algorithm called Value Iteration). This approach effectively starts at the “leaves” of the conceptual tree and works its way up, layer by layer, refining the value estimates until they converge.
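Here is a minimal sketch of Value Iteration under the same assumed representation (an iterable of states, T[s] as a dictionary from actions to (next_state, probability) lists, and R as a function of states). It illustrates the idea rather than being a definitive implementation:

```python
def value_iteration(states, T, R, gamma=0.9, tolerance=1e-6):
    """Repeatedly apply the Bellman equation to every state until the
    value estimates stop changing (within a small tolerance)."""
    V = {s: 0.0 for s in states}           # start with all values at zero
    while True:
        new_V = {}
        delta = 0.0                         # largest change in this sweep
        for s in states:
            if T[s]:                        # states with at least one action
                new_V[s] = max(
                    sum(p * (R(s2) + gamma * V[s2]) for s2, p in outcomes)
                    for outcomes in T[s].values()
                )
            else:                           # terminal state: no future reward
                new_V[s] = 0.0
            delta = max(delta, abs(new_V[s] - V[s]))
        V = new_V
        if delta < tolerance:
            return V
```

Once the values converge, the optimal policy can be read off by choosing, in each state, the action whose expected value (its Q-value) achieves the maximum.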

Examples