
PSet 10: MDP Basics

(videos: MDPs and Reward)

  1. For the grid world in the lecture, calculate which states can be reached from (1,1) with the action sequence [North, North], and with what probabilities. Use the video's grid: (1,1) is the bottom left, (1,2) is the middle of the leftmost column, and so on. Give every possible ending state with its probability after the two North moves. (A brute-force check is sketched after this problem list.)

  2. For the following world with rewards:

    $$\begin{array}{|c|c|c|c|c|} \hline 10 & \quad & \quad & \quad & 1 \\ \hline \end{array}$$

    All actions succeed as intended. When a reward is received, the game is over.

    a. What is the optimal policy when $\gamma = 0.9$?

    b. What is the optimal policy when $\gamma = 0.1$? (A numeric check is sketched after the problem list.)

  3. For the following world with rewards:

    $$\begin{array}{|c|c|c|c|c|c|} \hline 10 & \quad & \quad & a & \quad & 1 \\ \hline \end{array}$$

    At what $\gamma$ is it equally good to go east or west from the state labeled $a$? (See the indifference scan after the problem list.)
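
The sketches below are optional scaffolding for checking answers numerically; none of them comes from the lecture. This first one brute-forces problem 1 by pushing a probability distribution through the action sequence. It assumes the standard 4×3 grid world with the 0.8/0.1/0.1 transition model (0.8 for the intended direction, 0.1 for each perpendicular slip, and bumping a wall or edge leaves you in place) and an obstacle at (2,2); terminal cells are ignored since none is reachable in two moves from (1,1). Adjust these assumptions if the video's grid differs.

```python
from collections import defaultdict

COLS, ROWS = 4, 3
WALLS = {(2, 2)}                       # interior obstacle (assumed)
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERP = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}

def step(state, direction):
    """Deterministic move; hitting a wall or the edge leaves you in place."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    if nxt in WALLS or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS):
        return state
    return nxt

def propagate(dist, action):
    """Push a distribution over states through one stochastic action."""
    out = defaultdict(float)
    for state, p in dist.items():
        out[step(state, action)] += 0.8 * p        # intended direction
        for side in PERP[action]:
            out[step(state, side)] += 0.1 * p      # slips sideways
    return out

dist = {(1, 1): 1.0}
for action in ["N", "N"]:
    dist = propagate(dist, action)
for state, p in sorted(dist.items()):
    print(state, round(p, 3))
```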
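
For problem 2, the choice in each cell reduces to comparing the discounted value of reaching each end: $10\gamma^{d_W}$ versus $1 \cdot \gamma^{d_E}$, where $d_W$ and $d_E$ count steps west and east. Whether or not the first step is discounted scales both sides equally, so that convention does not change the comparison. A minimal check, assuming deterministic moves in the five-cell corridor:

```python
def returns(cell, gamma):
    """Discounted value of heading straight to each end from `cell`."""
    d_west = cell - 1                  # steps to reach the 10 in cell 1
    d_east = 5 - cell                  # steps to reach the 1 in cell 5
    return 10 * gamma ** d_west, 1 * gamma ** d_east

for gamma in (0.9, 0.1):
    policy = {}
    for cell in range(2, 5):           # interior cells 2, 3, 4
        west, east = returns(cell, gamma)
        policy[cell] = "W" if west > east else "E"
    print(f"gamma={gamma}:", policy)
```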
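
For problem 3, indifference at $a$ means $10\gamma^{d_W} = 1 \cdot \gamma^{d_E}$, with $d_W$ and $d_E$ the step counts from $a$ to each reward. This is solvable by hand; the scan below merely confirms the root numerically. The distances are read off the six-cell corridor and assume steps are counted the same way in both directions.

```python
d_w, d_e = 3, 2                        # steps from `a` to the 10 and the 1
best = min((abs(10 * g ** d_w - 1 * g ** d_e), g)
           for g in (i / 1000 for i in range(1, 1000)))
print("indifference near gamma =", best[1])
```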

Challenge Problems

  1. You can formulate reward as a function of state, R(s), or as a function of state and action, R(s,a). Show that if you have a formulation of an MDP with the reward in R(s,a) form, you can convert it to R(s) form, and vice versa. You may want to create additional states. (One possible construction is sketched after these problems.)

  2. Is the optimal policy in one formulation equivalent to the optimal policy in the other?
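
For the first challenge problem, one direction is immediate: an R(s) MDP is already in R(s,a) form by setting R(s,a) := R(s) for every a. For the other direction, one standard construction (an assumption here, not necessarily the intended one) inserts a marker state for each (s,a) pair. Each original step becomes two, so the discount becomes $\sqrt{\gamma}$ and rewards are rescaled to keep returns identical. A sketch:

```python
import math

def to_state_reward(S, A, P, R_sa, gamma):
    """Convert an R(s, a) MDP into an equivalent R(s) MDP.

    P[(s, a)] maps next states to probabilities; R_sa[(s, a)] is a scalar.
    Returns (states, actions, transitions, state_rewards, new_gamma).
    """
    root = math.sqrt(gamma)
    S2 = list(S) + [(s, a) for s in S for a in A]   # one marker per (s, a)
    P2, R2 = {}, {}
    for s in S:
        R2[s] = 0.0                                 # original states pay nothing
        for a in A:
            P2[(s, a)] = {(s, a): 1.0}              # taking a enters its marker
    for s in S:
        for a in A:
            # The marker pays R(s, a); the 1/sqrt(gamma) factor compensates
            # for collecting it half an original step later.
            R2[(s, a)] = R_sa[(s, a)] / root
            for b in A:
                P2[((s, a), b)] = dict(P[(s, a)])   # all actions resolve alike
    return S2, A, P2, R2, root
```

Because every action in a marker state behaves identically, a policy in the converted MDP is determined entirely by its choices in the original states, which is a useful starting point for the second question.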