
PSet 10: MDP Basics

(videos: MDPs and Reward)

  1. For the grid world in the lecture, calculate which states can be reached from (1,1) with the action sequence [North, North], and with what probabilities. Use the video's grid: (1,1) is the bottom left, (1,2) is the middle of the leftmost column, and so on. Give every possible ending state with its probability after the two North moves. (A brute-force check is sketched after this problem list.)

  2. For the following world with rewards:

    $$\begin{array}{|c|c|c|c|c|} \hline 10 & \quad & \quad & \quad & 1 \\ \hline \end{array}$$

    All actions succeed as intended. When a reward is received, the game is over.

    a. What is the optimal policy when $\gamma = 0.9$?

    b. What is the optimal policy when $\gamma = 0.1$? (A numeric check is sketched after the problem list.)

  3. For the following world with rewards:

    $$\begin{array}{|c|c|c|c|c|c|} \hline 10 & \quad & \quad & a & \quad & 1 \\ \hline \end{array}$$

    At what $\gamma$ is it equally good to go east or west from the state labeled $a$? (See the indifference scan after the problem list.)
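
The sketches below are optional scaffolding for checking answers numerically; none of them comes from the lecture. This first one brute-forces problem 1 by pushing a probability distribution through the action sequence. It assumes the standard 4×3 grid world with the 0.8/0.1/0.1 transition model (0.8 for the intended direction, 0.1 for each perpendicular slip, and bumping a wall or edge leaves you in place) and an obstacle at (2,2); terminal cells are ignored since none is reachable in two moves from (1,1). Adjust these assumptions if the video's grid differs.

```python
from collections import defaultdict

COLS, ROWS = 4, 3
WALLS = {(2, 2)}                       # interior obstacle (assumed)
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERP = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}

def step(state, direction):
    """Deterministic move; hitting a wall or the edge leaves you in place."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    if nxt in WALLS or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS):
        return state
    return nxt

def propagate(dist, action):
    """Push a distribution over states through one stochastic action."""
    out = defaultdict(float)
    for state, p in dist.items():
        out[step(state, action)] += 0.8 * p        # intended direction
        for side in PERP[action]:
            out[step(state, side)] += 0.1 * p      # slips sideways
    return out

dist = {(1, 1): 1.0}
for action in ["N", "N"]:
    dist = propagate(dist, action)
for state, p in sorted(dist.items()):
    print(state, round(p, 3))
```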
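
For problem 2, the choice in each cell reduces to comparing the discounted value of reaching each end: $10\gamma^{d_W}$ versus $1 \cdot \gamma^{d_E}$, where $d_W$ and $d_E$ count steps west and east. Whether or not the first step is discounted scales both sides equally, so that convention does not change the comparison. A minimal check, assuming deterministic moves in the five-cell corridor:

```python
def returns(cell, gamma):
    """Discounted value of heading straight to each end from `cell`."""
    d_west = cell - 1                  # steps to reach the 10 in cell 1
    d_east = 5 - cell                  # steps to reach the 1 in cell 5
    return 10 * gamma ** d_west, 1 * gamma ** d_east

for gamma in (0.9, 0.1):
    policy = {}
    for cell in range(2, 5):           # interior cells 2, 3, 4
        west, east = returns(cell, gamma)
        policy[cell] = "W" if west > east else "E"
    print(f"gamma={gamma}:", policy)
```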
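
For problem 3, indifference at $a$ means $10\gamma^{d_W} = 1 \cdot \gamma^{d_E}$, with $d_W$ and $d_E$ the step counts from $a$ to each reward. This is solvable by hand; the scan below merely confirms the root numerically. The distances are read off the six-cell corridor and assume steps are counted the same way in both directions.

```python
d_w, d_e = 3, 2                        # steps from `a` to the 10 and the 1
best = min((abs(10 * g ** d_w - 1 * g ** d_e), g)
           for g in (i / 1000 for i in range(1, 1000)))
print("indifference near gamma =", best[1])
```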

Challenge Problems

  1. You can formulate reward as a function of state, R(s), or as a function of state and action, R(s,a). Show that if you have a formulation of an MDP with the reward in R(s,a) form, you can convert it to R(s) form, and vice versa. You may want to create additional states. (One possible construction is sketched after these problems.)

  2. Is the optimal policy in one formulation equivalent to the optimal policy in the other?
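
For the first challenge problem, one direction is immediate: an R(s) MDP is already in R(s,a) form by setting R(s,a) := R(s) for every a. For the other direction, one standard construction (an assumption here, not necessarily the intended one) inserts a marker state for each (s,a) pair. Each original step becomes two, so the discount becomes $\sqrt{\gamma}$ and rewards are rescaled to keep returns identical. A sketch:

```python
import math

def to_state_reward(S, A, P, R_sa, gamma):
    """Convert an R(s, a) MDP into an equivalent R(s) MDP.

    P[(s, a)] maps next states to probabilities; R_sa[(s, a)] is a scalar.
    Returns (states, actions, transitions, state_rewards, new_gamma).
    """
    root = math.sqrt(gamma)
    S2 = list(S) + [(s, a) for s in S for a in A]   # one marker per (s, a)
    P2, R2 = {}, {}
    for s in S:
        R2[s] = 0.0                                 # original states pay nothing
        for a in A:
            P2[(s, a)] = {(s, a): 1.0}              # taking a enters its marker
    for s in S:
        for a in A:
            # The marker pays R(s, a); the 1/sqrt(gamma) factor compensates
            # for collecting it half an original step later.
            R2[(s, a)] = R_sa[(s, a)] / root
            for b in A:
                P2[((s, a), b)] = dict(P[(s, a)])   # all actions resolve alike
    return S2, A, P2, R2, root
```

Because every action in a marker state behaves identically, a policy in the converted MDP is determined entirely by its choices in the original states, which is a useful starting point for the second question.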