PSet 12: Model-Free - SI420 Artificial Intelligence

(videos: Direct Evaluation and Temporal Difference)

Assume we have a policy for grid world that is actually optimal, $\pi=\pi^*$, but our value estimates all start at 0. Compute the $V^{\pi}$ estimates for (3,3) and (3,2) using Direct Evaluation. You observe the following four epochs, each starting at (3,2) and written in (col,row) format as alternating states and actions:
1. (3,2) N (3,3) E (4,3)
2. (3,2) N (3,3) E (3,3) E (4,3)
3. (3,2) N (4,2)
4. (3,2) N (3,3) E (4,3)
$\gamma = 0.9$
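The bookkeeping for Direct Evaluation can be sketched in Python as follows. This is a minimal sketch, not the official solution. The problem does not state the rewards, so the `reward` function below is an assumption (+1 on entering (4,3), -1 on entering (4,2), 0 otherwise). Averaging over every visit to a state, rather than only the first visit per epoch, is also an assumption. The names `reward`, `EPISODES`, and `direct_evaluation` are hypothetical.

```python
from collections import defaultdict

GAMMA = 0.9

# ASSUMPTION (not stated in the problem): entering (4,3) pays +1,
# entering (4,2) pays -1, and every other transition pays 0.
def reward(s, a, s_next):
    return {(4, 3): 1.0, (4, 2): -1.0}.get(s_next, 0.0)

# The four observed epochs, alternating states and actions.
EPISODES = [
    [(3, 2), "N", (3, 3), "E", (4, 3)],
    [(3, 2), "N", (3, 3), "E", (3, 3), "E", (4, 3)],
    [(3, 2), "N", (4, 2)],
    [(3, 2), "N", (3, 3), "E", (4, 3)],
]

def direct_evaluation(episodes, gamma=GAMMA):
    """Average the observed discounted return over every visit to each state."""
    totals = defaultdict(float)  # sum of returns seen from each state
    counts = defaultdict(int)    # number of visits to each state
    for ep in episodes:
        g = 0.0
        # Walk backwards over the (s, a, s') triples so the return
        # from each state accumulates in a single pass.
        for i in range(len(ep) - 3, -1, -2):
            s, a, s_next = ep[i], ep[i + 1], ep[i + 2]
            g = reward(s, a, s_next) + gamma * g
            totals[s] += g
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

if __name__ == "__main__":
    for state, value in sorted(direct_evaluation(EPISODES).items()):
        print(state, round(value, 4))
```

If the course uses a different reward convention (for example, a nonzero living reward), only `reward` needs to change.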
Same problem, but use TD learning. Start each training episode at (3,2) and run the same four epochs shown above. Show your work, updating $V^{\pi}(s)$ for (3,2) and (3,3) as you go. All states begin with $V^{\pi}(s)=0$. Clearly show the final values after the 4 epochs.
$\gamma=0.9$, $\alpha_i = \frac{1}{c_i}$
As a reminder, the update equation from the video and notes is:
$V^\pi(s) \leftarrow V^\pi(s) + \alpha \left( R(s,a,s') + \gamma V^\pi(s') - V^\pi(s) \right)$ (1)
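The TD update above can be applied transition by transition, as in the following minimal Python sketch. It is not the official solution, and it rests on assumptions: the rewards are not given in the problem, so `reward` assumes +1 on entering (4,3), -1 on entering (4,2), and 0 otherwise. It also assumes $c_i$ counts how many times state $i$ has been updated so far, including the current update, and that terminal states keep $V^{\pi}(s)=0$. The names `reward`, `EPISODES`, and `td_evaluation` are hypothetical.

```python
from collections import defaultdict

GAMMA = 0.9

# ASSUMPTION (not stated in the problem): entering (4,3) pays +1,
# entering (4,2) pays -1, and every other transition pays 0.
def reward(s, a, s_next):
    return {(4, 3): 1.0, (4, 2): -1.0}.get(s_next, 0.0)

# The four observed epochs, alternating states and actions.
EPISODES = [
    [(3, 2), "N", (3, 3), "E", (4, 3)],
    [(3, 2), "N", (3, 3), "E", (3, 3), "E", (4, 3)],
    [(3, 2), "N", (4, 2)],
    [(3, 2), "N", (3, 3), "E", (4, 3)],
]

def td_evaluation(episodes, gamma=GAMMA):
    """Apply the TD update once per observed (s, a, s') transition."""
    V = defaultdict(float)     # all states start at 0
    visits = defaultdict(int)  # c_i: updates applied to each state so far
    for ep in episodes:
        # Walk forwards: TD updates use the current estimate of V(s').
        for i in range(0, len(ep) - 2, 2):
            s, a, s_next = ep[i], ep[i + 1], ep[i + 2]
            visits[s] += 1
            alpha = 1.0 / visits[s]  # ASSUMPTION: alpha_i = 1 / c_i
            sample = reward(s, a, s_next) + gamma * V[s_next]
            V[s] += alpha * (sample - V[s])
    return dict(V)

if __name__ == "__main__":
    V = td_evaluation(EPISODES)
    for state in [(3, 2), (3, 3)]:
        print(state, round(V[state], 4))
```

Printing `V` inside the inner loop gives the step-by-step values that the problem asks you to show.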