PSet 12: Model-Free
(videos: Direct Evaluation and Temporal Difference)
Assume we have a policy for grid world that is actually optimal, but our value estimates are all 0. Compute value estimates for (3,3) and (3,2) using Direct Evaluation. You observe the following four epochs, written in (col,row) format, each starting at (3,2) and alternating states and actions:
(3,2) N (3,3) E (4,3)
(3,2) N (3,3) E (3,3) E (4,3)
(3,2) N (4,2)
(3,2) N (3,3) E (4,3)
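A minimal Python sketch of Direct Evaluation on the four epochs above may help in checking hand calculations. It averages the observed return over every visit to a state. The reward model is an assumption for illustration, not something stated in the problem: +1 on entering (4,3), -1 on entering (4,2), a living reward of 0, and a discount of 1. The names `ASSUMED_ENTRY_REWARD`, `ASSUMED_LIVING_REWARD`, `ASSUMED_GAMMA`, and `direct_evaluation` are hypothetical; substitute the values from the course notes.

```python
from collections import defaultdict

# The four observed epochs, alternating states (col, row) and actions.
EPISODES = [
    [(3, 2), "N", (3, 3), "E", (4, 3)],
    [(3, 2), "N", (3, 3), "E", (3, 3), "E", (4, 3)],
    [(3, 2), "N", (4, 2)],
    [(3, 2), "N", (3, 3), "E", (4, 3)],
]

# ASSUMPTIONS (not given in the problem text): rewards and discount.
ASSUMED_ENTRY_REWARD = {(4, 3): 1.0, (4, 2): -1.0}  # reward on entering a state
ASSUMED_LIVING_REWARD = 0.0
ASSUMED_GAMMA = 1.0


def reward(next_state):
    """Reward for a transition, determined by the state entered (assumed model)."""
    return ASSUMED_ENTRY_REWARD.get(next_state, ASSUMED_LIVING_REWARD)


def direct_evaluation(episodes, gamma):
    """Every-visit direct evaluation: average the return seen after each visit."""
    returns = defaultdict(list)
    for episode in episodes:
        states = episode[0::2]  # drop the actions, keep the visited states
        g = 0.0
        # Walk backwards so the return accumulates from the end of the epoch.
        for i in range(len(states) - 2, -1, -1):
            g = reward(states[i + 1]) + gamma * g
            returns[states[i]].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


values = direct_evaluation(EPISODES, ASSUMED_GAMMA)
print(values[(3, 2)], values[(3, 3)])
```

Under these assumed rewards, (3,2) averages the returns 1, 1, -1, 1 and (3,3) averages four returns of 1. Different reward or discount values from the notes will change the numbers but not the procedure.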
Same problem, but use TD learning. Start each training episode at (3,2), and run the same four epochs as shown above. Show your work. Update the values for (3,2) and (3,3) as you go. All states begin with a value of 0. Clearly show the final values after the 4 epochs.
,
As a reminder of the update equation from the video and notes: