
PSet 12: Model-Free

(videos: Direct Evaluation and Temporal Difference)

  1. Assume we have a policy for a grid world that is actually optimal, $\pi = \pi^*$, but our values are all 0. Compute $V^{\pi}$ estimates for (3,3) and (3,2) using Direct Evaluation. You observe the following four epochs, in (col, row) format, starting at (3,2) and alternating states and actions:

    1. (3,2) N (3,3) E (4,3)

    2. (3,2) N (3,3) E (3,3) E (4,3)

    3. (3,2) N (4,2)

    4. (3,2) N (3,3) E (4,3)

    $\gamma = 0.9$
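
    The problem does not state a reward function, so here is a minimal sketch of Direct Evaluation under an assumed standard-gridworld reward model (+1 for the transition into (4,3), −1 into (4,2), 0 otherwise): collect the discounted return observed from every visit to each state, then average.

    ```python
    from collections import defaultdict

    GAMMA = 0.9

    # Assumed reward model (not stated in the problem): +1 for the
    # transition into (4,3), -1 into (4,2), 0 otherwise.
    def reward(s, a, s2):
        return {(4, 3): 1.0, (4, 2): -1.0}.get(s2, 0.0)

    # The four observed epochs, alternating states and actions, (col, row).
    epochs = [
        [(3, 2), 'N', (3, 3), 'E', (4, 3)],
        [(3, 2), 'N', (3, 3), 'E', (3, 3), 'E', (4, 3)],
        [(3, 2), 'N', (4, 2)],
        [(3, 2), 'N', (3, 3), 'E', (4, 3)],
    ]

    returns = defaultdict(list)
    for epoch in epochs:
        states, actions = epoch[0::2], epoch[1::2]
        # Record the discounted return from every visit to each state.
        for i, s in enumerate(states[:-1]):
            g = sum(GAMMA**k * reward(states[i + k], actions[i + k], states[i + k + 1])
                    for k in range(len(actions) - i))
            returns[s].append(g)

    # Direct Evaluation estimate: the sample mean of observed returns.
    V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
    print(V[(3, 2)], V[(3, 3)])
    ```

    This uses every-visit averaging, so the doubled visit to (3,3) in epoch 2 contributes two return samples.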

  2. Same problem, but use TD learning. Start each training episode at (3,2), and run the same four epochs as shown above. Show your work. Update $V^{\pi}(s)$ for (3,2) and (3,3) as you go. All states begin with $V^{\pi}(s) = 0$. Clearly show the final values after the 4 epochs.

    $\gamma = 0.9$, $\alpha_i = \frac{1}{c_i}$

    As a reminder of the update equation from the video and notes:

    $$V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha \left( R(s,a,s') + \gamma V^{\pi}(s') - V^{\pi}(s) \right)$$
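
    The update above can be sketched as follows. As in the first problem, the reward model is an assumption (+1 into (4,3), −1 into (4,2), 0 otherwise), and $c_i$ is assumed to count how many times the state being updated has been updated so far.

    ```python
    from collections import defaultdict

    GAMMA = 0.9

    # Assumed reward model (not stated in the problem): +1 for the
    # transition into (4,3), -1 into (4,2), 0 otherwise.
    def reward(s, a, s2):
        return {(4, 3): 1.0, (4, 2): -1.0}.get(s2, 0.0)

    epochs = [
        [(3, 2), 'N', (3, 3), 'E', (4, 3)],
        [(3, 2), 'N', (3, 3), 'E', (3, 3), 'E', (4, 3)],
        [(3, 2), 'N', (4, 2)],
        [(3, 2), 'N', (3, 3), 'E', (4, 3)],
    ]

    V = defaultdict(float)    # all states begin with V(s) = 0
    count = defaultdict(int)  # c_i: assumed per-state update counter

    for epoch in epochs:
        states, actions = epoch[0::2], epoch[1::2]
        for s, a, s2 in zip(states, actions, states[1:]):
            count[s] += 1
            alpha = 1.0 / count[s]  # alpha_i = 1 / c_i
            # TD update toward the one-step sample target
            V[s] += alpha * (reward(s, a, s2) + GAMMA * V[s2] - V[s])

    print(V[(3, 2)], V[(3, 3)])
    ```

    Note that terminal states like (4,3) are never the source of a transition, so their values stay at 0 and the bootstrap term $\gamma V^{\pi}(s')$ contributes nothing on exit transitions.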