
Reinforcement Learning

Optional, detailed notes from UW

While Value Iteration is an elegant solution for Markov Decision Processes (MDPs), it faces a significant practical hurdle: the requirement of a complete environmental model. Specifically, look at the Bellman equations:

$$V^*(s) = \max_a Q^*(s,a)$$

$$Q^*(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]$$

However, we now have a big question: where do the transition function $T(s,a,s')$ and the reward function $R(s,a,s')$ come from? In most real-world applications, these functions are not known.
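To make that dependence concrete, here is a minimal sketch of a single Bellman backup for one state; it cannot even be written down without $T$ and $R$ as explicit inputs. The dictionary-based representation is just an assumption for illustration.

```python
def bellman_backup(s, actions, T, R, V, gamma=0.9):
    """Return max_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * V(s') ].

    Assumed (illustrative) representation:
      T[(s, a)]         -- dict {s_next: probability}
      R[(s, a, s_next)] -- scalar reward
      V[s]              -- current value estimate
    """
    best = float("-inf")
    for a in actions:
        q = sum(p * (R[(s, a, s2)] + gamma * V[s2])
                for s2, p in T[(s, a)].items())
        best = max(best, q)
    return best
```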

One solution is to let an agent explore. By placing an agent in the environment and allowing it to move, we can record the outcomes of various actions.

While we could build a model and then run Value Iteration, this Model-Based approach is slow—requiring extensive exploration followed by intensive computation. This leads us to Model-Free Learning: can we estimate values directly from experience without explicitly building a model?


Direct Estimation

Direct Estimation is the most intuitive approach to model-free learning. While it has efficiency flaws, it provides the conceptual foundation for more advanced methods. The basic idea is that we will take samples from the space of values and estimate them directly. What does that mean? Let’s start with a simple example.

Consider calculating the expected age of students in a classroom: if we knew the distribution of ages we could compute the expectation exactly, $E[A] = \sum_a P(a)\,a$, but we usually don’t know $P(a)$. Instead, we can simply ask a sample of students and average their answers; with enough samples, the average converges to the true expectation.

What does this mean for our MDPs? We can estimate the values of states by running several episodes, where each episode gives us samples. In each episode, after the agent passes through a state, keep track of the total discounted reward received from that state onward. When the episode is done, that sum is a single sample of the value of each state visited. Run multiple episodes, and average the samples for each state.
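A minimal sketch of direct estimation in Python, assuming each episode is available as a list of (state, reward) pairs in the order they occurred (that representation is an assumption for illustration):

```python
from collections import defaultdict

def direct_estimation(episodes, gamma=0.9):
    """Estimate V(s) by averaging the discounted returns observed
    after each visit to s, across many episodes."""
    totals = defaultdict(float)   # sum of sampled returns per state
    counts = defaultdict(int)     # number of samples per state
    for episode in episodes:
        G = 0.0
        samples = []
        # Walk the episode backwards so the discounted return
        # from each step is G_t = r_t + gamma * G_{t+1}.
        for (s, r) in reversed(episode):
            G = r + gamma * G
            samples.append((s, G))
        for s, G in samples:
            totals[s] += G
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}
```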

Summary of Direct Estimation: it is simple, it requires no knowledge of $T$ or $R$, and with enough samples it eventually converges to the correct values. But it is wasteful: each state is learned in isolation, ignoring the connections between states that the Bellman equations express, and nothing can be updated until an episode finishes, so learning is slow.


Temporal Difference (TD) Learning

We want to use the sampling approach of Direct Estimation but also incorporate the state-space structure. Instead of waiting for an entire episode to finish to get one sample, we update our estimates after every single step.

We treat a single transition as a sample of the value:

$$sample_1 = R(s,a,s'_1) + \gamma V^\pi(s'_1)$$

$$sample_2 = R(s,a,s'_2) + \gamma V^\pi(s'_2)$$

$$\ldots$$

And then we can make our updated value for state s be the average of the samples:

$$V^\pi(s) = \frac{1}{n} \sum_{i} sample_i$$

This is still problematic: the samples depend on our current estimate of $V^\pi$, which starts out inaccurate, so if we only average at the end we will run for a long time with bad values. What if we update $V$ as we go? There’s a way to do this with a running average.

The Running Average

The basic idea is that we have a stream of numbers coming in, one after the other, and we want to compute their average at each moment. We could keep a running sum and, at each step, divide by the count of numbers seen so far, but there are reasons not to do that; for instance, if we’re going to see a lot of numbers, the sum could overflow our integer variables. Let’s assume the numbers come in as $n_0, n_1, n_2, \ldots$. The averages will be:

$$a_0 = n_0$$

$$a_1 = (n_0+n_1)/2 = (1 - 0.5)\,a_0 + 0.5\, n_1$$

$$a_2 = (n_0+n_1+n_2)/3 = (1 - 0.3333)\,a_1 + 0.3333\, n_2$$

So we can keep a running average by weighting the previous average by $(1-\alpha)$ and the new number by $\alpha$, where $\alpha$ decreases over time (above, $\alpha = 1/k$ for the $k$-th number). By adjusting the size of $\alpha$ we can put disproportionate weight on newer or older samples; if $\alpha$ is instead held constant, more recent values end up weighted more heavily.
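A small sketch of both variants, a decreasing $\alpha$ that reproduces the exact mean and a constant $\alpha$ that favors recent numbers; the input values are made up purely for illustration:

```python
def running_average(stream, alpha=None):
    """Incremental average of a stream of numbers.

    alpha=None uses alpha_k = 1/k, which reproduces the exact mean;
    a constant alpha weights recent numbers more heavily
    (and is biased by the avg = 0 initialization early on).
    """
    avg = 0.0
    for k, n in enumerate(stream, start=1):
        a = (1.0 / k) if alpha is None else alpha
        avg = (1 - a) * avg + a * n
    return avg

print(running_average([4, 8, 6]))             # 6.0, the exact mean
print(running_average([4, 8, 6], alpha=0.5))  # 5.5, tilted toward the latest numbers
```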

This is a handy trick, but what does it mean for $V$? We keep a running average of the samples: weight the previous estimate by $(1-\alpha)$ and the new sample by $\alpha$:

$$V^\pi(s) \leftarrow (1-\alpha)\,V^\pi(s) + \alpha\,(\text{sample})$$

This is often rewritten to show the “update” to the current value:

$$V^\pi(s) \leftarrow V^\pi(s) + \alpha\,( \text{sample} - V^\pi(s) )$$

$$V^\pi(s) \leftarrow V^\pi(s) + \alpha\,( R(s,a,s') + \gamma V^\pi(s') - V^\pi(s) )$$

Because we are looking at the difference between values at different points in time, this is called Temporal Difference (TD) learning. It is acceptable to weight newer samples more heavily because early samples are based on inaccurate, initial values of $V$. To guarantee convergence, we decrease $\alpha$ over time, often as a function of state visits: $\alpha_t = \frac{1}{n(s_t)}$.
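Putting these pieces together, here is a sketch of TD value learning for a fixed policy. The environment interface (`env.reset()` and `env.step(a)` returning `(s_next, reward, done)`) and the `policy` function are assumptions made for illustration:

```python
from collections import defaultdict

def td_value_learning(env, policy, episodes=1000, gamma=0.9):
    """Estimate V^pi from experience with TD(0), using alpha = 1/n(s)."""
    V = defaultdict(float)   # value estimates, initialized to 0
    n = defaultdict(int)     # visit counts per state
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            n[s] += 1
            alpha = 1.0 / n[s]
            # Nudge V(s) toward the one-step sample R + gamma * V(s').
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V
```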

This is great. It turns out that this can work really well for lots of problems, but it has one flaw: we still don’t get a policy out of it! The policy is the argmax of the Q-values, but computing those Q-values from $V$ still requires us to know the model ($T$ and $R$). That will be solved next.

So is temporal difference useless?

Not completely. Recall that there were multiple problems with Value Iteration. One was that it requires a model, but another is that if there are a very large number of states, it can take too long. TD learning can be used in problems where we DO know the model but the state space is too large to run Value Iteration: we sample from the model and act as if the model were unknown, and TD works just as if we were observing the real world.
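When the model is known, generating that experience amounts to sampling successor states from $T$; a minimal sketch, using the same assumed dictionary-based model as above:

```python
import random

def sample_transition(s, a, T, R):
    """Draw s' ~ T(s, a, .) from a known model and return (s', reward),
    so a TD learner can treat the model like a real environment."""
    next_states = list(T[(s, a)].keys())
    probs = [T[(s, a)][s2] for s2 in next_states]
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return s_next, R[(s, a, s_next)]
```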

What are examples of this? Games. TD learning was at the core of TD-Gammon, the first AI backgammon player to play at an expert level.


Q-Learning

But still, we’d like to be able to use this TD technique so that we actually get a policy out of it, not just the values. To extract a policy without a model, we apply the TD technique to the Q-function rather than the Value function. Recall that:

$$\pi^*(s) = \arg\max_a Q^*(s,a)$$

Accurate Q-values give us the policy directly.

The Q-Update Rule

We derive the update rule by substituting the relationship $V^*(s') = \max_{a'} Q^*(s', a')$ into our TD sample:

$$\text{sample} = R(s,a,s') + \gamma \max_{a'} Q(s', a')$$

Then, as before, we’ll take our samples and incorporate them into a running average:

$$\text{sample} = R(s,a,s') + \gamma \max_{a'} Q^{\pi}(s',a')$$

$$Q^{\pi}(s,a) \leftarrow (1-\alpha)\,Q^{\pi}(s,a) + \alpha\,(\text{sample})$$

The full Q-Learning update becomes:

$$Q^{\pi}(s,a) \leftarrow Q^{\pi}(s,a) + \alpha \left[ R(s,a,s') + \gamma \max_{a'} Q^{\pi}(s', a') - Q^{\pi}(s,a) \right]$$

Now we have a model-free way to learn the optimal policy. As we explore, we update our Q-values; better Q-values immediately give us a better policy, a better policy makes our exploration more effective, and more effective exploration makes the Q-values more accurate still.
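A sketch of the tabular Q-Learning loop. The environment interface is the same assumed one as before, and `behavior_policy` stands in for whatever exploration strategy we choose (discussed next):

```python
from collections import defaultdict

def q_learning(env, actions, behavior_policy, episodes=1000, gamma=0.9, alpha=0.1):
    """Tabular Q-Learning. The behavior policy only decides which actions
    to try; the update itself always backs up the max over next actions."""
    Q = defaultdict(float)  # Q[(s, a)], initialized to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = behavior_policy(Q, s)
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, a2)] for a2 in actions)
            # Nudge Q(s, a) toward the one-step sample.
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```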


The Exploration-Exploitation Trade-off

Q-learning is guaranteed to converge to the optimal policy as long as every state-action pair is tried infinitely often (and the learning rate $\alpha$ is decayed appropriately). That means we can run almost any policy while learning, as long as it eventually tries everything enough times. But in the meantime we have several problems:

  1. If we have a bad policy, we’re getting bad reward. We’re missing out!

  2. If the bad policy happens to omit several states, we won’t ever learn about them.

  3. Infinity is a long time, and we probably want to spend our time exploring areas of the space more likely to get us good stuff.

So, we’ll definitely want to update our policy as we go. But even so, we can have bad luck and end up with a policy that stays away from a part of the space that we need, and the environment might even be changing over time. What we would like to do is mostly act according to our current policy, but sometimes take exploratory actions.

This introduces the Exploration-Exploitation trade-off: exploitation means following the current best policy to collect reward, while exploration means trying other actions to gather information that could improve the policy.

Common strategies include acting $\epsilon$-greedily (follow the current policy most of the time, but take a random action with small probability $\epsilon$) and using exploration functions that treat rarely visited states optimistically.
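A sketch of the $\epsilon$-greedy idea as a behavior policy that plugs into the Q-Learning loop above; the function name and signature are just illustrative:

```python
import random

def make_epsilon_greedy(actions, epsilon=0.1):
    """Return a behavior policy that explores with probability epsilon
    and otherwise exploits the current Q-value estimates."""
    def policy(Q, s):
        if random.random() < epsilon:
            return random.choice(actions)             # explore
        return max(actions, key=lambda a: Q[(s, a)])  # exploit
    return policy
```

Decaying $\epsilon$ over time shifts the agent from exploration toward exploitation as its estimates improve.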