
Reinforcement Learning

Optional, detailed notes from UW

While Value Iteration is an elegant solution for Markov Decision Processes (MDPs), it faces a significant practical hurdle: the requirement of a complete environmental model. Specifically, look at the Bellman equations:

$$V^*(s) = \max_a Q^*(s,a)$$

$$Q^*(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]$$

However, we now have a big question: where do the transition function $T(s,a,s')$ and the reward function $R(s,a,s')$ come from? In most real-world applications, these functions are not known.
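To make that dependence concrete, here is a minimal sketch of a single Bellman backup for one state; it cannot even be written down without $T$ and $R$ as explicit inputs. The dictionary-based representation is just an assumption for illustration.

```python
def bellman_backup(s, actions, T, R, V, gamma=0.9):
    """Return max_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * V(s') ].

    Assumed (illustrative) representation:
      T[(s, a)]         -- dict {s_next: probability}
      R[(s, a, s_next)] -- scalar reward
      V[s]              -- current value estimate
    """
    best = float("-inf")
    for a in actions:
        q = sum(p * (R[(s, a, s2)] + gamma * V[s2])
                for s2, p in T[(s, a)].items())
        best = max(best, q)
    return best
```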

One solution is to let an agent explore. By placing an agent in the environment and allowing it to move, we can record the outcomes of various actions.

While we could build a model and then run Value Iteration, this Model-Based approach is slow—requiring extensive exploration followed by intensive computation. This leads us to Model-Free Learning: can we estimate values directly from experience without explicitly building a model?


Direct Estimation

Direct Estimation is the most intuitive approach to model-free learning. While it has efficiency flaws, it provides the conceptual foundation for more advanced methods. The basic idea is that we will take samples from the space of values and estimate them directly. What does that mean? Let’s start with a simple example.

Consider calculating the expected age of students in a classroom: if we knew the distribution of ages we could compute the expectation exactly, $E[A] = \sum_a P(a)\,a$, but we usually don’t know $P(a)$. Instead, we can simply ask a sample of students and average their answers; with enough samples, the average converges to the true expectation.

What does this mean for our MDPs? We can estimate the values of states by running several episodes, where each episode gives us samples. In each episode, after the agent passes through a state, keep track of the total discounted reward received from that state onward. When the episode is done, that sum is a single sample of the value of each state visited. Run multiple episodes, and average the samples for each state.
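A minimal sketch of direct estimation in Python, assuming each episode is available as a list of (state, reward) pairs in the order they occurred (that representation is an assumption for illustration):

```python
from collections import defaultdict

def direct_estimation(episodes, gamma=0.9):
    """Estimate V(s) by averaging the discounted returns observed
    after each visit to s, across many episodes."""
    totals = defaultdict(float)   # sum of sampled returns per state
    counts = defaultdict(int)     # number of samples per state
    for episode in episodes:
        G = 0.0
        samples = []
        # Walk the episode backwards so the discounted return
        # from each step is G_t = r_t + gamma * G_{t+1}.
        for (s, r) in reversed(episode):
            G = r + gamma * G
            samples.append((s, G))
        for s, G in samples:
            totals[s] += G
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}
```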

Summary of Direct Estimation: it is simple, it requires no knowledge of $T$ or $R$, and with enough samples it eventually converges to the correct values. But it is wasteful: each state is learned in isolation, ignoring the connections between states that the Bellman equations express, and nothing can be updated until an episode finishes, so learning is slow.


Temporal Difference (TD) Learning

We want to use the sampling approach of Direct Estimation but also incorporate the state-space structure. Instead of waiting for an entire episode to finish to get one sample, we update our estimates after every single step.

We treat a single transition as a sample of the value:

$$sample_1 = R(s,a,s'_1) + \gamma V^\pi(s'_1)$$

$$sample_2 = R(s,a,s'_2) + \gamma V^\pi(s'_2)$$

$$\ldots$$

And then we can make our updated value for state s be the average of the samples:

$$V^\pi(s) = \frac{1}{n} \sum_{i} sample_i$$

This is still problematic: the samples depend on our current estimate of $V^\pi$, which starts out inaccurate, so if we only average at the end we will run for a long time with bad values. What if we update $V$ as we go? There’s a way to do this with a running average.

The Running Average

The basic idea is that we have a stream of numbers coming in, one after the other, and we want to compute their average at each moment. We could keep a running sum and, at each step, divide by the count of numbers seen so far, but there are reasons not to do that; for instance, if we’re going to see a lot of numbers, the sum could overflow our integer variables. Let’s assume the numbers come in as $n_0, n_1, n_2, \ldots$. The averages will be:

$$a_0 = n_0$$

$$a_1 = (n_0+n_1)/2 = (1 - 0.5)\,a_0 + 0.5\, n_1$$

$$a_2 = (n_0+n_1+n_2)/3 = (1 - 0.3333)\,a_1 + 0.3333\, n_2$$

So we can keep a running average by weighting the previous average by $(1-\alpha)$ and the new number by $\alpha$, where $\alpha$ decreases over time (above, $\alpha = 1/k$ for the $k$-th number). By adjusting the size of $\alpha$ we can put disproportionate weight on newer or older samples; if $\alpha$ is instead held constant, more recent values end up weighted more heavily.
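A small sketch of both variants, a decreasing $\alpha$ that reproduces the exact mean and a constant $\alpha$ that favors recent numbers; the input values are made up purely for illustration:

```python
def running_average(stream, alpha=None):
    """Incremental average of a stream of numbers.

    alpha=None uses alpha_k = 1/k, which reproduces the exact mean;
    a constant alpha weights recent numbers more heavily
    (and is biased by the avg = 0 initialization early on).
    """
    avg = 0.0
    for k, n in enumerate(stream, start=1):
        a = (1.0 / k) if alpha is None else alpha
        avg = (1 - a) * avg + a * n
    return avg

print(running_average([4, 8, 6]))             # 6.0, the exact mean
print(running_average([4, 8, 6], alpha=0.5))  # 5.5, tilted toward the latest numbers
```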

This is a handy trick, but what does it mean for $V$? We keep a running average of the samples: weight the previous estimate by $(1-\alpha)$ and the new sample by $\alpha$:

$$V^\pi(s) \leftarrow (1-\alpha)\,V^\pi(s) + \alpha\,(\text{sample})$$

This is often rewritten to show the “update” to the current value:

$$V^\pi(s) \leftarrow V^\pi(s) + \alpha\,( \text{sample} - V^\pi(s) )$$

$$V^\pi(s) \leftarrow V^\pi(s) + \alpha\,( R(s,a,s') + \gamma V^\pi(s') - V^\pi(s) )$$

Because we are looking at the difference between values at different points in time, this is called Temporal Difference (TD) learning. It is acceptable to weight newer samples more heavily because early samples are based on inaccurate, initial values of $V$. To guarantee convergence, we decrease $\alpha$ over time, often as a function of state visits: $\alpha_t = \frac{1}{n(s_t)}$.
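Putting these pieces together, here is a sketch of TD value learning for a fixed policy. The environment interface (`env.reset()` and `env.step(a)` returning `(s_next, reward, done)`) and the `policy` function are assumptions made for illustration:

```python
from collections import defaultdict

def td_value_learning(env, policy, episodes=1000, gamma=0.9):
    """Estimate V^pi from experience with TD(0), using alpha = 1/n(s)."""
    V = defaultdict(float)   # value estimates, initialized to 0
    n = defaultdict(int)     # visit counts per state
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            n[s] += 1
            alpha = 1.0 / n[s]
            # Nudge V(s) toward the one-step sample R + gamma * V(s').
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V
```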

This is great. It turns out that this can work really well for lots of problems, but it has one flaw: we still don’t get a policy out of it! The policy is the argmax of the Q-values, but computing those Q-values from $V$ still requires us to know the model ($T$ and $R$). That will be solved next.

So is temporal difference useless?

Not completely. Recall that there were multiple problems with Value Iteration. One was that it requires a model, but another is that if there are a very large number of states, it can take too long. TD learning can be used in problems where we DO know the model but the state space is too large to run Value Iteration: we sample from the model and act as if the model were unknown, and TD works just as if we were observing the real world.
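When the model is known, generating that experience amounts to sampling successor states from $T$; a minimal sketch, using the same assumed dictionary-based model as above:

```python
import random

def sample_transition(s, a, T, R):
    """Draw s' ~ T(s, a, .) from a known model and return (s', reward),
    so a TD learner can treat the model like a real environment."""
    next_states = list(T[(s, a)].keys())
    probs = [T[(s, a)][s2] for s2 in next_states]
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return s_next, R[(s, a, s_next)]
```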

What are examples of this? Games. TD learning was at the core of TD-Gammon, the first AI backgammon player to play at an expert level.


Q-Learning

But still, we’d like to be able to use this TD technique so that we actually get a policy out of it, not just the values. To extract a policy without a model, we apply the TD technique to the Q-function rather than the Value function. Recall that:

$$\pi^*(s) = \arg\max_a Q^*(s,a)$$

Accurate Q-values give us the policy directly.

The Q-Update Rule

We derive the update rule by substituting the relationship $V^*(s') = \max_{a'} Q^*(s', a')$ into our TD sample:

$$\text{sample} = R(s,a,s') + \gamma \max_{a'} Q(s', a')$$

Then, as before, we’ll take our samples and incorporate them into a running average:

$$\text{sample} = R(s,a,s') + \gamma \max_{a'} Q^{\pi}(s',a')$$

$$Q^{\pi}(s,a) \leftarrow (1-\alpha)\,Q^{\pi}(s,a) + \alpha\,(\text{sample})$$

The full Q-Learning update becomes:

$$Q^{\pi}(s,a) \leftarrow Q^{\pi}(s,a) + \alpha \left[ R(s,a,s') + \gamma \max_{a'} Q^{\pi}(s', a') - Q^{\pi}(s,a) \right]$$

Now we have a model-free way to learn the optimal policy. As we explore, we update our Q-values; better Q-values immediately give us a better policy, a better policy makes our exploration more effective, and more effective exploration makes the Q-values more accurate still.
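A sketch of the tabular Q-Learning loop. The environment interface is the same assumed one as before, and `behavior_policy` stands in for whatever exploration strategy we choose (discussed next):

```python
from collections import defaultdict

def q_learning(env, actions, behavior_policy, episodes=1000, gamma=0.9, alpha=0.1):
    """Tabular Q-Learning. The behavior policy only decides which actions
    to try; the update itself always backs up the max over next actions."""
    Q = defaultdict(float)  # Q[(s, a)], initialized to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = behavior_policy(Q, s)
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, a2)] for a2 in actions)
            # Nudge Q(s, a) toward the one-step sample.
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```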


The Exploration-Exploitation Trade-off

Q-learning is guaranteed to converge to the optimal policy as long as every state-action pair is tried infinitely often (and the learning rate $\alpha$ is decayed appropriately). That means we can run almost any policy while learning, as long as it eventually tries everything enough times. But in the meantime we have several problems:

  1. If we have a bad policy, we’re getting bad reward. We’re missing out!

  2. If the bad policy happens to omit several states, we won’t ever learn about them.

  3. Infinity is a long time, and we probably want to spend our time exploring areas of the space more likely to get us good stuff.

So, we’ll definitely want to update our policy as we go. But even so, we can have bad luck and end up with a policy that stays away from a part of the space that we need, and the environment might even be changing over time. What we would like to do is mostly act according to our current policy, but sometimes take exploratory actions.

This introduces the Exploration-Exploitation trade-off: exploitation means following the current best policy to collect reward, while exploration means trying other actions to gather information that could improve the policy.

Common strategies include acting $\epsilon$-greedily (follow the current policy most of the time, but take a random action with small probability $\epsilon$) and using exploration functions that treat rarely visited states optimistically.
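A sketch of the $\epsilon$-greedy idea as a behavior policy that plugs into the Q-Learning loop above; the function name and signature are just illustrative:

```python
import random

def make_epsilon_greedy(actions, epsilon=0.1):
    """Return a behavior policy that explores with probability epsilon
    and otherwise exploits the current Q-value estimates."""
    def policy(Q, s):
        if random.random() < epsilon:
            return random.choice(actions)             # explore
        return max(actions, key=lambda a: Q[(s, a)])  # exploit
    return policy
```

Decaying $\epsilon$ over time shifts the agent from exploration toward exploitation as its estimates improve.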