Prof. Dr. Sebastian Peitz
Chair of Safe Autonomous Systems, TU Dortmund
| Chapter | Topic | Content |
|---|---|---|
| Basics & tabular methods | ||
| 1 | Multi-armed bandits | Exploration-exploitation dilemma |
| 2 | Markov decision processes | Dynamics, rewards, policies |
| 3 | Dynamic programming | Optimal decision making with full knowledge |
| 4 | Monte Carlo methods | Data-driven learning from entire episodes |
| 5 | Temporal difference learning & \(Q\)-learning | Data-driven learning from individual transitions |
| Deep-learning-based methods | ||
| Model-Based Control | ||
| Advanced Topics |
The main aspects of our previous two categories of methods:
The concept of temporal difference (TD) learning is to combine both:
💡 We are still considering finite MDPs here.
Recall the incremental update of the type we discussed for multi-armed bandits or for MC control
\(\Rightarrow\) \(\mathsf{NewEstimate} \gets \mathsf{OldEstimate} + StepSize \; [ \target - \mathsf{OldEstimate} ]\): \[
\begin{equation}
V(s_t) \gets V(s_t) + \alpha \left[g_t - V(s_t)\right]. \label{eq:TD_MC-update}
\end{equation}
\]
\[ \begin{equation} V(s_t) \gets V(s_t) + \alpha \left[\textcolor{red}{r_t + \gamma V(s_{t+1})} - V(s_t)\right]. \label{eq:TD_TD0-update} \end{equation} \]
Parameters: Step size \(\alpha\in (0,1)\)
Initialize: \(V(s)\) arbitrarily for \(s \in \Sc\), \(V(\terminal) = 0\)
for \(k = 1, 2, \ldots, K\) episodes:
\(\quad\) for \(t = 0,1,\ldots,T\):
\(\quad\quad\) Sample action \(a_t \sim \pias\) and apply
\(\quad\quad\) Observe \(s_{t+1}\) and \(r_{t}\)
\(\quad\quad\) \(V(s_t) \gets V(s_t) + \alpha \left[r_t + \gamma V(s_{t+1}) - V(s_t)\right]\) (Eq. \(\eqref{eq:TD_TD0-update}\))
\(\quad\quad\) if \(s_{t+1}\) is terminal then \(\STOP\)
💡 The algorithm can directly be applied to the prediction of action-value functions.
Using the target, we can define the TD error\(^*\) \(\delta\) \(\Rightarrow\) the expression inside the bracket of Eq. \(\eqref{eq:TD_TD0-update}\) we’re using to improve our estimate: \[ \delta_t = \target - V(s_t) = r_t + \gamma V(s_{t+1}) - V(s_t). \]
Batch mode: Let’s assume that the TD(\(0\)) estimate of \(V(s)\) does not change within an episode, but that we apply all updates simulatenously once the episode is finished (exactly as we need to do in MC prediction): \[ \begin{align*} \underbrace{g_t - V(s_t)}_{\text{MC error}} &= r_{t}+\gamma g_{t+1} - V(s_t) \fragment{+\gamma V(s_{t+1}) - \gamma V(s_{t+1})}\\ &= \delta_t + \gamma (\underbrace{g_{t+1} - V(s_{t+1})}_{\text{MC error}}) \fragment{= \delta_t + \gamma \delta_{t+1} + \gamma^2(g_{t+2} - V(s_{t+2}))} \\ &= \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \gamma^3(g_{t+3} - V(s_{t+3})) \fragment{= \ldots = \sum_{k=t}^T\gamma^{k-t} \delta_k.} \end{align*} \]
\(^*\) For sufficiently small step size \(\alpha\), it can be shown that the TD error conveges to zero in batch mode (Sutton and Barto 1998).
Given a finite MDP and a fixed policy \(\pi\), the state-value estimate \(V\) of TD(\(0\)) converges to the true \(V^\pi\)
(that is, \(\lim_{t\rightarrow\infty} \delta_t = 0\))
Imagine you want to predict your time to get home after work, starting with an estimate of 30 minutes.
| State | Elapsed Time (minutes) | Predicted Time to Go | Predicted Total Time |
|---|---|---|---|
| leaving office, friday at 6 | 0 | 30 | 30 |
| reach car, raining | 5 | 35 | 40 |
| exiting highway | 20 | 15 | 35 |
| 2ndary road, behind truck | 30 | 10 | 40 |
| entering home street | 40 | 3 | 43 |
| arrive home | 43 | 0 | 43 |
MC vs. TD estimate.
Assume that we have an MDP with only two states: \(A\) and \(B\). We do not consider discounting (i.e., \(\gamma=1\)).
We collect eight samples from the dynamics, and we would like to estimate the values \(V(A)\) and \(V(B)\):
| Episode \(k\) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Sequence (\(s,r,s,r\)) | A, 0, B, 0 | B, 1 | B, 1 | B, 1 | B, 1 | B, 1 | B, 1 | B, 0 |
Question: What’s \(V(A)\) and \(V(B)\) based on
Before approaching this, let’s try to model the underlying MDP!
Answer 1. is the MC estimate, answer 2. is the TD estimate!
Let’s formally calculate the MC and TD updates:
\[ \begin{align*} V(s_t) &\gets V(s_t) + \alpha \left[g_t - V(s_t)\right] &&\text{(MC)} \\ V(s_t) &\gets V(s_t) + \alpha \left[r_t + \gamma V(s_{t+1}) - V(s_t)\right] \qquad\qquad &&\text{(TD)} \end{align*} \]
Convergence: no change in the estimate, i.e., \[ \begin{align*} \alpha \left[g_t - V(s_t)\right] = g_t - V(s_t) &= 0 &&\text{(MC)} \\ \alpha \left[r_t + \gamma V(s_{t+1}) - V(s_t)\right] = r_t + \gamma V(s_{t+1}) - V(s_t) &= 0 \qquad &&\text{(TD)} \end{align*} \]
Consider a batch sweep over the the \(K\) episodes: \[ \begin{align*} \sum_{k=1}^K g_{t,k} - V(s_{t,k}) &= 0 &&\text{(MC)} \\ \sum_{k=1}^K r_{t,k} + \gamma V(s_{t+1,k}) - V(s_{t,k}) &= 0 \qquad &&\text{(TD)} \end{align*} \]
Apply the previous equations first to state \(\mathbf B\). Since \(B\) is a terminal state, \(V(s_{t+1}) = 0\) and \(g_{t,k} = r_{t,k}\) apply, i.e., the MC and TD updates are identical for \(B\): \[ \begin{align*} &\text{MC}\big|_{s=B}~\text{:}\qquad \sum_{k=1}^K g_{t,k} - V(B) = 0 \qquad &\Leftrightarrow \qquad V(B)=\frac{1}{K} \sum_{k=1}^K r_{t,k} = \frac{6}{8} = \frac{3}{4}, \\ &\text{TD}\big|_{s=B}~\text{:}\qquad \sum_{k=1}^K r_{t,k} - V(B) = 0 \qquad &\Leftrightarrow \qquad V(B)=\frac{1}{K} \sum_{k=1}^K r_{t,k} = \frac{6}{8} = \frac{3}{4}. \end{align*} \]
Now consider state \(\mathbf A\), where we have only one trajectory:
\[ \begin{align*} &\text{MC}\big|_{s=A}~\text{:}\qquad \sum_{k=1}^K g_{t,k} - V(A) = 0 \qquad &&\Leftrightarrow \qquad V(A)= 0, \\ &\text{TD}\big|_{s=A}~\text{:}\qquad \sum_{k=1}^K r_{t,k} + \gamma V(B) - V(A) =\sum_{k=1}^K \gamma V(B) - V(A) = 0 \qquad &&\Leftrightarrow \qquad V(A)= \gamma V(B) = \frac{3}{4}. \end{align*} \]
Where does this difference come from? \(\Rightarrow\) Without going into a detailed derivation:
\(\Rightarrow\) TD assumes an MDP problem structure and is absolutely certain that its internal model concept describes the real world perfectly (so-called certainty equivalence).
Consider a simple Markov reward process (MRP; no actions), for which we want to estimate \(V(s)\) from experience only.
The example is taken from (Sutton and Barto 1998, Ex. 6.2)
Why is MC learning faster? Wouldn’t we expect TD(\(0\)) to be stronger?
\(\Rightarrow\) We update entire trajectories, not just the last field we saw before the target.
\(\Rightarrow\) Also, we started with a poor initializaiton for \(V\).
The GPI concept can directly be applied to the TD framework by replacing values with action-values, where we alternate between estimation (i.e., prediction) and improvement: \[ \pi_0 \stackrel{E}{\longrightarrow} Q^{\pi_0} \stackrel{I}{\longrightarrow} \pi_1 \stackrel{E}{\longrightarrow} Q^{\pi_1} \ldots \stackrel{I}{\longrightarrow} \pi^* \stackrel{E}{\longrightarrow} Q^*=Q^{\pi^*}. \]
\[ \begin{equation} Q(s_t,a_t) \gets Q(s_t,a_t) + \alpha \underbrace{\big[\overbrace{r_t + \gamma Q(s_{t+1}, a_{t+1})}^{\mathsf{TD~target}~y} - Q(s_t, a_t)\big]}_{\mathsf{TD~error}~\delta}. \label{eq:TD_TD0-update-Q} \end{equation} \]
The policy: \(\epsilon\)-greedy / \(\epsilon\)-soft\(^*\).
Convergence (Sutton and Barto 1998)
Key question: how do we select \(a_{t+1}\)? \(\Rightarrow\) On-policy versus off policy!
\(^*\) $ π ( a | s )> 0$ for all \(s\in\Sc\) and \(a\in\Ac\).
\(s,a,r,s',a'\): State-Action-Reward-Next State-Next Action
initialize
for \(k = 1, 2, \ldots, K\) episodes:
\(\quad\) Initialize \(s_t \gets s_0\), \(t \gets 0\)
\(\quad\) while \(s_t\) is not terminal:
\(\quad\quad\) Take action \(a_t \sim \pi(s_t)\) and observe \((r_t,s_{t+1})\) \(\qquad\qquad\qquad~\)
\(\quad\quad\) Select \(a_{t+1} \sim \pi(s_{t+1})\)
\(\quad\quad\) Update \(Q\) given \((s_t,a_t,r_t,s_{t+1},a_{t+1})\): \(\quad\) \[Q(s_t,a_t) \gets Q(s_t,a_t) + \alpha \left[r_t + \gamma Q(s_{t+1},a_{t+1})- Q(s_t,a_t)\right]\] \(\quad\quad\) Update policy: \(\pi = \epsilon\)-greedy\((Q)\) \(\qquad\) (This is on-policy!)
\(\quad\quad\) \(t \gets t+1\)
Convergence: SARSA for finite-state and finite-action MDPs converges to the optimal action-value, \[Q(s, a) \to Q^*(s, a),\] under the following conditions:
initialize
for \(k = 1, 2, \ldots, K\) episodes:
\(\quad\) Initialize \(s_t \gets s_0\), \(t \gets 0\)
\(\quad\) while \(s_t\) is not terminal:
\(\quad\quad\) Take action \(a_t \sim \pi(s_t)\) and observe \((r_t,s_{t+1})\)
\(\quad\quad\) Select \(\cancel{a_{t+1} \sim \pi(s_{t+1})}\)\(\qquad\qquad\qquad\qquad\) (Not needed in off-policy)
\(\quad\quad\) Update \(Q\) given \((s_t,a_t,r_t,s_{t+1})\): \(\quad\) \[Q(s_t,a_t) \gets Q(s_t,a_t) + \alpha \left[r_t + \gamma \textcolor{red}{\max_a Q(s_{t+1},a)}- Q(s_t,a_t)\right]\] \(\quad\quad\) Update policy \(\pi = \epsilon\)-greedy\((Q)\)
\(\quad\quad\) \(t \gets t+1\)
Outcome:
In our setting, \(Q\)-learning is worse than SARSA. Why do you think that is?
\(\Rightarrow\) the combination of falling off randomly and the resulting very large negative reward.
The example is taken from (Sutton and Barto 1998, Ex. 6.6)
What can we do to reduce the large variance in the SARSA update \[Q(s_t,a_t) \gets Q(s_t,a_t) + \alpha \left[r_t + \gamma \textcolor{red}{Q(s_{t+1},a_{t+1})}- Q(s_t,a_t)\right]?\]
Let’s take the expectation of Q \(\rightarrow\) expected SARSA: \[ \begin{align*} Q(s_t,a_t) &\gets Q(s_t,a_t) + \alpha \left[r_t + \gamma \textcolor{blue}{\ExpCsub{Q(s_{t+1},a)}{s_t}{\pi}}- Q(s_t,a_t)\right] \\ &= Q(s_t,a_t) + \alpha \left[r_t + \gamma \left(\sum_{a\in\Ac} \policy{a}{s_{t+1}}Q(s_{t+1},a)\right)- Q(s_t,a_t)\right]. \end{align*} \]
For comparison: \(Q\)-learning \[Q(s_t,a_t) \gets Q(s_t,a_t) + \alpha \left[r_t + \gamma \mathbf{\max_{a\in\Ac} Q(s_{t+1},a)}- Q(s_t,a_t)\right]\]
All control algorithms discussed so far involve maximization operations:
This can lead to a significant positive bias:
Split the learning process:
Assign specific tasks to each estimate:
initialize
for \(k = 1, 2, \ldots, K\) episodes:
\(\quad\) Initialize \(s_t \gets s_0\), \(t \gets 0\)
\(\quad\) while \(s_t\) is not terminal:
\(\quad\quad\) Take action \(a_t\) using an \(\epsilon\)-greedy policy on \(Q_1 + Q_2\) and observe \((r_t,s_{t+1})\) \(\qquad\qquad\qquad\qquad\qquad~\)
\(\quad\quad\) if \(\mathsf{rand()} > 0.5\) then \(\quad\) \[Q_1(s_t,a_t) \gets Q_1(s_t,a_t) + \alpha \left[r_t + \gamma Q_2(s_{t+1},\arg\max_{a} Q_1(s_{t+1},a))- Q_1(s_t,a_t)\right]\] \(\quad\quad\) else: \(\quad\) \[Q_2(s_t,a_t) \gets Q_2(s_t,a_t) + \alpha \left[r_t + \gamma Q_1(s_{t+1},\arg\max_{a} Q_2(s_{t+1},a))- Q_2(s_t,a_t)\right]\] \(\quad\quad\) \(t \gets t+1\)
Recap the update targets for the incremental prediction methods:
The generalized \(n\)-step target is defined as: \[ g_{t:t+n} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} V_{t+n-1}(s_{t+n}) . \]
The state-value estimate using the \(n\)-step return approximation is: \[ V_{t+n}(s_t) = V_{t+n-1}(s_t) + \alpha \left[ g_{t:t+n} - V_{t+n-1}(s_{t}) \right], \qquad 0 \leq t \leq T. \]
input. a policy \(\pi\) to be evaluated, parameter: step size \(\alpha \in (0, 1]\), prediction steps \(n \in \Z^+\).
Initialize: \(V(s)\) arbitrarily for \(s \in \Sc\), \(V(\terminal) = 0\).
for \(k = 1, 2, \ldots, K\) episodes:
\(\quad\) initialize and store \(s_0\)
\(\quad\) \(T \leftarrow \infty\), \(\tau \leftarrow 0\)
\(\quad\) for \(t=1,2,3,\ldots\)
\(\quad\quad\) if \(t<T\) then
\(\quad\quad\quad\) take action \(a_t \sim \policy{\cdot}{s_t}\), observe and store \(s_{t+1}\) and \(r_{t+1}\)
\(\quad\quad\quad\) if \(s_{t+1}\) is \(\terminal\) then \(T \leftarrow k + 1\)
\(\quad\quad\) \(\tau\leftarrow t-n-1\) (\(\tau\) is the time index for the estimate update)
\(\quad\quad\) if \(\tau\geq 0\) then
\(\quad\quad\quad\) \(g \leftarrow \sum_{k=\tau+1}^{\min(\tau + n,T)} \gamma^{k-\tau-1} r_k\)
\(\quad\quad\quad\) if \(\tau + n < T\) then \(g \leftarrow g + \gamma^n V(s_{\tau+n})\)
\(\quad\quad\quad\) \(V(s_\tau) \leftarrow V(s_\tau) + \alpha \left[g - V(s_{\tau})\right]\)
\(\quad\quad\) if \(\tau = T − 1\) then \(\STOP\)
Analog to \(n\)-step TD, the state-action value target (with \(a_{t+n}\sim\policy{\cdot}{s_{t+n}}\)) is rewritten as: \[ \begin{equation} g_{t:t+n} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots + \gamma^{n-1} r_{t+n-1} + \textcolor{red}{\gamma^{n} Q_{t+n-1}(s_{t+n},a_{t+n})}. \label{eq:TD_nstep-onpolicy-return} \end{equation} \]
For \(n\)-step expected SARSA, the update is similar but we’re considering the expected value at \(t+n\): \[ \begin{equation} g_{t:t+n} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots + \gamma^{n-1} r_{t+n-1} + \textcolor{red}{\gamma^{n} \sum_{a\in\Ac} \policy{a}{s_{t+1}} Q_{t+n-1}(s_{t+n},a)}. \label{eq:TD_nstep-expected-return} \end{equation} \]
💡 Again, if an episode terminates within the lookahead horizon (\(t + n \geq T\)) the target is equal to the Monte Carlo update: \(g_{t:t+n} = g_t\).
Finally, the modified n-step targets can be directly integrated to the state-action value estimate update rule of SARSA:
\[ Q_{t+n}(s_t, a_t) = Q_{t+n-1}(s_t, a_t) + \alpha \left[ g_{t:t+n} - Q_{t+n-1}(s_{t}, a_t) \right], \qquad 0 \leq t \leq T, \]
with \(g_{t:t+n}\) according to \(\eqref{eq:TD_nstep-onpolicy-return}\) for the on-policy, and \(\eqref{eq:TD_nstep-expected-return}\) for the expected version of SARSA.
For the off-policy version of \(n\)-step SARSA, we need to adapt the importance sampling weights we have seen in MC off-policy control to the \(n\)-step case, i.e., for various prediction lengths \(h\in\{t+1,\ldots,t+n-1\}\): \[ \rho_{t:h} = \prod_{k=t}^{\min(t+h, T)}\frac{\policy{a_k}{s_k}}{b\agivenb{a_k}{s_k}}. \] For more details, see (Sutton and Barto 1998, Chapter 7.3).
General idea: