Prof. Dr. Sebastian Peitz
Chair of Safe Autonomous Systems, TU Dortmund
The main aspects of our previous two categories of methods:
The concept of temporal difference (TD) learning is to combine both:
💡 We are still considering finite MDPs here.
Recall the incremental update of the type we discussed for multi-armed bandits or for MC control
\(\Rightarrow\) \(\mathsf{NewEstimate} = \mathsf{OldEstimate} + \mathsf{StepSize} \; [ \mathsf{Target} - \mathsf{OldEstimate} ]\): \[
\begin{equation}
V(s_t) = V(s_t) + \alpha \left[g_t - V(s_t)\right]. \label{eq:TD_MC-update}
\end{equation}
\]
\[ \begin{equation} V(s_t) = V(s_t) + \alpha \left[\textcolor{red}{r_t + \gamma V(s_{t+1})} - V(s_t)\right]. \label{eq:TD_TD0-update} \end{equation} \]
Parameters: Step size \(\alpha\in (0,1)\)
Initialize: \(V(s)\) arbitrarily for \(s \in \Sc\), \(V(\mathsf{terminal}) = 0\)
for \(k = 1, 2, \ldots, K\) episodes:
\(\quad\) for \(t = 0,1,\ldots,T\):
\(\quad\quad\) Sample action \(a_t \sim \pias\) and apply it
\(\quad\quad\) Observe \(s_{t+1}\) and \(r_{t}\)
\(\quad\quad\) \(V(s_t) = V(s_t) + \alpha \left[r_t + \gamma V(s_{t+1}) - V(s_t)\right]\) (Eq. \(\eqref{eq:TD_TD0-update}\))
\(\quad\quad\) if \(s_{t+1}\) is terminal then STOP
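The TD(0) prediction loop above can be sketched in a few lines of Python. The 1-D random-walk environment, the uniform policy, and all parameter values below are illustrative assumptions, not part of the lecture; reward 1 is given on the right exit, 0 on the left.

```python
import random

def td0_prediction(n_states=5, episodes=2000, alpha=0.1, gamma=1.0, seed=0):
    """TD(0) prediction for a simple 1-D random walk (illustrative MDP).

    Non-terminal states are 0..n_states-1; the fixed policy steps left or
    right uniformly. Exiting on the right yields reward 1, on the left
    reward 0; both exits are terminal with V(terminal) = 0.
    """
    rng = random.Random(seed)
    V = [0.0] * n_states
    for _ in range(episodes):
        s = n_states // 2                        # start in the middle
        while True:
            s_next = s + rng.choice([-1, 1])
            if s_next < 0:                       # left terminal exit
                r, v_next, done = 0.0, 0.0, True
            elif s_next >= n_states:             # right terminal exit
                r, v_next, done = 1.0, 0.0, True
            else:
                r, v_next, done = 0.0, V[s_next], False
            # TD(0) update: V(s) <- V(s) + alpha [r + gamma V(s') - V(s)]
            V[s] += alpha * (r + gamma * v_next - V[s])
            if done:
                break
            s = s_next
    return V
```

For this walk the true values are \((i+1)/(n+1)\), so the learned estimates should increase from left to right and be close to \(0.5\) in the middle.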
💡 The algorithm can directly be applied to the prediction of action-value functions.
Using the target, we can define the TD error\(^*\) \(\delta\) \(\Rightarrow\) the expression inside the bracket of Eq. \(\eqref{eq:TD_TD0-update}\) we’re using to improve our estimate: \[ \delta_t = \mathsf{target} - V(s_t) = r_t + \gamma V(s_{t+1}) - V(s_t). \]
Batch mode: Let’s assume that the TD(0) estimate of \(V(s)\) does not change within an episode, but that we apply all updates simultaneously once the episode is finished (exactly as we need to do in MC prediction): \[ \begin{align*} \underbrace{g_t - V(s_t)}_{\text{MC error}} &= r_{t}+\gamma g_{t+1} - V(s_t) \fragment{+\gamma V(s_{t+1}) - \gamma V(s_{t+1})}\\ &= \delta_t + \gamma (\underbrace{g_{t+1} - V(s_{t+1})}_{\text{MC error}}) \fragment{= \delta_t + \gamma \delta_{t+1} + \gamma^2(g_{t+2} - V(s_{t+2}))} \\ &= \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \gamma^3(g_{t+3} - V(s_{t+3})) \fragment{= \ldots = \sum_{k=t}^T\gamma^{k-t} \delta_k.} \end{align*} \]
\(^*\) For sufficiently small step size \(\alpha\), it can be shown that the TD error converges to zero in batch mode (Sutton and Barto 1998).
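The identity derived above can be verified numerically. The sketch below holds \(V\) fixed during the episode and checks that the MC error equals the discounted sum of TD errors; the random rewards and values (and the terminal value \(V(s_{T+1}) = 0\)) are arbitrary illustrative choices.

```python
import random

def check_td_error_identity(T=6, gamma=0.9, seed=1):
    """Numerically verify: with V held fixed within the episode,
        g_t - V(s_t) = sum_{k=t}^{T} gamma^{k-t} delta_k
    for all t. Rewards and values are arbitrary (illustrative only).
    """
    rng = random.Random(seed)
    r = [rng.uniform(-1, 1) for _ in range(T + 1)]          # r_0 .. r_T
    V = [rng.uniform(-1, 1) for _ in range(T + 1)] + [0.0]  # V(terminal) = 0
    # returns g_t = r_t + gamma * g_{t+1}, with g_{T+1} = 0
    g = [0.0] * (T + 2)
    for t in range(T, -1, -1):
        g[t] = r[t] + gamma * g[t + 1]
    # TD errors delta_t = r_t + gamma V(s_{t+1}) - V(s_t)
    delta = [r[t] + gamma * V[t + 1] - V[t] for t in range(T + 1)]
    for t in range(T + 1):
        lhs = g[t] - V[t]
        rhs = sum(gamma ** (k - t) * delta[k] for k in range(t, T + 1))
        assert abs(lhs - rhs) < 1e-10
    return True
```

The check passes because the value terms telescope, exactly as in the derivation above.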
Imagine you want to predict your time to get home after work, starting with an estimate of 30 minutes.
| State | Elapsed Time (minutes) | Predicted Time to Go | Predicted Total Time |
|---|---|---|---|
| leaving office, friday at 6 | 0 | 30 | 30 |
| reach car, raining | 5 | 35 | 40 |
| exiting highway | 20 | 15 | 35 |
| 2ndary road, behind truck | 30 | 10 | 40 |
| entering home street | 40 | 3 | 43 |
| arrive home | 43 | 0 | 43 |
MC vs. TD estimate
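For the driving-home example (with \(\gamma = 1\) and the "Predicted Total Time" column as the estimates), the two error signals can be computed directly from the table. MC moves every estimate toward the final outcome of 43 minutes, while TD(0) moves each estimate toward the next prediction; the snippet only restates the table, nothing else is assumed.

```python
# Predicted total times along the drive home (from the table above);
# the last entry is the actual outcome at arrival.
preds = [30, 40, 35, 40, 43, 43]

# MC error: every estimate is moved toward the final outcome (43).
mc_errors = [preds[-1] - p for p in preds[:-1]]

# TD(0) error with gamma = 1: each estimate is moved toward the
# *next* prediction, which is available immediately after each step.
td_errors = [preds[t + 1] - preds[t] for t in range(len(preds) - 1)]

# Each MC error is the sum of the subsequent TD errors (telescoping).
assert all(mc_errors[t] == sum(td_errors[t:]) for t in range(len(mc_errors)))
```

This illustrates the batch-mode identity from the previous slide on concrete numbers: TD can apply each correction during the drive, while MC must wait until arrival.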
A learning policy \(\pi\) is called GLIE if it satisfies the following two properties:
If a state is visited infinitely often, then each action is chosen infinitely often in that state: \[ \lim_{i \to \infty} N_i(s,a) = \infty \quad \forall s \in \Sc, a \in \Ac, \] where \(N_i(s,a)\) denotes the number of times action \(a\) has been selected in state \(s\) up to episode \(i\).
In the limit (\(i \to \infty\)) the learning policy is greedy with respect to the learned action value: \[ \lim_{i \to \infty} \pi_i(a|s) = 1 \quad \text{for } a = \arg \max_{a' \in \Ac} Q(s,a') \quad \forall s \in \Sc \]
MC-based control using \(\epsilon\)-greedy exploration is GLIE if \(\epsilon\) is decreased at the rate \[ \epsilon_i = \frac{1}{i}, \] with \(i\) being the increasing episode index. In this case, \[ \hat Q(s,a) \to Q^*(s,a) \] follows.
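A minimal sketch of why the decay \(\epsilon_i = 1/i\) yields the second GLIE property: as \(\epsilon \to 0\), the \(\epsilon\)-greedy probability of the greedy action tends to 1. The helper function and the example Q-values are illustrative assumptions.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over q_values:
    each action gets epsilon/|A|, and the greedy action additionally
    gets the remaining mass 1 - epsilon."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs

# With epsilon_i = 1/i, the probability of the greedy action (index 1
# here) approaches 1 as the episode index i grows.
q = [0.1, 0.5, 0.2]
greedy_probs = [epsilon_greedy_probs(q, 1.0 / i)[1] for i in (1, 10, 1000)]
```

Exploration never fully stops for finite \(i\) (each action keeps probability \(\epsilon_i/|\Ac|\)), which is exactly what the first GLIE property requires.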
initialize
for \(j = 1, 2, \ldots, J\) episodes:
\(\quad\) Initialize \(s_t \gets s_0\), \(t \gets 0\)
\(\quad\) while \(s_t\) is not terminal:
\(\quad\quad\) Take action \(a_t \sim \pi(s_t)\) and observe \((r_t,s_{t+1})\)
\(\quad\quad\) Select \(a_{t+1} \sim \pi(s_{t+1})\)
\(\quad\quad\) Update \(Q\) given \((s_t,a_t,r_t,s_{t+1},a_{t+1})\): \[Q(s_t,a_t) \gets Q(s_t,a_t) + \alpha \left[r_t + \gamma Q(s_{t+1},a_{t+1})- Q(s_t,a_t)\right]\]
\(\quad\quad\) Update policy \(\pi = \epsilon\)-greedy\((Q)\)
\(\quad\quad\) \(t \gets t+1\)
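The SARSA loop above can be sketched as follows. The environment interface `env_step(s, a, rng)` and the toy chain MDP are illustrative assumptions; only the update rule itself is taken from the pseudocode.

```python
import random

def sarsa(env_step, n_states, n_actions, episodes=1000,
          alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """On-policy SARSA with a fixed epsilon-greedy behaviour policy.

    `env_step(s, a, rng)` must return (reward, next_state, done); it is
    a placeholder for whichever finite MDP is studied.
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def policy(s):  # epsilon-greedy w.r.t. the current Q
        if rng.random() < epsilon:
            return rng.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[s][a])

    for _ in range(episodes):
        s = 0
        a = policy(s)
        done = False
        while not done:
            r, s_next, done = env_step(s, a, rng)
            a_next = policy(s_next) if not done else 0
            # SARSA update uses the action actually selected next
            target = r if done else r + gamma * Q[s_next][a_next]
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
    return Q

# Toy chain MDP (illustrative): states 0..3, action 1 moves right,
# action 0 moves left (floor at 0); reaching state 3 is terminal
# and yields reward 1, all other transitions yield reward 0.
def chain_step(s, a, rng):
    s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return (1.0, s_next, True) if s_next == 3 else (0.0, s_next, False)

Q = sarsa(chain_step, n_states=4, n_actions=2)
```

After training, the greedy action in every non-terminal state of the chain should be "right", since moving left only delays the discounted reward.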
[Based on Marius Lindauer’s lecture]
SARSA for finite-state and finite-action MDPs converges to the optimal action-value function, \(Q(s, a) \to Q^*(s, a)\), under the following conditions:
The sequence of policies \(\pi_t\) is GLIE.
The step sizes \(\alpha_t\) satisfy the Robbins–Monro conditions \[ \sum_{t=1}^\infty \alpha_t = \infty \quad \text{and} \quad \sum_{t=1}^\infty \alpha_t^2 < \infty. \]
For example, \(\alpha_t = \frac{1}{t}\) satisfies the above conditions.
initialize
for \(j = 1, 2, \ldots, J\) episodes:
\(\quad\) Initialize \(s_t \gets s_0\), \(t \gets 0\)
\(\quad\) while \(s_t\) is not terminal:
\(\quad\quad\) Take action \(a_t \sim \pi(s_t)\) and observe \((r_t,s_{t+1})\)
\(\quad\quad\) Update \(Q\) given \((s_t,a_t,r_t,s_{t+1})\): \[Q(s_t,a_t) \gets Q(s_t,a_t) + \alpha \left[r_t + \gamma \textcolor{red}{\max_a Q(s_{t+1},a)}- Q(s_t,a_t)\right]\]
\(\quad\quad\) Update policy \(\pi = \epsilon\)-greedy\((Q)\)
\(\quad\quad\) \(t \gets t+1\)
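Q-learning differs from SARSA only in the bootstrapping term (the red \(\max_a Q(s_{t+1},a)\)): the behaviour policy still explores \(\epsilon\)-greedily, but the update targets the greedy value, making the method off-policy. A self-contained sketch, reusing the same illustrative chain MDP as in the SARSA example:

```python
import random

def q_learning(env_step, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Off-policy Q-learning: act epsilon-greedily, but bootstrap with
    max_a Q(s', a). `env_step(s, a, rng)` returns (reward, next_state,
    done) and is a placeholder for whichever finite MDP is studied.
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy behaviour policy
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: Q[s][b])
            r, s_next, done = env_step(s, a, rng)
            # bootstrap with the greedy value, not the next action taken
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

# Same toy chain MDP as in the SARSA sketch (illustrative): action 1
# moves right, action 0 moves left; state 3 is terminal with reward 1.
def chain_step(s, a, rng):
    s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return (1.0, s_next, True) if s_next == 3 else (0.0, s_next, False)

Q = q_learning(chain_step, n_states=4, n_actions=2)
```

Note that no next action \(a_{t+1}\) needs to be sampled before the update; the behaviour action for the next step is simply chosen at the start of the next loop iteration.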