Prof. Dr. Sebastian Peitz
Chair of Safe Autonomous Systems, TU Dortmund
| Chapter | Topic | Content |
|---|---|---|
| Basics & tabular methods | ||
| 1-5 | Bandits, MDPs, Dynamic Programming, Monte Carlo, TD Learning | RL basics in finite dimensions |
| Deep-learning-based methods | ||
| 6 | Brief introduction to deep learning | The basics for what comes next |
| 7 | Value function approximation | Value estimation with function approximation |
| 8 | Deep \(Q\)-learning | \(Q\)-learning with neural networks |
| 9 | Policy gradients | |
| 10 | Actor-critic algorithms | |
| 11 | Advanced algorithms (Part I): From policy gradient to PPO | |
| 12 | Advanced algorithms (Part II): From \(Q\)-learning to Soft Actor-Critic | |
| Model-Based Control | ||
| Advanced Topics |
initialize
for \(k = 1, 2, \ldots, K\) episodes:
\(\quad\) Initialize \(s_t \gets s_0\), \(t \gets 0\)
\(\quad\) while \(s_t\) is not terminal:
\(\quad\quad\) Take action \(a_t \sim \pi(s_t)\) and observe \((r_t,s_{t+1})\)
\(\quad\quad\) Select \(a_{t+1} \sim \pi(s_{t+1})\)
\(\quad\quad\) Update \(Q\) given \((s_t,a_t,r_t,s_{t+1},a_{t+1})\): \(\quad\) \[Q(s_t,a_t) \gets Q(s_t,a_t) + \alpha \left[r_t + \gamma \textcolor{blue}{Q(s_{t+1},a_{t+1})}- Q(s_t,a_t)\right]\] \(\quad\quad\) Update policy: \(\pi = \epsilon\)-greedy\((Q)\)
\(\quad\quad\) \(t \gets t+1\)
initialize
for \(k = 1, 2, \ldots, K\) episodes:
\(\quad\) Initialize \(s_t \gets s_0\), \(t \gets 0\)
\(\quad\) while \(s_t\) is not terminal:
\(\quad\quad\) Take action \(a_t \sim \pi(s_t)\) and observe \((r_t,s_{t+1})\)
\(\quad\quad\) Select \(\cancel{a_{t+1} \sim \pi(s_{t+1})}\)
\(\quad\quad\) Update \(Q\) given \((s_t,a_t,r_t,s_{t+1})\): \(\quad\) \[Q(s_t,a_t) \gets Q(s_t,a_t) + \alpha \left[r_t + \gamma \textcolor{red}{\max_a Q(s_{t+1},a)}- Q(s_t,a_t)\right]\] \(\quad\quad\) Update policy \(\pi = \epsilon\)-greedy\((Q)\)
\(\quad\quad\) \(t \gets t+1\)
Today’s focus: value-based control \(\Rightarrow\) How to transfer tabular methods to the function approximation case.
The transfer from value function approximation is straightforward!
Input:
Initialize: Value function weights \(\theta\in\R^d\) arbitrarily
for \(k = 1, 2, \ldots, K\) episodes:
\(\quad\) Generate a sequence following \(\pi\) (or \(\epsilon\)-greedy\((Q_\theta)\)): \[((s_0,a_0,r_0),(s_1,a_1,r_1),\ldots,(s_{T_k-1},a_{T_k-1},r_{T_k-1}))\] \(\quad\) Calculate the every-visit returns \(g_t\)
\(\quad\) for \(t = 0,1,\ldots,T-1\):
\(\quad\quad\) \(\theta \gets \theta + \alpha\rbracket{g_t - Q_\theta(s_t,a_t)} \nablatheta Q_\theta(s_t,a_t)\)
Input:
Initialize: Value function weights \(\theta\in\R^d\) arbitrarily
for \(k = 1, 2, \ldots, K\) episodes:
\(\quad\) Initialize \(s_0\)
\(\quad\) for \(t = 0,1,\ldots,T-1\):
\(\quad\quad\) Obtain \(\textcolor{red}{a_t \sim \policy{\cdot}{s_t}}\) (or \(a_t \sim \epsilon\)-greedy\((Q_\theta(s_t,\cdot))\))
\(\quad\quad\) Observe \(r_t\) and \(s_{t+1}\)
\(\quad\quad\) if \(s_{t+1}\) is \(\terminal\) then
\(\quad\quad\quad\) \(\theta \gets \theta + \alpha\rbracket{r_t - Q_\theta(s_t,a_t)} \nablatheta Q_\theta(s_t,a_t)\)
\(\quad\quad\quad\) Go to next episode \(k+1\)
\(\quad\quad\) Choose \(a_{t+1} \sim \policy{\cdot}{s_{t+1}}\) (or \(a_{t+1} \sim \epsilon\)-greedy\((Q_\theta(s_{t+1},\cdot))\))
\(\quad\quad\) \(\theta \gets \theta + \alpha\rbracket{r_t + \gamma Q_\theta(s_{t+1},a_{t+1}) - Q_\theta(s_t,a_t)} \nablatheta Q_\theta(s_t,a_t)\)
Approximation: Linear model with tile coding features \(x_i(s,a)\), where \[ \begin{align*} \psi(s,a)&=\sum_{i=1}^d \theta_i x_i(s,a), \\ x_i(s,a)&=\begin{cases} 1 & (s,a)\in\text{ tile \#}i \\ 0 & \text{otherwise} \end{cases}. \end{align*} \]
But with two important distinctions
The motivation behind this
Input:
Initialize: Q function weights \(\theta = \theta' \in\R^d\) arbitrarily, \(\Dc=\{\}\)
for \(k = 1, 2, \ldots, K\) episodes:
\(\quad\) Initialize \(s_0\)
\(\quad\) for \(t = 0,1,\ldots,T-1\):
\(\quad\quad\) Obtain \(a_t \sim \epsilon\)-greedy\((Q_\theta(s_t,\cdot))\) and observe \(r_t\) and \(s_{t+1}\)
\(\quad\quad\) Store tuple \((s_t,a_t,r_t,s_{t+1})\) in \(\Dc\)
\(\quad\quad\) Sample minibatch \(\Bc\) from \(\Dc\) (after initial warm-up)\(\qquad\qquad\qquad\)
\(\quad\quad\) Compute targets for \((s_i,a_i,r_i,s'_i)\in\Bc\) \[y_i = r_i + \max_{a\in\Ac}Q_{\theta'}(s_i',a) \qquad\text{(or}~y_i = r_i~\text{if $s'_{i}$ is terminal)}\] \(\quad\quad\) Update \(\theta\) via gradient descent on \(L(\theta)\) with learning rate \(\alpha\)
\(\quad\quad\) if \(t\mod{N_\theta}=0\) then \(\theta'\gets\theta\)
Remarks:

100,000 steps
300,000 steps
500,000 steps
800,000 steps
1,000,000 steps
1,500,000 steps
2,000,000 steps
💡 Polyak averaging originates from optimization theory where in SGD, we update \(\theta\) by averaging over recent iterates \(\iterate{\theta}{t}, \iterate{\theta}{t-1}, \iterate{\theta}{t-2}, \ldots\).
Remember the maximization bias and double \(Q\)-learning?
\(\Rightarrow\) Don’t use same network to choose the action and estimate the value!
Double deep \(Q\)-learning:


The concept of TD(\(n\)) can be extended to deep \(Q\)-learning in a straightforward fashion: \[ y_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} \max_{a\in\Ac} Q_{\theta}(s_{t+n},a). \]
\(\textcolor{green}{\mathbf{+}\text{ less biased target values when estimates using }Q_\theta\text{ are inaccurate}}\)
\(\textcolor{green}{\mathbf{+}\text{ typically faster learning, especially early on}}\)
\(\textcolor{red}{\mathbf{-}\text{ only actually correct when learning on policy}}\)
What’s the problem with continuous actions in DQN?
\(\Rightarrow\) if \(\Ac\) is continuous (i.e., \(a\in\R^m\)), then the maximization becomes a challenging optimization probelm in itself!
What can we do?
Following Sergey Levine’s CS285 lecture:
(Schaul et al. 2016)General idea:
Some remarks:
Example: Pendulum on a cart (Abdelwanis et al. 2026)
