Prof. Dr. Sebastian Peitz
Chair of Safe Autonomous Systems, TU Dortmund
| Chapter | Topic | Content |
|---|---|---|
| Basics & tabular methods | ||
| 1-5 | Bandits, MDPs, Dynamic Programming, Monte Carlo, TD Learning | RL basics in finite dimensions |
| Deep-learning-based methods | ||
| 6 | Brief introduction to deep learning | The basics for what comes next |
| 7 | Value function approximation | Value estimation with function approximation |
| 8 | Deep \(Q\)-learning | |
| 9 | Policy gradients | |
| 10 | Actor-critic algorithms | |
| 11 | Advanced algorithms (Part I): From policy gradient to PPO | |
| 12 | Advanced algorithms (Part II): From \(Q\)-learning to Soft Actor-Critic | |
| Model-Based Control | ||
| Advanced Topics |
We have essentially “solved” the RL problem in part one of our lecture! But there are quite a few limitations when it comes to real-world systems:
\(^*\) For now, however, we will stick to finite action spaces.
We use \(n\) to denote the number of state variables, but this does not mean that we need to have \(n=\infty\) for continuous state spaces. A state space is finite if each state variable \(s_k\) is an element of a finite set, whereas it is continous, if at least one \(s_1,\ldots,s_n\) is a continuous variable.
Updates due to new experience lead to updates in our model parameter \(\theta\). This means that we modify our value estimate for many states, in contrast to tabular methods where we update individual states only.
đź’ˇ We will use greek letters (e.g., \(\theta\) or \(\phi\)) to indicate the trainable parameters.
The RL prediction objective is defined as the mean squared value error \[ \overline{VE}(\theta) = \int_\Sc \mu(s) \big(V^\pi(s) - V_\theta(s)\big)^2 \ds. \] Here, \(\mu(s)\geq 0\) is the state distribution, with \(\int_\Sc \mu(s)\ds = 1\).
Challenge:
As \(\overline{VE}(\theta)\) is the expected error, we can use Monte Carlo sampling to estimate it:
\[\begin{equation} L(\theta) = \sum_{k=1}^N \big(V^\pi(s_k) - V_\theta(s_k)\big)^2 \approx \overline{VE}(\theta). \label{eq:VAL_L}\end{equation}\]
Goal: find the optimal value estimate \(V_{\theta^*}\) by optimization: \[\theta^* = \arg\min_\theta L(\theta).\]
Challenges:
đź’ˇ Off-policy prediction: importance sampling to transform the sampled value estimates from the behavior to the target policy.\(\qquad\quad\)
\(\Rightarrow\) Increases the prediction variance; in combination with generalization errors due to function approximation: large risk of diverging.
Given a new sample \(s_t\) and its (for now exact) value \(V^\pi(s_t)\), we can update our current weight vector \(\iterate{\theta}{t}\) by using stochastic gradient descent (SGD) with: \[\begin{equation}\iterate{\theta}{t+1} = \iterate{\theta}{t} - \frac{1}{2} \alpha\nablatheta \rbracket{\cbracket{V^\pi(s_t) - V_\theta(s_t)}^2} \fragment{=\iterate{\theta}{t} + \alpha\rbracket{V^\pi(s_t) - V_\theta(s_t)} \nablatheta V_\theta(s_t).} \label{eq:VAL_SGD-update} \end{equation}\]
Question: Should we follow \(\eqref{eq:VAL_SGD-update}\) to optimality?
\(\Rightarrow\) No! We are optimizing for a single sample \(\rightarrow\) overfitting.
\(\Rightarrow\) Take a single, small step only, then collect new experience.
Challenge 1. remains: We do not know \(V^\pi(s_t)\) which we need for our prediction objective \(\eqref{eq:VAL_L}\).
What can we do? \(\Rightarrow\) Use an estimate for \(V^\pi(s_t)\) that we can compute from experience!
Possible options for estimates:
Monte Carlo estimates for \(g_t\), sample from entire trajectories.
đź’ˇ Inserting the bootstrapped versions into \(\eqref{eq:VAL_SGD-update}\) does not yield a true gradient descent method, as we are ignoring the parametric dependency of the target on \(\theta\) (Sutton and Barto 1998). More on this on the next slides.
Input:
Initialize: Value function weights \(\theta\in\R^d\) arbitrarily
for \(k = 1, 2, \ldots, K\) episodes:
\(\quad\) Generate a sequence following \(\pi\): \[((s_0,a_0,r_0),(s_1,a_1,r_1),\ldots,(s_{T_k-1},a_{T_k-1},r_{T_k-1}))\] \(\quad\) Calculate the every-visit returns \(g_t\)
\(\quad\) for \(t = 0,1,\ldots,T-1\):
\(\quad\quad\) \(\theta \gets \theta + \alpha\rbracket{g_t - V_\theta(s_t)} \nablatheta V_\theta(s_t)\)
\[\begin{equation} \theta \gets\theta + \alpha\rbracket{r_t + \gamma V_\theta(s_{t+1}) - V_\theta(s_t)} \nablatheta V_\theta(s_t). \label{eq:VAL_TDsemigradient} \end{equation}\]
đź’ˇ Even though widely applied, there are no convergence guarantees for semi-gradient methods (Sutton and Barto 1998).
Input:
Initialize: Value function weights \(\theta\in\R^d\) arbitrarily
for \(k = 1, 2, \ldots, K\) episodes:
\(\quad\) Initialize \(s_0\)
\(\quad\) for \(t = 0,1,\ldots,T-1\):
\(\quad\quad\) Apply action \(a_t\sim \policy{\cdot}{s_t}\)
\(\quad\quad\) Observe \(r_t\) and \(s_{t+1}\)
\(\quad\quad\) \(\theta \gets \theta + \alpha\rbracket{r_t + \gamma V_\theta(s_{t+1}) - V_\theta(s_t)} \nablatheta V_\theta(s_t)\)
\(\quad\quad\) if \(s_{t+1}\) is \(\mathsf{terminal}\) then \(\STOP\)
The above algorithm can be extended towards \(n\)-step bootstrapping by replacing the TD(\(0\)) target \(r_t + \gamma V_\theta(s_{t+1})\) with the \(n\)-step bootstrapped return \[ \begin{align*} g_{t:t+n} = &r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots \\ &+ \gamma^{n-1} r_{t+n-1} + \gamma^n V_{\theta}(s_{t+n}) . \end{align*} \] The most challenging part in the implementation is the assessment of the remaining number of steps (i.e., when \(T-t<n\)).
For details, see (Sutton and Barto 1998, Ch. 9.4).
Based on a dataset \(\Dc=\set{(s_0, V^\pi(s_0)),\ldots,(s_N, V^\pi(s_N))}\), repeat
Remarks: