Prof. Dr. Sebastian Peitz
Chair of Safe Autonomous Systems, TU Dortmund
| Chapter | Topic | Content |
|---|---|---|
| Basics & tabular methods | ||
| 1-5 | Bandits, MDPs, Dynamic Programming, Monte Carlo, TD Learning | RL basics in finite dimensions |
| Deep-learning-based methods | ||
| 6 | Brief introduction to deep learning | The basics for what comes next |
| 7 | Value function approximation | Value estimation with function approximation |
| 8 | Deep Q-learning | Q-learning with neural networks |
| 9 | Policy gradients | Direct optimization of the policy |
| 10 | Actor-critic algorithms | Improved policy gradients via value functions |
| 11 | Advanced algorithms (Part I): From policy gradient to PPO | The evolution of modern RL algorithms |
| 12 | Advanced algorithms (Part II): From \(Q\)-learning to Soft Actor-Critic | The evolution of moderl RL algorithms |
| Model-Based Control | ||
| Advanced Topics |
\[\begin{align*} r_t &= -s_t^2 - a_t^2 \\ \log\pi_\phi\agivenb{a_t}{s_t} &= -\frac{1}{2\sigma^2}(k s_t - a_t)^2 + C, \fragment{ \qquad \phi=(k,\sigma). } \\ \nablaphi\log\pi_\phi\agivenb{a_t}{s_t} &= \begin{pmatrix} -\frac{(k s_t - a_t)s_t}{\sigma^2} \\ \frac{(k s_t - a_t)^2}{\sigma^3} \end{pmatrix} \end{align*}\]
Closely related to ill-conditioned optimization problems: 
Source: (Peters and Schaal 2008)
💡 The KL divergence is not a real distance. For instance, it it not symmetric (i.e., \(\KLdiv{p}{q} \neq \KLdiv{q}{p}\) in general), which means that it also does not satisfy the triangle inequality. Still, it gives us a measure of how far two distributions are apart.
In simpler terms: How do variations of \(\phi\) influence the resulting probability distribution?
\[\begin{equation} \phi \gets \phi + \alpha F^{-1} \nablaphi L(\phi) \tag{\ref{eq:Adv_NPG}} \end{equation}\]
💡 For more details on this matter, see (S. Kakade and Langford 2002; Schulman et al. 2015).
\[\begin{equation} \eta(\subnew{\pi}) = \eta(\subold{\pi}) + \Expsub{A_{\subold{\pi}}(s,a)}{s \sim \rho_{\subnew{\pi}}, a\sim\subnew{\pi}} \label{eq:Adv_performance_measures} \end{equation}\]
💡 When sampling, the data will automatically be distributed according to \(\rho_{\subnew{\pi}}(s)\). See also the discussion in the actor-critic lecture.
\[\begin{equation} \eta(\subnew{\pi}) = \eta(\subold{\pi}) + \Expsub{A_{\subold{\pi}}(s,a)}{s \sim \rho_{\subnew{\pi}}, a\sim\subnew{\pi}} \tag{\ref{eq:Adv_performance_measures}} \end{equation}\]
\(\textcolor{green}{\mathbf{+}\text{ Proof that if you can find a new policy with a positive expected advantage under the old policy, your policy will get better.}}\)
\(\textcolor{red}{\mathbf{-}\text{ To compute the exact value, we must sample states from the new policy (}s \sim p_{\subnew{\pi}}\text{), which we do not yet know!}}\)
Solution approach in TRPO: “If \(\subnew{\pi}\) is very close to \(\subold{\pi}\) (enforced by a KL-constraint), we can swap the state distribution \(p_{\subnew{\pi}}\) for \(p_{\subnew{\pi}}\) and safely optimize an approximation (i.e., a surrogate objective).”
\[\begin{equation} \eta(\subnew{\pi}) = \eta(\subold{\pi}) + \Expsub{A_{\subold{\pi}}(s,a)}{s \sim \rho_{\subnew{\pi}}, a\sim\subnew{\pi}} \tag{\ref{eq:Adv_performance_measures}} \end{equation}\]
Starting with \(\eqref{eq:Adv_performance_measures}\), we need to take the several steps to arrive at an algorithm that we can actually implement.
The first realization of these steps resulted in the TRPO algorithm, which was derived in (Schulman et al. 2015).
\[\begin{equation} \eta(\subnew{\pi}) = \eta(\subold{\pi}) + \Expsub{A_{\subold{\pi}}(s,a)}{s \sim \rho_{\subnew{\pi}}, a\sim\subnew{\pi}} \tag{\ref{eq:Adv_performance_measures}} \end{equation}\]
In (Schulman et al. 2015), it was shown that error between the two expressions \(\eqref{eq:Adv_performance_measures}\) and \(\eqref{eq:Adv_surrogate_loss}\) can be bounded \[\eta(\pi) \ge L_\subold{\pi}(\pi) - C \cdot \KLdivmax{\subold{\pi}}{\pi}, \qquad \text{with}\quad\KLdivmax{\subold{\pi}}{\pi}=\max_{s\in\Sc}\KLdiv{\pi_\phi\agivenb{\cdot}{s}}{\pi_{\phi'}\agivenb{\cdot}{s}},\] and \(C\) is a constant derived from the environment’s transition dynamics and the maximum bounds of the advantage function.
Central message: If we maximize the right-hand side of this inequality, we are guaranteed to improve (or at least maintain) our true performance \(\eta(\pi)\)!
Maximizing the surrogate loss \(\Rightarrow\) introducing a parametric policy \(\pi_\phi\), our surrogate-based optimization problem is \[ \max_\phi L_\subold{\pi}(\pi_\phi) - C \cdot \KLdivmax{\subold{\pi}}{\pi_\phi}, \] or, if we consider two iterates \(\phi\) and \(\phi'\) \[ \max_{\phi'} L_{\pi_\phi}(\pi_{\phi'}) - C \cdot \KLdivmax{\pi_\phi}{\pi_{\phi'}}. \]
Challenges:
Our final observation is that in our surrogate loss function \(L_\subold{\pi}\) (Eq. \(\eqref{eq:Adv_surrogate_loss}\)), the “old” performance measure \(\eta(\subold{\pi})\) serves as a constant \(\Rightarrow\) we can neglect it!
\[\begin{equation} \max_{\phi} \underbrace{\Expsub{\frac{\pi_\phi\agivenb{a}{s}}{\subold{\pi}\agivenb{a}{s}}A_{\subold{\pi}}(s,a)}{s\sim \rho_{\subold{\pi}}, a\sim\subold{\pi}}}_{=L_\mathsf{TRPO}(\phi)} \quad\text{subject to}\quad \underbrace{\Expsub{\KLdiv{\subold{\pi}\agivenb{\cdot}{s}}{\pi_{\phi}\agivenb{\cdot}{s}}}{s \sim \rho_{\subold{\pi}}}}_{=\KLdivavg{\subold{\pi}}{\pi_{\phi}}} \leq \delta. \label{eq:Adv_TRPO_Opt} \end{equation}\]
Next question: How do we solve it efficiently? \(\Rightarrow\) Let’s start with the gradient!
\[ \nablaphi L_\mathsf{TRPO}(\phi) = \nablaphi \Expsub{\frac{\pi_\phi\agivenb{a}{s}}{\subold{\pi}\agivenb{a}{s}}A_{\subold{\pi}}(s,a)}{s\sim \rho_{\subold{\pi}}, a\sim\subold{\pi}} \fragment{ = \Expsub{\frac{\nablaphi \pi_\phi\agivenb{a}{s}}{\subold{\pi}\agivenb{a}{s}}A_{\subold{\pi}}(s,a)}{s\sim \rho_{\subold{\pi}}, a\sim\subold{\pi}} } \]
\[\begin{equation} \begin{aligned} g &= \nablaphi L_\mathsf{TRPO}(\phi) \fragment{ \Big|_{\textcolor{red}{\phi=\subold{\phi}}} } \fragment{ =\Expsub{\frac{\nablaphi \pi_{\phi}\agivenb{a}{s}\big|_{\phi=\subold{\phi}}}{\pi_\subold{\phi}\agivenb{a}{s}}A_{\pi_\subold{\phi}}(s,a)}{s\sim \rho_{\pi_\subold{\phi}}, a\sim\pi_\subold{\phi}} } \\ &=\Expsub{\nablaphi \log \pi_\phi\agivenb{a}{s}\big|_{\phi=\subold{\phi}} A_{\pi_\subold{\phi}}(s,a)}{s\sim \rho_{\pi_\subold{\phi}}, a\sim\pi_\subold{\phi}} \quad\textcolor{grey}{\text{\textit{(Log-identity trick)}}} \end{aligned} \label{eq:Adv_TRPO_gradient}\end{equation}\] is the standard policy gradient with value baseline!
To solve Problem \(\eqref{eq:Adv_TRPO_Opt}\) efficiently, we need two steps:
\[\begin{equation} \max_{\Delta\phi} g^\top\Delta\phi \quad\text{subject to}\quad \frac{1}{2}\Delta\phi^\top F \Delta\phi \leq \delta, \label{eq:Adv_TRPO_LQOpt} \end{equation}\] where, according to \(\eqref{eq:Adv_Fisher_Information_Matrix}\), \[F(\phi) = \Expsub{ \rbracket{\cbracket{\nablaphi \log\, \pi_{\phi}\agivenb{a}{s}} \cbracket{\nablaphi \log\, \pi_{\phi}\agivenb{a}{s}}^\top}_{\phi=\subold{\phi}}}{s\sim \rho_{\pi_\subold{\phi}}, a\sim\pi_\subold{\phi}}.\]
The solution (via Lagrange multipliers) yields \[\begin{equation} \Delta \phi = \sqrt{\frac{2\delta}{g^\top F^{-1} g}} F^{-1} g. \label{eq:Adv_TRPO_step} \end{equation}\] Note: This is NPG \(\eqref{eq:Adv_NPG}\) with automatic step size selection \(\alpha=\sqrt{\frac{2\delta}{g^\top F^{-1} g}}\)!
💡 In classical optimization, the local approximation \(\eqref{eq:Adv_TRPO_LQOpt}\) is solved repeatedly. The solution can be shown to converge to the optimum of \(\eqref{eq:Adv_TRPO_Opt}\), which is known as Sequential Quadratic Programming (SQP). In TRPO, we solve the approximation \(\eqref{eq:Adv_TRPO_LQOpt}\) just once before collecting new data.
\[\begin{equation} \Delta \phi = \sqrt{\frac{2\delta}{g^\top F^{-1} g}} F^{-1} g \tag{\ref{eq:Adv_TRPO_step}} \end{equation}\]
\[\begin{equation}\begin{aligned} \max_{\phi} L_\mathsf{TRPO}(\phi) \quad\text{s.t.}\quad \KLdivavg{\subold{\pi}}{\pi_{\phi}} \leq \delta. \end{aligned} \tag{\ref{eq:Adv_TRPO_Opt}} \end{equation}\]
\[\begin{equation} \max_{\Delta\phi} g^\top\Delta\phi \quad\text{subject to}\quad \frac{1}{2}\Delta\phi^\top F \Delta\phi \leq \delta, \tag{\ref{eq:Adv_TRPO_LQOpt}} \end{equation}\]
Issue: Even though we have selected an optimal step size for Problem \(\eqref{eq:Adv_TRPO_LQOpt}\) via Eq. \(\eqref{eq:Adv_TRPO_step}\), we are not guaranteed that in \(\eqref{eq:Adv_TRPO_Opt}\), we
Step length selection via backtracking. Define a shrinkage parameter \(\alpha\in(0,1)\) and
Note: Condition (II) is checked point-wise on the training batch: \[\begin{align*} &\KLdivavg{\subold{\phi}}{\subnew{\phi}}= \\ &\frac{1}{N}\sum_{i=1}^N \KLdiv{\pi_{\subold{\phi}}\agivenb{\cdot}{s_i}}{\pi_{\subnew{\phi}}\agivenb{\cdot}{s_i}}. \end{align*}\] That’s why TRPO does not apply to deterministic policies right away!
Input: Initial policy parameters \(\iterate{\phi}{0}\), trust-region bound \(\delta\), backtracking parameter \(\alpha \in (0, 1)\), acceptance threshold \(\epsilon > 0\).
For iterations \(k = 0, 1, 2, \dots\) until convergence:
Collect Data ──► Advantage estimation ──► Compute Gradient (g) ──► CG Loop (F⁻¹g) ──► Line Search (KL ≤ δ) ──► Update θ
▲ │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
As an example, we study the Half Cheetah example from the MuJoCo library.
\[\begin{equation} L_\mathsf{CLIP}(\phi) = \Exphat{\min\left( \kappa_t(\phi)A_t, \, \mathsf{clip}(\kappa_t(\phi), 1-\epsilon, 1+\epsilon)A_t \right)}, \label{eq:Adv_PPO_Objective} \end{equation}\] where we have simplified the notation in several ways:
\[L_\mathsf{TRPO}(\phi) = \E_{s\sim \rho_{\subold{\pi}}, a\sim\subold{\pi}}\Big[\underbrace{\frac{\pi_\phi\agivenb{a}{s}}{\subold{\pi}\agivenb{a}{s}}}_{=\kappa(\phi)}A_{\subold{\pi}}(s,a)\Big]\quad \text{vs.} \quad L_\mathsf{CLIP}(\phi) = \Exphat{\min\left( \kappa_t(\phi)A_t, \, \mathsf{clip}(\kappa_t(\phi), 1-\epsilon, 1+\epsilon)A_t \right)}.\]
The clipping function \(\mathsf{clip}(r_t(\phi), 1-\epsilon, 1+\epsilon)\) bounds the importance sampling ratio \(\kappa_t\) to
The \(\min\) function ensures that we take the minimum of the clipped and unclipped objective \(\Rightarrow\) \(L_\mathsf{CLIP}\) is a lower (i.e., pessimistic) bound on the unclipped objective.
\[L_\mathsf{TRPO}(\phi) = \E_{s\sim \rho_{\subold{\pi}}, a\sim\subold{\pi}}\Big[\underbrace{\frac{\pi_\phi\agivenb{a}{s}}{\subold{\pi}\agivenb{a}{s}}}_{=\kappa(\phi)}A_{\subold{\pi}}(s,a)\Big]\quad \text{vs.} \quad L_\mathsf{CLIP}(\phi) = \Exphat{\min\left( \kappa_t(\phi)A_t, \, \mathsf{clip}(\kappa_t(\phi), 1-\epsilon, 1+\epsilon)A_t \right)}.\]
Input: Initial policy parameters \(\iterate{\phi}{0}\), clipping parameter \(\epsilon\), gradient descent algorithm (Adam, RMSprop, …)
For iterations \(k = 0, 1, 2, \dots\) until convergence:
If changes are small, then we can try to use data in multiple consecutive iterations. Tradeoff:
\(\textcolor{red}{\mathbf{-}\text{ Data is not entirely on-policy.}}\qquad\) \(\textcolor{green}{\mathbf{+}\text{ More data }\Rightarrow\text{ lower variance.}}\)
As an example, we study the Half Cheetah example from the MuJoCo library.