Prof. Dr. Sebastian Peitz
Chair of Safe Autonomous Systems, TU Dortmund
| Chapter | Topic | Content |
|---|---|---|
| Basics & tabular methods | ||
| 1-5 | Bandits, MDPs, Dynamic Programming, Monte Carlo, TD Learning | RL basics in finite dimensions |
| Deep-learning-based methods | ||
| 6 | Brief introduction to deep learning | The basics for what comes next |
| 7 | Value function approximation | Value estimation with function approximation |
| 8 | Deep \(Q\)-learning | \(Q\)-learning with neural networks |
| 9 | Policy gradients | Direct optimization of the policy |
| 10 | Actor-critic algorithms | Improved policy gradients via value functions |
| 11 | Advanced algorithms (Part I): From policy gradient to PPO | |
| 12 | Advanced algorithms (Part II): From \(Q\)-learning to Soft Actor-Critic | |
| Model-Based Control | ||
| Advanced Topics | ||
The expected value of a function \(f(s)\) is
\[ \Expsub{f(s)}{s\sim p} = \int p(s) f(s) \ds. \]
Here, \(p\) is the density according to which \(s\) is distributed, with \(\int p(s) \ds = 1\).
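The integral above is rarely available in closed form, which is why the sampling versions later in this chapter replace it by an average over samples. A minimal sketch of this idea, assuming a uniform density on \([0,1]\) and \(f(s) = s^2\) as a purely illustrative example (the helper name `mc_expectation` is hypothetical):

```python
import random

def mc_expectation(f, sampler, n_samples=100_000, seed=0):
    """Estimate E_{s~p}[f(s)] by averaging f over samples drawn from p."""
    rng = random.Random(seed)
    return sum(f(sampler(rng)) for _ in range(n_samples)) / n_samples

# Illustrative choice: s ~ Uniform(0, 1) and f(s) = s^2, whose exact mean is 1/3.
estimate = mc_expectation(lambda s: s * s, lambda rng: rng.random())
print(estimate)
```

The estimate approaches \(1/3\) as the number of samples grows; the error decays like \(1/\sqrt{N}\).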
\[\nabla_\phi \piphi\agivenb{a}{s} = \piphi\agivenb{a}{s} \nabla_\phi \log \piphi\agivenb{a}{s}\]
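The log-derivative identity can be checked numerically. The sketch below assumes a softmax policy over three actions in a single state, with the logits playing the role of \(\phi\); this parametrization is an illustrative assumption and not prescribed by the lecture. It compares a finite-difference gradient of \(\pi_\phi(a)\) with \(\pi_\phi(a)\,\nabla_\phi \log \pi_\phi(a)\).

```python
import math

def softmax(phi):
    """Softmax probabilities for a list of logits (numerically stabilized)."""
    m = max(phi)
    exps = [math.exp(x - m) for x in phi]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(phi, a):
    """Analytic gradient of log pi_phi(a) for a softmax policy: 1[k == a] - pi(k)."""
    pi = softmax(phi)
    return [(1.0 if k == a else 0.0) - pi[k] for k in range(len(phi))]

def finite_diff_grad_pi(phi, a, eps=1e-6):
    """Central finite-difference gradient of pi_phi(a) with respect to phi."""
    grad = []
    for k in range(len(phi)):
        up = list(phi)
        up[k] += eps
        down = list(phi)
        down[k] -= eps
        grad.append((softmax(up)[a] - softmax(down)[a]) / (2 * eps))
    return grad

phi = [0.2, -0.5, 1.0]   # hypothetical logits
a = 2
lhs = finite_diff_grad_pi(phi, a)                            # grad of pi
rhs = [softmax(phi)[a] * g for g in grad_log_pi(phi, a)]     # pi * grad of log pi
max_err = max(abs(l - r) for l, r in zip(lhs, rhs))
print(max_err)
```

Both sides agree up to finite-difference error.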
Note: In the infinite-horizon case, \(\eqref{eq:AC_state_visitation_measure}\) becomes \(\eta_\phi(s) = \sum_{t=0}^{\infty} \pC{s_t = s}{p_0, \piphi}\) \(\Rightarrow\) Equation \(\eqref{eq:AC_policy_gradient_Q_episodic}\) becomes \[\nabla_\phi L(\phi) = \Expsub{\nablaphi \log \piphi\agivenb{a}{s} \Qpiphi(s, a)}{s \sim \eta_\phi, a \sim \piphi}.\]
Here is the policy gradient theorem in the two versions we have derived (sampling versions in blue).
\[ \nablaphi L(\phi) = \Expsub{\sum_{t=0}^{T-1} \nablaphi \log\piphi\agivenb{a_t}{s_t}\cbracket{\sum_{t'=t}^{T-1}r_{t'}}}{\tau\sim p_\phi(\tau)} \fragment{ \approx \textcolor{blue}{\frac{1}{N} \sum_{i=1}^N \cbracket{\sum_{t=0}^{T-1} \nablaphi \log\,\piphi\agivenb{a_{i,t}}{s_{i,t}}\cbracket{\sum_{t'=t}^{T-1} r_{i,t'} }}}. } \]
\[ \nabla_\phi L(\phi) = \Expsub{\sum_{t=0}^{T-1} \nablaphi \log \piphi\agivenb{a_t}{s_t} \Qpiphi(s_t, a_t)}{\tau\sim p_\phi(\tau)} \fragment{ \approx\textcolor{blue}{\frac{1}{N} \sum_{i=1}^N \cbracket{\sum_{t=0}^{T-1} \nablaphi \log\,\piphi\agivenb{a_{i,t}}{s_{i,t}} \Qpiphi(s_{i,t},a_{i,t})} }. } \]
Questions:
Note: The inconsistency between the superscripts \(\pi\) and \(\pi_\phi\) for \(Q\) is not accidental. We will soon see the reason behind it.
We would like to reduce the variance of the new formulation using a baseline \(b\): \[ \nabla_\phi L(\phi) = \Expsub{\sum_{t=0}^{T-1} \nablaphi \log \piphi\agivenb{a_t}{s_t} \rbracket{Q^\pi(s_t, a_t) - b}}{\tau\sim p_\phi(\tau)} \approx\textcolor{blue}{\frac{1}{N} \sum_{i=1}^N \cbracket{\sum_{t=0}^{T-1} \nablaphi \log\,\piphi\agivenb{a_{i,t}}{s_{i,t}} \rbracket{Q^\pi(s_{i,t},a_{i,t}) - b} } }. \]
The advantage function \(A^\pi(s,a)\) describes how much better the action \(a\) is over the average action when following \(\pi\): \[ A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s). \]
Policy gradient with value baseline: “maximize the policy likelihood, weighted by the advantage function”: \[ \nabla_\phi L(\phi) = \Expsub{\sum_{t=0}^{T-1} \nablaphi \log \piphi\agivenb{a_t}{s_t} A^\pi(s_t, a_t)}{\tau\sim p_\phi(\tau)} \fragment{ \approx\textcolor{blue}{\frac{1}{N} \sum_{i=1}^N \cbracket{\sum_{t=0}^{T-1} \nablaphi \log\,\piphi\agivenb{a_{i,t}}{s_{i,t}} A^\pi(s_{i,t},a_{i,t})} }. } \]
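That subtracting a baseline leaves the expected gradient unchanged can be verified exactly in a small case. The sketch below assumes a single state with three actions, a softmax policy, and made-up values for \(Q^\pi(s,a)\); all numbers and function names are illustrative. It computes the expectation over actions as an exact sum for \(b = 0\) and for \(b = V^\pi(s)\), the latter being the advantage-weighted form.

```python
import math

def softmax(phi):
    """Softmax probabilities for a list of logits (numerically stabilized)."""
    m = max(phi)
    exps = [math.exp(x - m) for x in phi]
    z = sum(exps)
    return [e / z for e in exps]

def expected_gradient(phi, q_values, baseline):
    """Exact E_{a ~ pi_phi}[ grad_phi log pi_phi(a) * (Q(a) - b) ] for one state."""
    pi = softmax(phi)
    n = len(phi)
    grad = [0.0] * n
    for a in range(n):
        for k in range(n):
            dlog = (1.0 if k == a else 0.0) - pi[k]   # d log pi(a) / d phi_k
            grad[k] += pi[a] * dlog * (q_values[a] - baseline)
    return grad

phi = [0.1, 0.4, -0.3]        # hypothetical policy logits
q_values = [1.0, 2.0, 0.5]    # hypothetical Q(s, a) for a single state
pi = softmax(phi)
v = sum(p * q for p, q in zip(pi, q_values))   # V(s) = E_a[Q(s, a)]

g_no_baseline = expected_gradient(phi, q_values, 0.0)
g_advantage = expected_gradient(phi, q_values, v)   # weights are A(s, a)
print(g_no_baseline)
print(g_advantage)
```

The two gradients coincide in expectation; the benefit of the baseline only shows up in the variance of the sampled estimate.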
Actor: Neural network with parameters \(\phi\) approximates the policy: \[\pi_\phi \approx \pi.\]
Critic: Approximates \(V^\pi\) / \(Q^\pi\) / \(A^\pi\) and criticizes the actor: \[ V_\theta \approx V^\pi, \qquad Q_\theta \approx Q^\pi, \qquad A_\theta \approx A^\pi.\]
Policy gradient using, e.g., \(Q_\theta\): \[\nabla_\phi L(\phi) = \Expsub{\sum_{t=0}^{T-1} \nablaphi \log \piphi\agivenb{a_t}{s_t} Q_\theta(s_t, a_t)}{\tau\sim p_\phi(\tau)}.\]
Inspired by Sergey Levine’s CS285 lecture.
Two assumptions that lead to the following approximation: \[Q^\pi(s_t, a_t) = \ExpCsub{r_{t}}{s_t,a_t}{\pi} + \Expsub{V^\pi(s_{t+1})}{s_{t+1}\sim \pC{\cdot}{s_t,a_t}} \fragment{ \approx r_{t} + V^\pi(s_{t+1}). }\]
💡 We already know what’s wrong with this approach!
\(\Rightarrow\) The target changes along with the fitted value function in step 2.
\(\Rightarrow\) Use target network \(\theta'\)?
\[ \underbrace{\nablaphi L(\phi) \approx \frac{1}{N} \sum_{i=1}^N \cbracket{\sum_{t=0}^{T-1} \nablaphi \log\,\piphi\agivenb{a_{i,t}}{s_{i,t}}\cbracket{\sum_{t'=t}^{T-1} \textcolor{red}{\gamma^{t'-t}} r_{i,t'} }}}_{\text{Option 1: } \eqref{eq:AC_discount_v1}} \quad\text{vs.}\quad \underbrace{\nablaphi L(\phi) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \textcolor{red}{\gamma^t} \nablaphi \log\pi_\phi\agivenb{a_{i,t}}{s_{i,t}}\cbracket{\sum_{t'=t}^{T-1}\textcolor{red}{\gamma^{t' - t}}r_{i,t'}}}_{\text{Option 2: }\eqref{eq:AC_discount_v3}} \]
\[ \begin{align*} \text{MC reward sampling:}\quad\nabla_\phi L(\phi) &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \nablaphi \log\pi_\phi\agivenb{a_{i,t}}{s_{i,t}}\cbracket{\sum_{t'=t}^{T-1}\textcolor{red}{\gamma^{t' - t}}r_{i,t'}} \\ \text{Advantage function:}\quad\nabla_\phi L(\phi) &\approx\frac{1}{N} \sum_{i=1}^N \Big(\sum_{t=0}^{T-1} \nablaphi \log\,\piphi\agivenb{a_{i,t}}{s_{i,t}} \underbrace{\cbracket{r_{i,t} + \textcolor{red}{\gamma} V_\theta(s_{i,t+1}) - V_\theta(s_{i,t})}}_{A_\theta(s_{i,t},a_{i,t})} \Big) \end{align*} \]
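The two weights that multiply \(\nabla_\phi \log \pi_\phi\) above can be computed per trajectory as follows. This is a minimal sketch under assumptions: rewards and critic outputs are given as plain Python lists, `values` contains one extra entry for the final state (zero if that state is terminal), and the function names are hypothetical.

```python
def discounted_rewards_to_go(rewards, gamma):
    """Return sum_{t' >= t} gamma^(t' - t) * r_t' for every t, computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def td_advantages(rewards, values, gamma):
    """One-step advantages r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` holds V(s_0), ..., V(s_T) and therefore has one more entry
    than `rewards`; V(s_T) should be 0 if s_T is terminal.
    """
    return [rewards[t] + gamma * values[t + 1] - values[t]
            for t in range(len(rewards))]

rewards = [1.0, 0.0, 2.0]          # hypothetical rewards of one trajectory
values = [2.0, 1.5, 1.8, 0.0]      # hypothetical critic outputs V_theta(s_t)
print(discounted_rewards_to_go(rewards, 0.9))
print(td_advantages(rewards, values, 0.9))
```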
Now that we need to fit two functions, \(\pi_\phi\) and \(V_\theta\), how do we do this in practice?
The algorithm is broken in two places! Can you spot them?
\(\Rightarrow\) To make actor-critic an off-policy algorithm, we need to fix both!
Step 3: Use the \(Q\)-function instead of the value function!
\(\circ\) To approximate the value function of the current policy \(\pi_\phi\), we can simply sample \(a'_i\sim \pi_\phi\agivenb{\cdot}{s'_i}\).
\(\circ\) Recall (for Step 4): \(V^\pi(s_i) = \Expsub{Q^\pi(s_i,a_i)}{a_i \sim \policy{\cdot}{s_i}}\).
Step 5: Sample the actions from the current policy, not from the replay buffer!
\(\circ\) \(a^{\pi_\phi}_{i} \sim \pi_\phi\agivenb{\cdot}{s_i}\).
Note 1: The data is still sampled from the “wrong” distribution, i.e., it is not from \(p_\phi(s)\)!
\(\Rightarrow\) There is nothing we can do about this. However, we can see this as positive, as we train a policy on a broader distribution.
Note 2: It is very common to use \(Q_\theta\) instead of \(A_\theta\) in Step 5, even though skipping the baseline leads to higher variance.
\(\Rightarrow\) Skipping the baseline is a good tradeoff, as we can simply sample additional actions (which does not require acquiring new states(!)), and thus shrink the variance arbitrarily!
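The two fixes can be sketched in a tabular setting. The code below is an illustration under strong assumptions, not the lecture's algorithm: the critic is a table `q[s][a]`, the policy is a softmax over per-state logits, state 1 is an absorbing state whose \(Q\)-values stay at zero, and all names are hypothetical. The critic target uses \(a'_i \sim \pi_\phi\agivenb{\cdot}{s'_i}\), and the actor gradient uses a freshly sampled \(a^{\pi_\phi}_i\) with \(Q_\theta\) as weight instead of the action stored in the replay buffer.

```python
import math
import random

def softmax(logits):
    """Softmax probabilities for a list of logits (numerically stabilized)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample_action(logits, rng):
    """Draw an action index from the softmax policy defined by `logits`."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs)[0]

def off_policy_actor_critic_step(batch, q, logits, gamma, lr_q, lr_pi, rng):
    """One update from replay-buffer transitions (s, a, r, s_next)."""
    # Critic: bootstrap with a' sampled from the current policy at s_next.
    for s, a, r, s_next in batch:
        a_next = sample_action(logits[s_next], rng)
        target = r + gamma * q[s_next][a_next]
        q[s][a] += lr_q * (target - q[s][a])

    # Actor: ignore the buffer action, sample a fresh one from the current policy.
    grads = [[0.0] * len(row) for row in logits]
    for s, _a_buffer, _r, _s_next in batch:
        a_pi = sample_action(logits[s], rng)
        probs = softmax(logits[s])
        for k in range(len(probs)):
            dlog = (1.0 if k == a_pi else 0.0) - probs[k]
            grads[s][k] += dlog * q[s][a_pi] / len(batch)
    for s in range(len(logits)):
        for k in range(len(logits[s])):
            logits[s][k] += lr_pi * grads[s][k]   # gradient ascent

# Toy problem (made up): in state 0, action 1 yields reward 1, action 0 yields 0.
rng = random.Random(0)
q = [[0.0, 0.0], [0.0, 0.0]]
logits = [[0.0, 0.0], [0.0, 0.0]]
replay_buffer = [(0, 0, 0.0, 1), (0, 1, 1.0, 1)]
for _ in range(200):
    off_policy_actor_critic_step(replay_buffer, q, logits,
                                 gamma=0.9, lr_q=0.5, lr_pi=0.5, rng=rng)
print(q[0], softmax(logits[0]))
```

In this toy problem the policy shifts toward the rewarded action, although the buffer content never changes.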
Let’s directly compare the policy gradient from the last lecture against our actor-critic procedure from today:
\[ \nablaphi L(\phi) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \nablaphi \log\pi_\phi\agivenb{a_{i,t}}{s_{i,t}}\cbracket{\cbracket{\sum_{t'=t}^{T-1}\gamma^{t'-t} r_{i,t'}} - b}\quad\]
\(\textcolor{green}{\mathbf{+}\text{ no bias}}\)
\(\textcolor{red}{\mathbf{-}\text{ high variance (single-sample estimate)}}\)
\[\nablaphi L(\phi) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \nablaphi \log\pi_\phi\agivenb{a_{i,t}}{s_{i,t}} \cbracket{r_{i,t} + \gamma V_\theta(s_{i,t+1}) - V_\theta(s_{i,t})}\]
\(\textcolor{green}{\mathbf{+}\text{ lower variance (due to critic)}}\)
\(\textcolor{red}{\mathbf{-}\text{ biased (if critic is imperfect)}}\)
Question: Can we use a critic (i.e., \(V_\theta\)) and still keep the estimator unbiased?
\(\Rightarrow\) We can make the baseline \(b\) state-dependent (proof very similar to the one from last week)!
\[ \nablaphi L(\phi) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \nablaphi \log\pi_\phi\agivenb{a_{i,t}}{s_{i,t}}\cbracket{\cbracket{\sum_{t'=t}^{T-1}\gamma^{t'-t} r_{i,t'}} - V_\theta(s_{i,t})} \]
\(\textcolor{green}{\mathbf{+}\text{ no bias}}\)
\(\textcolor{green}{\mathbf{+}\text{ lower variance}}\)
A natural follow-up question is: If a state-dependent baseline is stronger than a constant one, wouldn’t a baseline depending on states and actions be even better?
\(\Rightarrow\) The answer is yes! We can take the \(Q\)-function as the baseline as well. We call these approaches control variates.
\[ A_\theta(s_t,a_t) = \cbracket{\sum_{t'=t}^{T-1}\gamma^{t'-t} r_{t'}} - V_\theta(s_{t}) \]
\(\textcolor{green}{\mathbf{+}\text{ no bias}}\)
\(\textcolor{red}{\mathbf{-}\text{ higher variance (single-sample)}}\)
\[ A_\theta(s_t,a_t) = \cbracket{\sum_{t'=t}^{T-1}\gamma^{t'-t} r_{t'}} - Q_\theta(s_{t},a_t) \]
\(\textcolor{green}{\mathbf{+}\text{ goes to zero in expectation (if critic correct)}}\)
\(\textcolor{red}{\mathbf{-}\text{ formula is incorrect!}}\)
With the term that was neglected above (in blue): \[ \nablaphi L(\phi) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \nablaphi \log\pi_\phi\agivenb{a_{i,t}}{s_{i,t}}\cbracket{\hat{Q}_{i,t} - Q_\theta(s_{i,t}, a_{i,t})} \textcolor{blue}{+ \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \nablaphi \Expsub{Q_\theta(s_{i,t},a_{t})}{a_t \sim \pi_\phi\agivenb{\cdot}{s_{i,t}}}}. \]
\(\Rightarrow\) This one is often easier to estimate. Finite \(\Ac\): compute sum! Continuous \(\Ac\): sampling actions is easy!
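For a finite action set, the expectation of the critic over actions is an exact sum, and so is its gradient with respect to the policy parameters. A minimal sketch, assuming a softmax policy whose logits are the parameters and made-up critic values for one state (function names are hypothetical):

```python
import math

def softmax(logits):
    """Softmax probabilities for a list of logits (numerically stabilized)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def expected_q(logits, q_values):
    """E_{a ~ pi}[Q(s, a)] as an exact sum over a finite action set."""
    pi = softmax(logits)
    return sum(p * q for p, q in zip(pi, q_values))

def grad_expected_q(logits, q_values):
    """Gradient of E_{a ~ pi}[Q(s, a)] w.r.t. the softmax logits.

    For a softmax policy this equals pi(k) * (Q(s, k) - E_a[Q(s, a)]).
    """
    pi = softmax(logits)
    v = expected_q(logits, q_values)
    return [pi[k] * (q_values[k] - v) for k in range(len(logits))]

logits = [0.3, -0.2, 0.8]      # hypothetical policy logits for one state
q_values = [1.0, 0.0, 2.5]     # hypothetical critic outputs Q_theta(s, a)
analytic = grad_expected_q(logits, q_values)
print(expected_q(logits, q_values), analytic)
```

No sampling is involved here, so this term adds no variance of its own for finite action sets.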
Similar to standard TD learning, we can again find some middle ground between the one-step estimate and the MC estimate: \[\begin{align*} \text{One-step TD estimate:}\qquad A_\theta(s_t,a_t) &= r_{t} + \gamma V_\theta(s_{t+1}) - V_\theta(s_t), \\ \text{MC estimate:}\qquad A_\theta(s_t,a_t) &= \cbracket{\sum_{t'=t}^{T-1}\gamma^{t'-t} r_{t'}} - V_\theta(s_{t}). \end{align*}\]
Just as before, we can simulate \(n\) steps, before bootstrapping the non-simulated piece of our trajectory: \[\qquad\qquad\qquad\qquad\text{$n$-step estimate:}\qquad A^n_\theta(s_t,a_t) = \cbracket{\sum_{t'=t}^{t+n-1}\gamma^{t'-t} r_{t'}} + \gamma^n V_\theta(s_{t+n}) - V_\theta(s_{t}).\]
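A short sketch of the \(n\)-step estimate, summing exactly \(n\) rewards \(r_t, \ldots, r_{t+n-1}\) before bootstrapping from \(V_\theta(s_{t+n})\). It assumes list inputs with one extra critic value for the final state and truncates \(n\) at the end of the episode; the function name and the numbers are illustrative.

```python
def n_step_advantage(rewards, values, t, n, gamma):
    """n-step advantage: n discounted rewards, then bootstrap from V(s_{t+n}).

    `values` holds V(s_0), ..., V(s_T), one more entry than `rewards`.
    If fewer than n steps remain, n is truncated to the episode end.
    """
    n = min(n, len(rewards) - t)
    n_step_return = sum(gamma ** k * rewards[t + k] for k in range(n))
    return n_step_return + gamma ** n * values[t + n] - values[t]

rewards = [1.0, 0.0, 2.0]          # hypothetical rewards of one trajectory
values = [2.0, 1.5, 1.8, 0.0]      # hypothetical critic outputs, terminal value 0
one_step = n_step_advantage(rewards, values, t=0, n=1, gamma=0.9)
full = n_step_advantage(rewards, values, t=0, n=3, gamma=0.9)
print(one_step, full)
```

With \(n = 1\) this reduces to the one-step TD estimate, and with \(n\) equal to the remaining episode length (and terminal value zero) to the MC estimate.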
From the \(n\)-step estimate, the next step in TD learning was to average over all possible \(n\)-step estimates \(\Rightarrow\) TD(\(\lambda\))!
\[\begin{align*} A^{\mathsf{GAE}}_\theta(s_t,a_t) &= r_t + \gamma \Big((1-\lambda) V_\theta(s_{t+1}) + \lambda \big(r_{t+1} + \gamma \cbracket{(1-\lambda) V_\theta(s_{t+2}) + \lambda (r_{t+2} + \ldots)}\big)\Big) - V_\theta(s_t), \\ A^{\mathsf{GAE}}_\theta(s_t,a_t) &= \sum_{t'=t}^\infty (\gamma\lambda)^{t'-t} \delta_{t'}, \qquad \text{with TD error}\quad\delta_{t'} = r_{t'} + \gamma V_\theta(s_{t'+1}) - V_\theta(s_{t'}). \end{align*}\]
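\(\Rightarrow\) The factor \(\gamma\lambda\) acts like a discount factor and serves as a means to trade off bias and variance: \(\lambda = 0\) recovers the one-step TD estimate, \(\lambda = 1\) the MC estimate!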
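The sum of discounted TD errors can be evaluated with a single backward pass via the recursion \(A_t = \delta_t + \gamma\lambda A_{t+1}\). The sketch below assumes a finite episode, so the infinite sum is truncated at the episode end, and list inputs with one extra critic value for the final state; the function name and numbers are illustrative.

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized advantage estimates for one finite trajectory.

    Uses A_t = delta_t + gamma * lam * A_{t+1} with
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `values` holds V(s_0), ..., V(s_T), one more entry than `rewards`.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [1.0, 0.0, 2.0]          # hypothetical rewards of one trajectory
values = [2.0, 1.5, 1.8, 0.0]      # hypothetical critic outputs, terminal value 0
td_like = gae_advantages(rewards, values, gamma=0.9, lam=0.0)   # one-step TD
mc_like = gae_advantages(rewards, values, gamma=0.9, lam=1.0)   # MC minus V
print(td_like)
print(mc_like)
```

Intermediate values of `lam` interpolate between the two extremes.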