Prof. Dr. Sebastian Peitz
Chair of Safe Autonomous Systems, TU Dortmund
| Chapter | Topic | Content |
|---|---|---|
| Basics & tabular methods | ||
| 1-5 | Bandits, MDPs, Dynamic Programming, Monte Carlo, TD Learning | RL basics in finite dimensions |
| Deep-learning-based methods | ||
| 6 | Brief introduction to deep learning | The basics for what comes next |
| 7 | Value function approximation | Value estimation with function approximation |
| 8 | Deep \(Q\)-learning | \(Q\)-learning with neural networks |
| 9 | Policy gradients | Direct optimization of the policy |
| 10 | Actor-critic algorithms | |
| 11 | Advanced algorithms (Part I): From policy gradient to PPO | |
| 12 | Advanced algorithms (Part II): From \(Q\)-learning to Soft Actor-Critic | |
| Model-Based Control | ||
| Advanced Topics |
Inspired by Sergey Levine’s CS285 lecture.
Alternative approach:
\[ \begin{align*} &\Rightarrow\quad p_\phi(\underbrace{s_0,a_0,\ldots,s_{T-1},a_{T-1},s_{T}}_{=\tau}) \fragment{= p_\phi(\tau)} \fragment{= p(s_0) \prod_{t=0}^{T-1} \pi_\phi\agivenb{a_t}{s_t} \pC{s_{t+1}}{s_t,a_t}.} \\ &\Rightarrow\quad \phi^* = \arg\max_{\phi}\Expsub{\sum_{t=0}^{T-1}r_t}{\tau\sim p_\phi(\tau)} \fragment{ = \arg\max_{\phi}\Expsub{V^{\pi_\phi}(s_0)}{s_0 \sim p(s_0)}. } \end{align*} \]
\[ \begin{align*} \phi^* &= \arg\max_{\phi}\Expsub{\sum_{t=0}^{T-1}r_t}{\tau\sim p_\phi(\tau)} = \arg\max_{\phi}\Expsub{r(\tau)}{\tau\sim p_\phi(\tau)}\\ \text{Infinite horizon case:}\quad\phi^* &= \arg\max_{\phi}\Expsub{r}{(s,a)\sim p_\phi(s,a)}\\ \text{Finite horizon case:}\quad\phi^* &= \arg\max_{\phi}\sum_{t=0}^{T-1}\Expsub{r_t}{(s_t,a_t)\sim p_\phi(s_t,a_t)} \end{align*} \]
đź’ˇ We are not considering the discount factor \(\gamma\) for now.
\[ \phi^* = \arg\max_{\phi}\underbrace{\Expsub{\sum_{t=0}^{T-1}r_t}{\tau\sim p_\phi(\tau)}}_{L(\phi)} \]
\(\Rightarrow\) Monte Carlo sampling! \[L(\phi) = \Expsub{\sum_{t=0}^{T-1}r_t}{\tau\sim p_\phi(\tau)} \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=0}^{T-1}r_{i,t} \]
Inspired by Sergey Levine’s CS285 lecture.
Policy gradient: Depending on the return, the plan is to increase or reduce the trajectories’ likelihoods by adapting the policy \(\pi\).
đź’ˇ The following policy gradient theorem was originally introduced in (Richard S. Sutton et al. 1999).
The expected value of a function \(x(\tau)\) is
\[ \Expsub{x(\tau)}{\tau\sim p} = \int p(\tau) x(\tau) \dtau. \]
Here, \(p\) is the density according to which \(\tau\) is distributed, with \(\int p(\tau) \dtau = 1\).
\[ \begin{align*} \phi^* &= \arg\max_{\phi}\underbrace{\Expsub{\sum_{t=0}^{T-1}r_t}{\tau\sim p_\phi(\tau)}}_{L(\phi)}\\ L(\phi) &= \E_{\tau\sim p_\phi(\tau)}[\underbrace{r(\tau)}_{=\sum_{t=0}^{T-1}r_t}] = \int p_\phi(\tau) r(\tau) \dtau\\ \nablaphi L(\phi) &= \nablaphi \rbracket{\int p_\phi(\tau) r(\tau) \dtau}\\ &= \int \textcolor{red}{\nablaphi p_\phi(\tau)} r(\tau) \dtau \end{align*}\]
Note: Integration and differentiation are linear \(\Rightarrow\) We can swap!
\[ \begin{equation} \textcolor{red}{\nablaphi p_\phi(\tau)} \fragment{ = p_\phi(\tau) \frac{\nablaphi p_\phi(\tau)}{p_\phi(\tau)} } \fragment{ = \textcolor{blue}{p_\phi(\tau) \nablaphi \log p_\phi(\tau)} } \label{eq:PG_log_identity} \end{equation} \]
\[\begin{align*} \quad~= \int \textcolor{blue}{p_\phi(\tau) \nablaphi \log p_\phi(\tau)} r(\tau) \dtau \end{align*} \] \[\begin{align} \quad = \Expsub{\nablaphi \log p_\phi(\tau) r(\tau)}{\tau\sim p_\phi(\tau)} \label{eq:PG_policy_gradient} \end{align} \]
\[ \begin{align*} \phi^* &= \arg\max_{\phi}L(\phi)\\
L(\phi) &= \Expsub{r(\tau)}{\tau\sim p_\phi(\tau)} = \int p_\phi(\tau) r(\tau) \dtau\\
\nablaphi L(\phi) &= \Expsub{\textcolor{blue}{\nablaphi \log p_\phi(\tau)} r(\tau)}{\tau\sim p_\phi(\tau)}
\end{align*}\]
\[\begin{align*} \underbrace{p_\phi(s_0,a_0,\ldots,s_{T-1},a_{T-1},s_{T})}_{=p_\phi(\tau)} &= p(s_0) \prod_{t=0}^{T-1} \pi_\phi\agivenb{a_t}{s_t} \pC{s_{t+1}}{s_t,a_t} \\ \text{take log on both sides!}\quad &\Downarrow \quad\fragment{ \textcolor{gray}{(\log(a\cdot b) = \log(a) + \log(b))} } \\ \log p_\phi(\tau) = \log p(s_0) + &\sum_{t=0}^{T-1} \rbracket{\log\pi_\phi\agivenb{a_t}{s_t} + \log \pC{s_{t+1}}{s_t,a_t}} \end{align*}\]
\[\begin{align*} \textcolor{blue}{\nablaphi \log p_\phi(\tau)} &= \nablaphi \rbracket{\log p(s_0) + \sum_{t=0}^{T-1} \rbracket{\log\pi_\phi\agivenb{a_t}{s_t} + \log \pC{s_{t+1}}{s_t,a_t}}}\\ &= \nablaphi \log p(s_0) + \sum_{t=0}^{T-1} \rbracket{\nablaphi \log\pi_\phi\agivenb{a_t}{s_t} + \nablaphi \log \pC{s_{t+1}}{s_t,a_t}}\\ &= \textcolor{red}{\underbrace{\cancel{\nablaphi \log\, p(s_0)}}_{=0}} + \sum_{t=0}^{T-1} \nablaphi \log\pi_\phi\agivenb{a_t}{s_t} + \textcolor{red}{\underbrace{\cancel{\nablaphi \log\, \pC{s_{t+1}}{s_t,a_t}}}_{=0}} \end{align*}\]
\[\begin{align*} \nablaphi L(\phi) = \Expsub{\cbracket{\sum_{t=0}^{T-1} \nablaphi \log\pi_\phi\agivenb{a_t}{s_t}}r(\tau)}{\tau\sim p_\phi(\tau)}\\ = \Expsub{\cbracket{\sum_{t=0}^{T-1} \nablaphi \log\pi_\phi\agivenb{a_t}{s_t}}\cbracket{\sum_{t=0}^{T-1}r_t}}{\tau\sim p_\phi(\tau)} \end{align*}\]
What does this mean?
\[\nablaphi L(\phi) = \Expsub{\cbracket{\sum_{t=0}^{T-1} \nablaphi \log\pi_\phi\agivenb{a_t}{s_t}}\cbracket{\sum_{t=0}^{T-1}r_t}}{\tau\sim p_\phi(\tau)}\]
We now have a formulation of the policy gradient that we can approximate using Monte Carlo sampling: \[ \nablaphi L(\phi) = \Expsub{\cbracket{\sum_{t=0}^{T-1} \nablaphi \log\pi_\phi\agivenb{a_t}{s_t}}\cbracket{\sum_{t=0}^{T-1}r_t}}{\tau\sim p_\phi(\tau)} \fragment{ \approx \textcolor{blue}{\frac{1}{N} \sum_{i=1}^N} \underbrace{\textcolor{blue}{\cbracket{\sum_{t=0}^{T-1} \nablaphi \log\,\pi_\phi\agivenb{a_{i,t}}{s_{i,t}}}}}_{\fragment{ \text{(I)} }}\underbrace{\textcolor{blue}{\cbracket{\sum_{t=0}^{T-1}r_{i,t}}}}_{\fragment{ \text{(II)} }}. }\]
(I) This term is simpliy the derivative of the policy network w.r.t. the weights \(\Rightarrow\) backpropagation!
(II) This term is the experience we collect from interactions with the environment.
Maximum likelihood gradient: \[\nablatheta \log L(\theta) = \sum_{i=1}^N \nablatheta \log p_\theta\agivenb{y_i}{x_i}.\]
Comparison against the policy gradient (MC sampling):
\[ \nablaphi L(\phi) \approx \frac{1}{N} \sum_{i=1}^N \cbracket{\sum_{t=0}^{T-1} \nablaphi \log\pi_\phi\agivenb{a_{i,t}}{s_{i,t}}}\cbracket{\sum_{t=0}^{T-1}r_{i,t}}.\]
The policy gradient also holds for partially observed MDPs (POMDPs). That is, for policies \(\policy{a}{o}\). In simple terms, the reason is that the policy gradient theorem does not make use of the Markov property.
\[\nablaphi L(\phi) \approx \frac{1}{N} \sum_{i=1}^N \nablaphi \log\pi_\phi(\tau_i) r(\tau_i). \]
\[\nablaphi L(\phi) \approx \frac{1}{N} \sum_{i=1}^N \nablaphi \log\pi_\phi(\tau_i) r(\tau_i). \]
Approach: subtract a baseline \(b\) from the reward signal \[ \nablaphi L(\phi) \approx \frac{1}{N} \sum_{i=1}^N \nablaphi \log\pi_\phi(\tau_i) \rbracket{r(\tau_i) - b}, \] where \(b=\frac{1}{N} \sum_{i=1}^N r(\tau_i)\).
Subtracting any constant \(b\) is unbiased in expectation
Proof: Start with the exact formulation of the policy gradient following \(\eqref{eq:PG_policy_gradient}\), \[ \nablaphi L(\phi) = \Expsub{\nablaphi \log p_\phi(\tau) \cbracket{r(\tau) - b} }{\tau\sim p_\phi(\tau)}. \] Unbiased in expectation \(\Rightarrow\) baseline term is zero in expectation: \[\begin{align*} &\Expsub{\nablaphi \log p_\phi(\tau) b }{\tau\sim p_\phi(\tau)} \fragment{ = \int \textcolor{blue}{p_\phi(\tau) \nablaphi \log p_\phi(\tau)} b \dtau } \\ &\stackrel{\eqref{eq:PG_log_identity}}{=} \int \textcolor{red}{\nablaphi p_\phi(\tau)} b \dtau \fragment{ = b \nablaphi \int p_\phi(\tau) \dtau } \fragment{ = \cbracket{\nablaphi 1} b } \fragment{ = 0. \qquad \square } \end{align*}\]
\[ \nablaphi L(\phi) = \Expsub{\nablaphi \log p_\phi(\tau) \cbracket{r(\tau) - b} }{\tau\sim p_\phi(\tau)}. \]
How do we find the optimal baseline?
\(\Rightarrow\) optimization: minimize the variance
\[ \Var{x} = \Exp{x^2} - \Exp{x}^2 \]
\[\begin{align*} \mathsf{var} &= \E_{\tau\sim p_\phi(\tau)}\big[(\underbrace{\nablaphi \log\, p_\phi(\tau)}_{=g(\tau)} \cbracket{r(\tau) - b})^2\big] - \cbracket{\Expsub{\nablaphi \log\, p_\phi(\tau) \cbracket{r(\tau) - \rcancel{b}} }{\tau\sim p_\phi(\tau)}}^2 \quad\text{(\textcolor{red}{unbiased baseline})} \\ \diff{\mathsf{var}}{b} &=\diff{\mathsf{}}{b} \Expsub{g(\tau)^2 \cbracket{r(\tau) - b}^2}{\tau\sim p_\phi(\tau)} \fragment{ = \diff{\mathsf{}}{b} \cbracket{\Expsub{g(\tau)^2 r(\tau)^2}{\tau\sim p_\phi(\tau)} - 2\Expsub{g(\tau)^2 r(\tau)b}{\tau\sim p_\phi(\tau)} + \Expsub{g(\tau)^2 b^2}{\tau\sim p_\phi(\tau)} } } \\ &= \diff{\mathsf{}}{b} \cbracket{\cancel{\Expsub{g(\tau)^2 r(\tau)^2}{\tau\sim p_\phi(\tau)}} - 2b\Expsub{g(\tau)^2 r(\tau)}{\tau\sim p_\phi(\tau)} + b^2 \Expsub{g(\tau)^2 }{\tau\sim p_\phi(\tau)} }\\ &= - 2\Expsub{g(\tau)^2 r(\tau)}{\tau\sim p_\phi(\tau)} +2 b \Expsub{g(\tau)^2 }{\tau\sim p_\phi(\tau)} \fragment{\stackrel{!}{=}0.} \end{align*}\]
The optimal baseline \(b^*\) is the expected reward, weighted by gradient magnitudes: \[ b^* = \frac{\Expsub{g(\tau)^2 r(\tau)}{\tau\sim p_\phi(\tau)}}{\Expsub{g(\tau)^2 }{\tau\sim p_\phi(\tau)}}. \]
Note: \(b^*\) is hard to calculate. In practice: usually take average reward \[b=\Expsub{r(\tau)}{\tau\sim p_\phi(\tau)}\approx\frac{1}{N} \sum_{i=1}^N r(\tau_i).\]
\[ \nablaphi L(\phi) \approx \frac{1}{N} \sum_{i=1}^N \cbracket{\sum_{t=0}^{T-1} \nablaphi \log\pi_\phi\agivenb{a_{i,t}}{s_{i,t}}}\cbracket{\sum_{t'=0}^{T-1}r_{i,t'}}.\]
Inspired by Sergey Levine’s CS285 lecture.
💡 Do not confuse causality with the Markov property! The Markov propery (which may hold for a system, but does not have to) says that your future states do not depend on past states, just the present state. Causality is always true: “Rewards in the past are independent of decisions in the present.”
The trouble in Deep RL:
How do we calculate the expectation w.r.t. a different distribution? \[\begin{align*} \Expsub{f(x)}{x\sim p} &= \int p(x) f(x) \dx \\ &= \int \frac{q(x)}{q(x)} p(x) f(x) \dx \\ &= \int q(x) \frac{p(x)}{q(x)} f(x) \dx \\ &= \Expsub{\frac{p(x)}{q(x)} f(x)}{x\sim q}. \\ \end{align*}\] đź’ˇ This formula is exact!
\[p_\phi(\tau) \nablaphi \log p_\phi(\tau) = \nablaphi p_\phi(\tau)\]
đź’ˇ As derived earlier, using \(p_{\phi'}(\tau)\) or \(\pi_{\phi'}(\tau)\) (i.e., the product over action probabilities along the trajectories) yields the same result \[\nablaphi \log p_\phi(\tau) = \cancel{\nablaphi \log\, p(s_0)} + \sum_{t=0}^{T-1} \nablaphi \log\pi_\phi\agivenb{a_t}{s_t} + \cancel{\nablaphi \log\, \pC{s_{t+1}}{s_t,a_t}} = \nablaphi \log\pi_\phi(\tau).\]
\[ \nablaphi L(\phi) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \nablaphi \log\,\pi_\phi\agivenb{a_{i,t}}{s_{i,t}}\underbrace{\hat{Q}_{i,t}}_{= \sum_{t'=t}^{T-1}r_{i,t'}}.\]
Pseudocode example
# Define a simple policy network (actor)
class PolicyNetwork(nn.Module):
def __init__(self, state_dim, action_dim):
super(PolicyNetwork, self).__init__()
self.fc = nn.Sequential(
nn.Linear(state_dim, 64),
nn.ReLU(),
nn.Linear(64, action_dim)
)
def forward(self, state):
logits = self.fc(state)
return Categorical(logits=logits) # Outputs a policy distribution
# Initialize network and optimizer
policy = PolicyNetwork(state_dim=n, action_dim=m)
optimizer = optim.Adam(policy.parameters(), lr=0.01)
# Forward pass: get the action distribution for the current state
dist = policy(state)
# Calculate the log probability of the action that was actually taken
log_prob = dist.log_prob(action_taken)
# Define the Pseudo-Loss: -log_prob * Reward ("-" due to gradient descent)
pseudo_loss = -(log_prob * reward)
# Autodiff calculates the policy gradient
pseudo_loss.backward()
optimizer.step()