Deep Reinforcement Learning

Advanced Algorithms

Prof. Dr. Sebastian Peitz

Chair of Safe Autonomous Systems, TU Dortmund

Summer term 2026

Content

  • The evolution of modern RL algorithms
  • Part (II): Improving \(Q\)-learning
    • Deterministic policy gradient
    • Deep deterministic policy gradient (DDPG)
    • TD3
    • Soft Actor-Critic

Where are we?

Lecture contents
| Chapter | Topic | Content |
|---|---|---|
| Basics & tabular methods | | |
| 1-5 | Bandits, MDPs, Dynamic Programming, Monte Carlo, TD Learning | RL basics in finite dimensions |
| Deep-learning-based methods | | |
| 6 | Brief introduction to deep learning | The basics for what comes next |
| 7 | Value function approximation | Value estimation with function approximation |
| 8 | Deep Q-learning | Q-learning with neural networks |
| 9 | Policy gradients | Direct optimization of the policy |
| 10 | Actor-critic algorithms | Improved policy gradients via value functions |
| 11 | Advanced algorithms (Part I): From policy gradient to PPO | The evolution of modern RL algorithms |
| 12 | Advanced algorithms (Part II): From \(Q\)-learning to Soft Actor-Critic | The evolution of modern RL algorithms |
| Advanced Topics | | |

The evolution of modern RL algorithms

(I) Policy gradient methods

  • Policy is explicitly parameterized and optimized directly: \[L(\phi)=\Expsub{r(\tau)}{\tau\sim p_\phi(\tau)}.\]
  • Gradient ascent (we maximize the expected return): \(\phi \gets \phi + \alpha \nabla_\phi L(\phi)\).
  • Methods are typically on-policy.
  • Algorithms: REINFORCE \(\rightarrow\) Actor-Critic \(\rightarrow\) Natural Policy Gradient \(\rightarrow\) Trust-region policy optimization (TRPO) \(\rightarrow\) Proximal policy optimization (PPO).
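
To make the update rule concrete, here is a minimal sketch of score-function (REINFORCE-style) gradient ascent on a two-armed bandit. The softmax parameterization, the reward means, the noise level, and all function names are assumptions chosen for illustration and are not taken from the lecture.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(mean_rewards, alpha=0.1, steps=2000, seed=0):
    """Gradient ascent on the expected reward of a softmax policy.

    Assumed toy setting: one state, one action per episode, Gaussian reward noise.
    """
    rng = random.Random(seed)
    phi = [0.0] * len(mean_rewards)  # policy parameters (action preferences)
    for _ in range(steps):
        probs = softmax(phi)
        a = rng.choices(range(len(phi)), weights=probs)[0]
        r = mean_rewards[a] + rng.gauss(0.0, 0.5)  # noisy sampled reward
        for k in range(len(phi)):
            # d/dphi_k log pi(a) = 1[k == a] - pi(k) for a softmax policy
            grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
            # ascent step: phi <- phi + alpha * r * grad log pi(a)
            phi[k] += alpha * r * grad_log_pi
    return softmax(phi)


final_probs = reinforce_bandit([0.0, 1.0])
print(final_probs)
```

Each step uses a single sampled reward as the gradient estimate, which is exactly where the high variance of plain policy gradients comes from.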

(II) Value-based methods

  • Approximation of the \(Q\)-function (Bellman optimality equation with discount factor \(\gamma\)):
    \[Q^*(s,a) = r + \gamma \max_{a'\in\Ac}Q^*(s',a').\]
  • Extract implicit policy: \(\pi(s) = \arg\max_{a\in\Ac} Q^*(s,a)\).
  • Typically off-policy.
  • Algorithms: Q-learning \(\rightarrow\) DQN \(\rightarrow\) Deep deterministic policy gradient (DDPG) \(\rightarrow\) TD3 \(\rightarrow\) Soft actor-critic (SAC).
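
As a reminder of how the fixed-point equation turns into an algorithm, the following is a minimal sketch of tabular \(Q\)-learning with an implicit greedy policy. The four-state chain environment, the discount factor `gamma` (set `gamma = 1` for the undiscounted case), and the hyperparameters are hypothetical choices for illustration.

```python
import random

# Hypothetical chain MDP: states 0..3, state 3 is terminal.
# Actions: 0 = left, 1 = right. Reward 1 on entering the terminal state, else 0.
N_STATES, ACTIONS, TERMINAL = 4, (0, 1), 3


def step(s, a):
    """Deterministic transition of the toy chain."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    reward = 1.0 if s_next == TERMINAL else 0.0
    return s_next, reward


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        while s != TERMINAL:
            # epsilon-greedy behaviour policy (the method is off-policy)
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: Q[s][b])
            s_next, r = step(s, a)
            # bootstrapped target: r + gamma * max_a' Q(s', a');
            # the terminal row of Q is never updated and stays at 0
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q


Q = q_learning()
# implicit policy: pi(s) = argmax_a Q(s, a)
greedy_policy = [max(ACTIONS, key=lambda b: Q[s][b]) for s in range(TERMINAL)]
print(greedy_policy)
```

The `max` over `ACTIONS` is trivial here because the action set is finite. This is the step that becomes a problem in continuous action spaces.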

Shortcomings of (I) policy gradient methods

  • High variance gradients.
  • Unstable policy updates.
  • Catastrophic performance collapse.

Shortcomings of (II) value-based methods

  • Instability from bootstrapping.
  • Overestimation bias.
  • Maximization over actions / continuous action spaces.
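
The overestimation bias can be illustrated with a small numerical experiment. In this sketch, all true action values are assumed to be zero and the estimates are corrupted by zero-mean Gaussian noise. The function name and parameters are hypothetical.

```python
import random


def mean_of_max(n_actions=10, noise_std=1.0, trials=5000, seed=0):
    """Average of max_a Q_hat(a) when every true Q(a) equals 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # unbiased but noisy estimates of the (all-zero) true values
        estimates = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        total += max(estimates)
    return total / trials


bias = mean_of_max()
print(bias)  # clearly positive although max_a Q(a) = 0
```

Each individual estimate is unbiased, yet the maximum over them is biased upwards. With bootstrapping, this error is fed back into the targets.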

Improvement strategy for (I) policy gradient methods

  • How to safely update policies.

Improvement strategy for (II) value-based methods

  • Stabilizing \(Q\)-learning with function approximation.
  • Solving the continuous argmax problem.
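
To see why the argmax is a problem of its own for continuous actions, consider the sketch below. The maximization over \(a\) can no longer be done by enumeration and becomes an inner optimization, here solved by gradient ascent on the action. The quadratic critic `q_value` is a hypothetical stand-in for a learned \(Q\)-function, and the step size and iteration count are arbitrary.

```python
def q_value(s, a):
    """Hypothetical critic, maximized at a = 0.5 * s."""
    return -(a - 0.5 * s) ** 2


def dq_da(s, a):
    """Analytic derivative of the toy critic with respect to the action."""
    return -2.0 * (a - 0.5 * s)


def argmax_by_gradient_ascent(s, a0=0.0, lr=0.1, steps=200):
    """Approximate argmax_a Q(s, a) by following dQ/da."""
    a = a0
    for _ in range(steps):
        a += lr * dq_da(s, a)
    return a


a_star = argmax_by_gradient_ascent(s=2.0)
print(a_star)
```

Running such an inner loop for every state in every update is expensive. One way around it is to learn a function that outputs the maximizing action directly, which is the idea behind a deterministic actor.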

Part (II): Improving \(Q\)-learning

Deterministic policy gradients

Deep deterministic policy gradient (DDPG)

TD3

Soft actor-critic (SAC)
