Deep Reinforcement Learning

Notation

Prof. Dr. Sebastian Peitz

Chair of Safe Autonomous Systems, TU Dortmund

Summer term 2026

Sets are written in calligraphic upper-case, \(\Ac, \Sc, \Oc,\dots\).
States, actions, or other elements of sets are written as lower-case letters, \(s,a,o,\dots\).
The letter \(t\) is reserved for the iterator along a trajectory's temporal dimension; its maximal value is denoted by \(T\). Trajectories start at \(t = 0\).
We reserve \(i,j,k,n,m\) for other iterators; their maximal values are denoted by the corresponding upper-case letters, i.e., \(I,J,K,N,M\).

An MDP is a tuple \((\Sc, \Ac, p, r, \gamma)\). It consists of a set of states \(\Sc\) and a set of actions \(\Ac\). The transition distribution (discrete case) or density (continuous case) is denoted by \(p\agivenb{s'}{s,a}\). Finally, we have a reward function \(r : \Sc \times \Ac \rightarrow \R\) and a discount factor \(\gamma \in [0,1]\).
For a given MDP, a policy \(\pi\agivenb{a}{s}\) is either a distribution (discrete case) or a density (continuous case) over actions.
Outputs are elements of some set \(\Oc\), obtained by possibly implicit measurements of given state-action pairs, \(o_t = h(s_t,a_t)\), where \(h : \Sc \times \Ac \rightarrow \Oc\).
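As a rough illustration of the tuple \((\Sc, \Ac, p, r, \gamma)\) in the discrete case, the following Python sketch stores a finite MDP in plain lists. The class name `TabularMDP`, the method `check`, and all numbers are hypothetical choices made for this example; they are not part of the notation itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TabularMDP:
    """Finite MDP (S, A, p, r, gamma) with S = {0..n_states-1}, A = {0..n_actions-1}."""
    n_states: int
    n_actions: int
    p: list       # p[s][a][s2] = probability of next state s2 given (s, a)
    r: list       # r[s][a]     = reward for the state-action pair (s, a)
    gamma: float  # discount factor

    def check(self) -> bool:
        """Return True if every p(.|s,a) is a distribution and gamma lies in [0, 1]."""
        for s in range(self.n_states):
            for a in range(self.n_actions):
                row = self.p[s][a]
                if len(row) != self.n_states or min(row) < 0.0:
                    return False
                if abs(sum(row) - 1.0) > 1e-12:
                    return False
        return 0.0 <= self.gamma <= 1.0


# Hypothetical two-state, two-action example (numbers are illustrative only).
mdp = TabularMDP(
    n_states=2,
    n_actions=2,
    p=[[[0.9, 0.1], [0.2, 0.8]],
       [[0.5, 0.5], [0.0, 1.0]]],
    r=[[0.0, 1.0],
       [2.0, 0.0]],
    gamma=0.9,
)

# A stochastic policy stored as pi[s][a]; each row sums to one.
pi = [[0.5, 0.5],
      [0.25, 0.75]]
```

In this sketch, `mdp.p[s][a][s2]` plays the role of \(p\agivenb{s'}{s,a}\) and `pi[s][a]` that of \(\pi\agivenb{a}{s}\).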

Random variables with distribution (discrete case) or density (continuous case) \(q\) are declared by writing \(x \sim q\). E.g., \(s' \sim p\agivenb{\cdot}{s,a}\) or \(a \sim \pi\agivenb{\cdot}{s}\).
Trajectories are sequences of state-action pairs \[ \tau = (\tau_0, \tau_1,\dots) = ((s_0,a_0) , (s_1,a_1),\dots). \] If the trajectory is sampled according to some policy \(\pi\) (i.e., \(s_{t+1} \sim p\agivenb{\cdot}{s_t,a_t}\) and \(a_t \sim \pi\agivenb{\cdot}{s_t}\) for \(t \geq 0\)), this induces a distribution/density over trajectories, denoted by \(p^\pi\agivenb{\tau}{s_0}\), which is conditioned on the initial state \(s_0\). So \(\tau \sim p^\pi\agivenb{\cdot}{s_0}\) denotes a random trajectory starting from \(s_0\).
Instead of using index notation, one can also use \((s,a),(s',a'),(s'',a''),\dots\) to denote the start of a trajectory.
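The sampling rule behind \(\tau \sim p^\pi\agivenb{\cdot}{s_0}\) can be sketched for a small finite MDP as follows. The two-state transition table `p`, the policy table `pi`, and the helper `sample_trajectory` are assumptions made purely for illustration.

```python
import random

# Hypothetical tables: p[s][a][s2] = p(s2 | s, a), pi[s][a] = pi(a | s).
p = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
pi = [[0.5, 0.5],
      [0.25, 0.75]]


def sample_trajectory(s0, T, rng):
    """Sample tau = ((s_0, a_0), ..., (s_T, a_T)) starting from the given s_0."""
    tau = []
    s = s0
    for t in range(T + 1):
        # a_t ~ pi(. | s_t)
        a = rng.choices(range(len(pi[s])), weights=pi[s])[0]
        tau.append((s, a))  # tau_t = (s_t, a_t)
        if t < T:
            # s_{t+1} ~ p(. | s_t, a_t)
            s = rng.choices(range(len(p[s][a])), weights=p[s][a])[0]
    return tau


rng = random.Random(0)  # fixed seed so the sketch is reproducible
tau = sample_trajectory(s0=0, T=5, rng=rng)
print(tau)
```

The returned list has \(T+1\) entries because trajectories start at \(t = 0\).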

Expectations are denoted as \[ \E_{x \sim q} [f(x)] = \underbrace{\E_x[f(x)]}_{\text{if distr. clear}} = \underbrace{\E[f(x)]}_{\text{if R.V. clear}} = \begin{cases} \sum_{x \in \Xc} q(x) f(x) &\text{discrete} \\ \int_{\Xc} q(x) f(x) dx &\text{continuous} \end{cases} \] Similarly, the variance is defined by \[ \Var_{x \sim q}[f(x)] = \Var_x[f(x)] = \Var[f(x)] = \E_x[(f(x)-\E_{x'}[f(x')])^2]. \]
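For the discrete case, the two definitions above translate almost literally into code. The following Python sketch assumes a distribution \(q\) stored as a dictionary from elements of \(\Xc\) to probabilities; the names `expectation` and `variance` and the example numbers are hypothetical.

```python
def expectation(q, f):
    """E_{x ~ q}[f(x)] = sum over x of q(x) f(x), for a finite set of outcomes."""
    return sum(q[x] * f(x) for x in q)


def variance(q, f):
    """Var_{x ~ q}[f(x)] = E_x[(f(x) - E_{x'}[f(x')])^2]."""
    mean = expectation(q, f)
    return expectation(q, lambda x: (f(x) - mean) ** 2)


# Illustrative distribution on X = {0, 1, 2} and test function f(x) = x^2.
q = {0: 0.5, 1: 0.25, 2: 0.25}
f = lambda x: x * x

print(expectation(q, f))  # 0.5*0 + 0.25*1 + 0.25*4 = 1.25
print(variance(q, f))     # E[f^2] - E[f]^2 = 4.25 - 1.5625 = 2.6875
```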

The state-value function is defined as \[ V^\pi(s_0) := \E \left[ \sum_{t=0}^\infty \gamma^t r(s_t,a_t) \right], \quad\text{where}\quad s_{t+1} \sim p(\cdot |s_t,a_t), a_{t} \sim \pi(\cdot |s_t). \] Alternatively, we can write \[ V^\pi(s_0) = \E_{\tau \sim p^\pi(\cdot | s_0)}\left[\sum_{t=0}^\infty \gamma^t r(\tau_t)\right]. \]
The state-action value function can be defined as \[ Q^\pi(s,a) = \E_{s' \sim p(\cdot |s,a)}[ r(s,a) + \gamma V^\pi(s')]. \]
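One way to make these definitions concrete is a Monte Carlo estimate: average truncated discounted returns over sampled trajectories to approximate \(V^\pi(s_0)\), then plug the result into the formula for \(Q^\pi\). The Python sketch below does this for a hypothetical two-state MDP; the tables `p`, `r`, `pi`, the horizon `T`, and all function names are assumptions for illustration, and truncating the infinite sum at `T` introduces a small bias.

```python
import random

# Hypothetical finite MDP and policy (illustrative numbers only).
p = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]      # p[s][a][s2]
r = [[0.0, 1.0],
     [2.0, 0.0]]                    # r[s][a]
pi = [[0.5, 0.5],
      [0.25, 0.75]]                 # pi[s][a]
gamma = 0.9


def draw(weights, rng):
    """Sample an index with probability proportional to the given weights."""
    return rng.choices(range(len(weights)), weights=weights)[0]


def discounted_return(s0, T, rng):
    """One sample of sum_{t=0}^{T} gamma^t r(s_t, a_t) along a trajectory from s0."""
    g, discount, s = 0.0, 1.0, s0
    for _ in range(T + 1):
        a = draw(pi[s], rng)        # a_t ~ pi(. | s_t)
        g += discount * r[s][a]
        discount *= gamma
        s = draw(p[s][a], rng)      # s_{t+1} ~ p(. | s_t, a_t)
    return g


def mc_value(s0, episodes, T, rng):
    """Average of sampled returns; truncation bias <= gamma**(T+1) * max|r| / (1 - gamma)."""
    return sum(discounted_return(s0, T, rng) for _ in range(episodes)) / episodes


def q_from_v(V):
    """Q(s, a) = E_{s' ~ p(.|s,a)}[r(s, a) + gamma V(s')] for every pair (s, a)."""
    n = len(V)
    return [[r[s][a] + gamma * sum(p[s][a][s2] * V[s2] for s2 in range(n))
             for a in range(len(r[s]))] for s in range(n)]


rng = random.Random(0)
V_hat = [mc_value(s0, episodes=500, T=80, rng=rng) for s0 in range(2)]
Q_hat = q_from_v(V_hat)
print(V_hat, Q_hat)
```

For these particular numbers the expected one-step reward under \(\pi\) is \(0.5\) in both states, so \(V^\pi(s) = 0.5/(1-\gamma) = 5\) exactly, which gives a reference value for judging the estimate.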

We have the Bellman equations \[\begin{align*} V^\pi(s) &= \E_{a \sim \pi(\cdot | s), s' \sim p(\cdot | s, a)}[ r(s,a) + \gamma V^\pi(s') ] &&=\E_{a,s'}[r(s,a)+\gamma V^\pi(s')], \\ Q^\pi(s,a) &= \E_{s' \sim p(\cdot |s,a), a' \sim \pi(\cdot |s')}[ r(s,a) + \gamma Q^\pi(s',a')] &&= \E_{s',a'}[r(s,a) + \gamma Q^\pi(s',a')]. \end{align*}\]
Optimal value functions are denoted by \(V^*\) or \(Q^*\). They satisfy the Bellman optimality equations \[\begin{align*} V^*(s) &= \max_{a \in \Ac} \E_{s'}[r(s,a)+\gamma V^*(s')], \\ Q^*(s,a) &= \E_{s'}[ r(s,a) + \gamma \max_{a' \in \Ac} Q^*(s',a')]. \end{align*}\] An optimal policy is denoted by \(\pi^*\), satisfying \(V^* = V^{\pi^*}\).
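The Bellman optimality equations can be checked numerically on a small example. The Python sketch below uses value iteration (repeatedly applying the right-hand side of the equation for \(V^*\)) on a hypothetical two-state MDP, reads off a greedy policy from \(Q^*\), and evaluates that policy to confirm \(V^* = V^{\pi^*}\) up to numerical tolerance. Value iteration is one standard way to compute \(V^*\) and is chosen here only for illustration; all names and numbers are assumptions.

```python
# Hypothetical finite MDP (illustrative numbers only).
p = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]      # p[s][a][s2]
r = [[0.0, 1.0],
     [2.0, 0.0]]                    # r[s][a]
gamma = 0.9
S, A = range(2), range(2)


def q_from_v(V):
    """Q(s, a) = E_{s'}[r(s, a) + gamma V(s')]."""
    return [[r[s][a] + gamma * sum(p[s][a][s2] * V[s2] for s2 in S) for a in A]
            for s in S]


def value_iteration(tol=1e-10, max_iter=10_000):
    """Iterate V <- max_a E_{s'}[r + gamma V(s')] until successive iterates agree."""
    V = [0.0 for _ in S]
    for _ in range(max_iter):
        Q = q_from_v(V)
        V_new = [max(Q[s]) for s in S]
        if max(abs(V_new[s] - V[s]) for s in S) < tol:
            return V_new
        V = V_new
    return V


def evaluate_deterministic(policy, tol=1e-10, max_iter=10_000):
    """Iterate the Bellman equation for a deterministic policy s -> policy[s]."""
    V = [0.0 for _ in S]
    for _ in range(max_iter):
        Q = q_from_v(V)
        V_new = [Q[s][policy[s]] for s in S]
        if max(abs(V_new[s] - V[s]) for s in S) < tol:
            return V_new
        V = V_new
    return V


V_star = value_iteration()
Q_star = q_from_v(V_star)
# Greedy policy: pick an action attaining max_a Q*(s, a) in each state.
pi_star = [max(A, key=lambda a: Q_star[s][a]) for s in S]
V_pi_star = evaluate_deterministic(pi_star)

print(V_star, pi_star, V_pi_star)
```

Because \(\gamma < 1\) here, both iterations are contractions and converge; for \(\gamma = 1\) this sketch would need a different stopping argument.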

Trainable parameters are denoted by \(\theta, \psi, \phi\). Dependence on \(\theta\) for function approximators is denoted via subscript notation, i.e., \(V^\pi_\theta\), \(Q^\pi_\theta\), \(\pi_\theta\), etc. Differentiation w.r.t. a parameter is denoted by \(\nabla_\theta (\cdot )\).
Objective functions (for instance in the policy gradient algorithm) are denoted by \(J\).
Sequences of value functions (or their approximations) are denoted by \(V_0,V_1,V_2,\dots\), for example when \(V_n \rightarrow V^\pi\) as \(n \rightarrow \infty\). (The iterator is thus written as a subscript.) Similarly, sequences of trainable parameters are denoted by \(\theta_0, \theta_1,\theta_2,\dots\).
Datasets are denoted by \(\Dc\), containing either states, state-action pairs, or other content gathered while sampling from an MDP.
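To illustrate the subscript convention for sequences, the Python sketch below generates \(V_0, V_1, V_2, \dots\) by repeatedly applying the Bellman equation for a fixed policy \(\pi\) and tracks the sup-norm distance to \(V^\pi\). The two-state MDP, the policy, and the names `bellman_backup`, `V_seq`, and `errors` are hypothetical; for these numbers the expected one-step reward under \(\pi\) is \(0.5\) in both states, so \(V^\pi = 0.5/(1-\gamma) = 5\) is known in closed form.

```python
# Hypothetical finite MDP and policy (illustrative numbers only).
p = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]      # p[s][a][s2]
r = [[0.0, 1.0],
     [2.0, 0.0]]                    # r[s][a]
pi = [[0.5, 0.5],
      [0.25, 0.75]]                 # pi[s][a]
gamma = 0.9


def bellman_backup(V):
    """One application of V -> E_{a ~ pi, s' ~ p}[r(s, a) + gamma V(s')]."""
    n = len(V)
    return [
        sum(pi[s][a] * (r[s][a] + gamma * sum(p[s][a][s2] * V[s2] for s2 in range(n)))
            for a in range(len(pi[s])))
        for s in range(n)
    ]


# V_seq[n] is the n-th iterate V_n, starting from V_0 = 0.
V_seq = [[0.0, 0.0]]
for n in range(100):
    V_seq.append(bellman_backup(V_seq[-1]))

V_pi = [5.0, 5.0]  # closed-form value for this specific example
errors = [max(abs(v - w) for v, w in zip(V_n, V_pi)) for V_n in V_seq]

print(errors[0], errors[10], errors[100])
```

In this example the error shrinks by a factor of \(\gamma\) per step, which is the sense in which \(V_n \rightarrow V^\pi\) as \(n \rightarrow \infty\).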