Prof. Dr. Sebastian Peitz
Chair of Safe Autonomous Systems, TU Dortmund
find a policy that maximizes the sum of future rewards!
All the pieces of our RL framework are generally subject to some randomness
\(\Rightarrow\) we need to deal with random variables
A random variable \(X\) is a measurable function \(X: \Omega \rightarrow E\) from a sample space \(\Omega\) as a set of possible outcomes to a measurable space \(E\) (i.e., we can assign a number to each outcome).
The realization \(x\) is the value of \(X(\omega)\) after a single experiment \(\omega\in\Omega\).
We can now assign probability functions \(p\) to random variables. \(p(X=x)\) denotes the probability of the random variable \(X\) having the realization \(x\).
The max-of-two-rolls example
In summary, we obtain a probability distribution over \(X\), i.e., \(p(X)\).
\(\Rightarrow\) \(X\) is distributed according to \(p(X)\): \[X\sim p(X).\]
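The max-of-two-rolls example can be sanity-checked empirically. The sketch below (variable names are illustrative) simulates \(X(\omega) = \max\) of two fair dice and estimates \(p(X=x)\); analytically, \(p(X=x) = (2x-1)/36\).

```python
import random
from collections import Counter

random.seed(0)

# Simulate the experiment: the realization x = X(omega) is the max of two rolls.
samples = [max(random.randint(1, 6), random.randint(1, 6)) for _ in range(100_000)]

# Empirical distribution p(X = x); should approach (2x - 1)/36 for x = 1..6.
counts = Counter(samples)
p_hat = {x: counts[x] / len(samples) for x in sorted(counts)}
```

Running many experiments and counting realizations is exactly how a distribution over \(X\) is estimated in practice.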
Simplified notation
💡 In the literature, people sometimes use \(p(x)\) to also denote the probability distribution over all possible values of \(x\)!
\(\Rightarrow\) \(x\) is distributed according to \(p\): \[x\sim p \quad / \quad x\sim p(x).\]
In the literature, we also often find capital letters for state, action, and reward to properly account for their probabilistic nature.
| | All states observable | Not all states observable |
|---|---|---|
| No actions | Markov chain | Hidden Markov model |
| With actions | Markov decision process (MDP) | Partially observable MDP |
A finite Markov chain is a tuple \((\Sc, p)\), where
This is a specific stochastic process.
\(^*\) \(p\agivenb{s'}{s}\) meaning \(p\agivenb{S_{t+1}=s'}{S_t=s}\) in extended notation.
Given a Markov state \(s\) and its successor \(s'\), the state transition probabilities for all \((s,s') \in \Sc\times\Sc\) are collected in the matrix \[ P=\begin{bmatrix} p_{1,1} & p_{1,2} & \dots & p_{1,n} \\ p_{2,1} & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & p_{n-1,n} \\ p_{n,1} & \dots & p_{n,n-1} & p_{n,n} \end{bmatrix}. \]
\(s\in\{1,2,3,4\}=\{\text{Small},\text{Medium},\text{Large},\text{Gone}\}\)
\[ P = \fragment{\begin{bmatrix} 0 & 1-\alpha & 0 & \alpha \\ 0 & 0 & 1-\alpha & \alpha \\ 0 & 0 & 1-\alpha & \alpha \\ 0 & 0 & 0 & 1 \end{bmatrix}} \]
Source: Oliver Wallscheid's RL course
Possible samples for the given Markov chain example starting at \(s=1\) (small tree):
A finite Markov reward process (MRP) is a tuple \((\Sc, p, \textcolor{red}{r}, \textcolor{red}{\gamma})\), where
In this example, the reward is deterministic, but in general it is a random variable.
The return \(g_t\) is the discounted sum of future rewards, starting from step \(t\): \[ g_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}. \]
Exemplary samples for \(g\) with \(\gamma = 0.5\), starting with the planting (\(s=1\)): \[ \begin{align*} s &= 1 \rightarrow 4, & g &= 0, \\ s &= 1 \rightarrow 2 \rightarrow 4, & g &= 0 + 0.5 \cdot 1 = 0.5, \\ s &= 1 \rightarrow 2 \rightarrow 3 \rightarrow 4, & g &= 0 + 0.5 \cdot 1 + 0.25 \cdot 2 = 1, \\ s &= 1 \rightarrow 2 \rightarrow 3 \rightarrow 3 \rightarrow 4, & g &= 0 + 0.5 \cdot 1 + 0.25 \cdot 2 + 0.125 \cdot 3 = 1.375. \end{align*} \]
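These sample returns can be reproduced with a small helper (a sketch; the reward sequences below are read off the trajectories above):

```python
def discounted_return(rewards, gamma=0.5):
    """g_t = sum_k gamma^k * r_{t+k}, computed backwards over a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Reward sequences from the sample trajectories (gamma = 0.5):
discounted_return([0])           # 1 -> 4                : 0.0
discounted_return([0, 1])        # 1 -> 2 -> 4           : 0.5
discounted_return([0, 1, 2])     # 1 -> 2 -> 3 -> 4      : 1.0
discounted_return([0, 1, 2, 3])  # 1 -> 2 -> 3 -> 3 -> 4 : 1.375
```

The backward loop is the finite-horizon analogue of the recursion \(g_t = r_t + \gamma g_{t+1}\) used in the Bellman equation below.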
The state-value function \(V(s)\) (or simply value function) of an MRP is the expected return starting from state \(s\) at time \(t\): \[ V(s) = \ExpC{g_t}{s_t = s}. \]
Challenge: How to calculate all state values in closed form?
Solution: The Bellman equation.
\[ \begin{align*} V(s_t) &= \ExpC{g_t}{s_t} \\ &= \ExpC{r_{t} + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots}{s_t} \\ &= \ExpC{r_{t} + \gamma \left(r_{t+1} + \gamma r_{t+2} + \ldots\right)}{s_t} \\ &= \ExpC{r_{t} + \gamma g_{t+1}}{s_t} \\ &= \ExpC{r_{t} + \gamma V(s_{t+1})}{s_t} \\ &= \ExpC{r_{t}}{s_t} + \gamma\ExpC{V(s_{t+1})}{s_t} \end{align*} \]
Assuming a known expected reward \(\hat r\) for every state \(s\in\Sc\), \[ \hat r_\Sc = \begin{bmatrix} \ExpC{r}{s_1} & \ldots & \ExpC{r}{s_n} \end{bmatrix}^\top = \begin{bmatrix} \hat r_1 & \ldots & \hat r_n \end{bmatrix}^\top, \]
for a finite number of \(n\) states with unknown state values \[ V_\Sc = \begin{bmatrix} V(s_1) & \ldots & V(s_n) \end{bmatrix}^\top = \begin{bmatrix} V_1 & \ldots & V_n \end{bmatrix}^\top, \]
we can derive a linear equation system:
\[ \begin{align*} V_\Sc &= \hat r_\Sc + \gamma P V_\Sc \\ \begin{bmatrix} V_1 \\ \vdots \\ V_n \end{bmatrix} &= \begin{bmatrix} \hat r_1 \\ \vdots \\ \hat r_n \end{bmatrix} + \gamma \begin{bmatrix} p_{1,1} & \ldots & p_{1,n} \\ \vdots & & \vdots \\ p_{n,1} & \ldots & p_{n,n}\end{bmatrix} \begin{bmatrix} V_1 \\ \vdots \\ V_n \end{bmatrix} \end{align*} \]
\[ \begin{align*} V_\Sc &= \hat r_\Sc + \gamma P V_\Sc \\ \Leftrightarrow \quad (I-\gamma P) V_\Sc &= \hat r_\Sc \end{align*} \]
Let's set the disaster probability to \(\alpha = 0.2\) and the discount factor to \(\gamma=0.8\).
\[P = \begin{bmatrix} 0 & 0.8 & 0 & 0.2 \\ 0 & 0 & 0.8 & 0.2 \\ 0 & 0 & 0.8 & 0.2 \\ 0 & 0 & 0 & 1 \end{bmatrix}\]
\[ \begin{align*} \hat r_1 &= 0.2 \cdot 0 + 0.8 \cdot 1 &= 0.8 \\ \hat r_2 &= 0.2 \cdot 0 + 0.8 \cdot 2 &= 1.6 \\ \hat r_3 &= 0.2 \cdot 0 + 0.8 \cdot 3 &= 2.4 \\ \hat r_4 &= 1 \cdot 0 &= 0 \\ \end{align*} \]
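Solving the linear system \((I-\gamma P) V_\Sc = \hat r_\Sc\) for these numbers takes a few lines of NumPy (a sketch with the values above):

```python
import numpy as np

gamma, alpha = 0.8, 0.2

# Transition matrix and expected rewards from the tree example.
P = np.array([
    [0, 1 - alpha, 0, alpha],
    [0, 0, 1 - alpha, alpha],
    [0, 0, 1 - alpha, alpha],
    [0, 0, 0, 1],
])
r_hat = np.array([0.8, 1.6, 2.4, 0.0])

# Solve (I - gamma * P) V = r_hat for the state values.
V = np.linalg.solve(np.eye(4) - gamma * P, r_hat)
```

Since \(\gamma < 1\) and \(P\) is a stochastic matrix, \(I - \gamma P\) is always invertible, so the solve cannot fail; the absorbing state "Gone" correctly receives the value \(0\).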
A finite Markov decision process (MDP) is a tuple \((\Sc, \textcolor{red}{\Ac}, p, r, \gamma)\), where
In an MDP environment, a policy is a distribution over actions given states: \[ \pi\agivenb{a_t}{s_t} = p\agivenb{a_t}{s_t}.^* \]
\(^*\) In extended notation: \(p\agivenb{A_t=a_t}{S_t=s_t}\).
Given a finite MDP \(\langle\Sc, \Ac, p, r, \gamma\rangle\) and a policy \(\pi\):
The state-value function of an MDP is the expected return starting in \(s\) following policy \(\pi\): \[\begin{equation}V^\pi(s) = \ExpCsub{g_t}{s_t=s}{\pi} = \ExpCsub{\sum_{k=0}^{\infty}\gamma^k r_{t+k}}{s_t=s}{\pi}.\label{eq:MDP_state-value-function}\end{equation}\]
The action-value function of an MDP is the expected return starting in \(s\), taking action \(a\) and then following policy \(\pi\): \[\begin{equation} Q^\pi(s,a) = \ExpCsub{g_t}{s_t=s, a_t=a}{\pi} = \ExpCsub{\sum_{k=0}^{\infty}\gamma^k r_{t+k}}{s_t=s,a_t=a}{\pi}. \label{eq:MDP_action-value-function}\end{equation}\]
Analogously to MRPs, the state value of an MDP can be decomposed into its Bellman form: \[ V^\pi(s_t) = \ExpCsub{r_{t}+\gamma V^\pi(s_{t+1})}{s_t}{\pi}. \]
In finite MDPs, the state value can be directly linked to the action value: \[ V^\pi(s_t) = \sum_{a\in\Ac} \pi\agivenb{a}{s_t} Q^\pi(s_t,a).\]
Likewise, the action value of an MDP can be decomposed into its Bellman form: \[ Q^\pi(s_t, a_t) = \ExpCsub{r_{t}+\gamma Q^\pi(s_{t+1}, a_{t+1})}{s_t,a_t}{\pi}. \]
In finite MDPs, the action value can be directly linked to the state value: \[\begin{equation} Q^\pi(s_t, a_t) = \ExpC{r_t}{s_t, a_t} + \gamma \sum_{s_{t+1}\in\Sc} p\agivenb{s_{t+1}}{s_t, a_t} V^\pi(s_{t+1}). \label{eq:MDP_BellmanQ} \end{equation}\]
Let's assume the following very simple policy ("fifty-fifty"): \[ \pi\agivenb{a=\text{cut}}{s} = 0.5, \qquad \pi\agivenb{a=\text{wait}}{s} = 0.5 \qquad \forall s\in\Sc. \]
Applied to the given environment behavior \[ \begin{align*} p\agivenb{s'}{s,a=\text{cut}} &= \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix}, & \quad p\agivenb{s'}{s,a=\text{wait}} &= \begin{bmatrix} 0 & 1-\alpha & 0 & \alpha \\ 0 & 0 & 1-\alpha & \alpha \\ 0 & 0 & 1-\alpha & \alpha \\ 0 & 0 & 0 & 1 \end{bmatrix} \\ \hat r^{a=\text{cut}}&=\begin{bmatrix} 1 & 2 & 3 & 0 \end{bmatrix}^\top, & \qquad \hat r^{a=\text{wait}}&=\begin{bmatrix} 0 & 0 & 1 & 0 \end{bmatrix}^\top, \end{align*} \]
We get \[p^\pi\agivenb{s'}{s} = \begin{bmatrix} 0 & \frac{1-\alpha}{2} & 0 & \frac{1+\alpha}{2} \\ 0 & 0 & \frac{1-\alpha}{2} & \frac{1+\alpha}{2} \\ 0 & 0 & \frac{1-\alpha}{2} & \frac{1+\alpha}{2} \\ 0 & 0 & 0 & 1 \end{bmatrix}.\]
Using the Bellman expectation from \(\eqref{eq:MDP_BellmanQ}\), the action values can be calculated directly:
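One way to carry out this calculation numerically (a sketch, reusing \(\alpha = 0.2\) and \(\gamma = 0.8\) from the earlier example): first solve for \(V^\pi\) under the averaged dynamics, then plug into the Bellman equation for \(Q^\pi\).

```python
import numpy as np

gamma, alpha = 0.8, 0.2  # values from the MRP example

# Per-action dynamics and expected rewards from the slides.
P_cut = np.array([[0, 0, 0, 1]] * 4, dtype=float)
P_wait = np.array([
    [0, 1 - alpha, 0, alpha],
    [0, 0, 1 - alpha, alpha],
    [0, 0, 1 - alpha, alpha],
    [0, 0, 0, 1],
])
r_cut = np.array([1.0, 2.0, 3.0, 0.0])
r_wait = np.array([0.0, 0.0, 1.0, 0.0])

# Fifty-fifty policy: average dynamics and rewards over both actions.
P_pi = 0.5 * (P_cut + P_wait)
r_pi = 0.5 * (r_cut + r_wait)

# State values from the Bellman expectation equation: (I - gamma P_pi) V = r_pi.
V = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)

# Action values: Q(s,a) = E[r | s,a] + gamma * sum_s' p(s'|s,a) V(s').
Q_cut = r_cut + gamma * P_cut @ V
Q_wait = r_wait + gamma * P_wait @ V
```

Averaging \(p\) and \(\hat r\) over the policy reduces the MDP to the MRP \(p^\pi\agivenb{s'}{s}\) shown above, so the same linear solve applies.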
The optimal state-value function of an MDP is the maximum state-value function over all policies: \[ V^*(s) = \max_\pi V^\pi(s) \qquad \forall s\in\Sc. \]
The optimal action-value function of an MDP is the maximum action-value function over all policies: \[ Q^*(s,a) = \max_\pi Q^\pi(s,a) \qquad \forall s\in\Sc, a\in\Ac(s). \]
Define a partial ordering over policies \[\pi\geq\pi'\qquad \text{if} \qquad V^\pi(s) \geq V^{\pi'}(s)\quad\forall s\in\Sc.\]
For any finite MDP
An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
Start with \(V^*(s=4)\) and continue backwards \[ \begin{align*} V^*(4) &= 0 \\ V^*(3) &= \max \begin{cases} 1 + \gamma \left[ (1-\alpha) V^*(3) + \alpha V^*(4) \right] \\ 3 + \gamma V^*(4) \end{cases} \\ &= \max \begin{cases} 1 + \gamma \left[ (1-\alpha) V^*(3)\right] \\ 3 \end{cases} \\ V^*(2) &= \max \begin{cases} 0 + \gamma \left[ (1-\alpha) V^*(3) + \alpha V^*(4) \right] \\ 2 + \gamma V^*(4) \end{cases} \\ &= \max \begin{cases} \gamma \left[ (1-\alpha) V^*(3) \right] \\ 2 \end{cases} \\ V^*(1) &= \max \begin{cases}\gamma \left[ (1-\alpha) V^*(2) \right] \\ 1 \end{cases} \\ \end{align*} \]
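The same backward calculation can be reproduced by value iteration, i.e., repeatedly applying the Bellman optimality operator. A sketch using the per-action dynamics and rewards from the policy-evaluation example, with \(\alpha = 0.2\) and \(\gamma = 0.8\):

```python
import numpy as np

gamma, alpha = 0.8, 0.2

# Per-action transition matrices and rewards from the earlier slides.
P = {
    "cut": np.array([[0, 0, 0, 1]] * 4, dtype=float),
    "wait": np.array([
        [0, 1 - alpha, 0, alpha],
        [0, 0, 1 - alpha, alpha],
        [0, 0, 1 - alpha, alpha],
        [0, 0, 0, 1],
    ]),
}
r = {"cut": np.array([1.0, 2.0, 3.0, 0.0]),
     "wait": np.array([0.0, 0.0, 1.0, 0.0])}

# Value iteration: V <- max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ].
V = np.zeros(4)
for _ in range(1000):
    V_new = np.maximum.reduce([r[a] + gamma * P[a] @ V for a in P])
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new
```

With these numbers the iteration recovers the backward result: cutting is optimal in states 2 and 3 (\(V^*(2)=2\), \(V^*(3)=3\)), while waiting is optimal in state 1 (\(V^*(1)=\gamma(1-\alpha)V^*(2)=1.28\)).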