Prof. Dr. Sebastian Peitz
Chair of Safe Autonomous Systems, TU Dortmund
The term Dynamic Programming (DP) refers to two central ideas:
Further characteristics:
Here, we consider finite state and action spaces \(\Sc\) and \(\Ac\). In the last lecture, we learned about the return, the value function, and the Bellman equation
\[ \begin{eqnarray} g_t &=& r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k=0}^\infty \gamma^k r_{t+k+1}, \notag \\ V^\pi(s_t) &=& \ExpCsub{g_t}{s_t}{\pi} = \ExpCsub{r_{t+1}}{s_t}{\pi} + \gamma\ExpCsub{V^\pi(s_{t+1})}{s_t}{\pi}, \notag \\ V^\pi(s_t) &=& \ExpCsub{r_{t+1}+\gamma V^\pi(s_{t+1})}{s_t}{\pi}. \label{eq:DP_BellmanV} \end{eqnarray} \]
An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
In other/modern words: “the tail of an optimal sequence is optimal for the tail subproblem” (Bertsekas 2019)
From this, we derived the Bellman optimality equation (Bellman 1954): \[ \begin{align} V^*(s) &= \max_{a\in\Ac} \ExpC{r_{t+1}+\gamma V^*(s_{t+1})}{s_t = s, a_t = a} \notag \\ &= \max_{a\in\Ac} \sum_{s_{t+1}\in\Sc} p\agivenb{s_{t+1}}{s_t = s, a_t = a}\left[ r_{t+1} + \gamma V^*(s_{t+1}) \right]. \label{eq:DP_BellmanVstar} \end{align} \]
Similarly, we can derive an optimal Q-function: \[ \begin{align} Q^*(s,a) &= \ExpC{r_{t+1}+\gamma \max_{a'\in\Ac} Q^*(s_{t+1}, a')}{s_t = s, a_t = a} \notag \\ &= \sum_{s_{t+1}\in\Sc} p\agivenb{s_{t+1}}{s_t = s, a_t = a}\left[ r_{t+1} + \gamma \max_{a'\in\Ac} Q^*(s_{t+1}, a') \right]. \label{eq:DP_BellmanQstar} \end{align} \]
Central idea of DP: turn these equations into algorithmic assignments.
Question: In this example, what is (in the context of RL)
In all chapters, we will distinguish between two core learning tasks in reinforcement learning:
Let’s start with the prediction problem: For a given policy \(\pi\), how do we compute \(V^\pi\)?
Let’s revisit \(\eqref{eq:DP_BellmanV}\):
\[ \begin{equation} V^\pi(s) = \ExpCsub{r+\gamma V^\pi(s')}{s}{\pi} \fragment{= \sum_{a\in\Ac} \pias \sum_{s'\in\Sc} \psprimesa \left[ r + \gamma V^\pi(s') \right]} \label{eq:DP_BellmanV2} \end{equation} \]
Existence and uniqueness of \(V^\pi\) are guaranteed as long as either \(\gamma < 1\) or eventual termination is guaranteed from all states under the policy \(\pi\): for \(\gamma < 1\), the right-hand side of \(\eqref{eq:DP_BellmanV2}\) is a \(\gamma\)-contraction in the sup-norm, so Banach's fixed-point theorem applies.
Use the Bellman equation \(\eqref{eq:DP_BellmanV2}\) as an update rule: \[\begin{equation} V_{\textcolor{red}{k+1}}(s) = \ExpCsub{r+\gamma V_{\textcolor{red}{k}}(s')}{s}{\pi} = \sum_{a\in\Ac} \pias \sum_{s'\in\Sc} \psprimesa \left[ r + \gamma V_{\textcolor{red}{k}}(s') \right] \label{eq:DP_IterativePolicyEvaluation} \end{equation}\]
A more memory-efficient version (that tends to converge faster): in-place updates!
Input: policy \(\pi\)
Parameters: a small threshold \(\theta > 0\) determining accuracy of estimation
Initialize: \(V(s)\) arbitrarily for \(s \in \Sc\), \(V(\mathsf{terminal}) = 0\)
while (True):
\(\quad\) \(\Delta=0\)
\(\quad\) for \(s \in \Sc\):
\(\quad\quad\) \(V_{\mathsf{old}}(s) = V(s)\)
\(\quad\quad\) \(V(s) = \sum_{a\in\Ac} \pias \sum_{s'\in\Sc} \psprimesa \left[ r + \gamma V(s') \right]\)
\(\quad\quad\) \(\Delta = \max(\Delta, \abs{V_{\mathsf{old}}(s)-V(s)})\)
\(\quad\) if \(\Delta < \theta\) then break
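The loop above can be sketched in NumPy. The array shapes `P[s, a, s']`, `R[s, a, s']`, `pi[s, a]` and the tiny two-state MDP are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """In-place iterative policy evaluation.
    P[s, a, s'] : transition probabilities
    R[s, a, s'] : rewards
    pi[s, a]    : policy probabilities
    """
    n_states, n_actions = P.shape[0], P.shape[1]
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_old = V[s]
            # V(s) = sum_a pi(a|s) sum_s' p(s'|s,a) [r + gamma V(s')]
            V[s] = sum(
                pi[s, a] * np.sum(P[s, a] * (R[s, a] + gamma * V))
                for a in range(n_actions)
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V

# Hypothetical 2-state MDP: action 0 stays put, action 1 switches state;
# switching from state 0 to 1 pays reward 1.
P = np.zeros((2, 2, 2))
R = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
R[0, 1, 1] = 1.0
pi = np.full((2, 2), 0.5)  # uniform random policy
V = policy_evaluation(P, R, pi, gamma=0.5)
```

For this toy MDP with \(\gamma = 0.5\), the iteration converges to \(V^\pi \approx (0.75, 0.25)\), which one can verify by hand from the Bellman equation.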
Many of you will know bootstrapping from supervised learning:
In reinforcement learning, bootstrapping has a different meaning:
\(^*\) i.i.d. = independent and identically distributed.
We have a small robot in a gridworld that wants to recharge.
\(\bullet\) Initial state \(s_0\): a random valid field.
\(\bullet\) Goal: reach the battery (\(r=1\), otherwise \(r=0\)).
\(\bullet\) \(\Ac=\set{\uparrow, \downarrow, \leftarrow, \rightarrow}\) (actions that would leave the grid or enter an invalid field \(\Rightarrow\) no movement).
\(\bullet\) \(\pi\agivenb{\cdot}{s} = [0.25, 0.25, 0.25, 0.25]^\top ~\forall~ s\in\Sc\).
\(\bullet\) Discount: \(\gamma = 0.8\)
Random policy
\(\bullet\) Terminal state: If the robot hits the battery, it receives \(r=1\) (potentially discounted) and the episode ends.
\(\bullet\) Optimal strategy: Move towards the battery as quickly as possible.
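For a fixed policy, the Bellman equation is linear in \(V^\pi\), so it can also be solved directly. A minimal sketch on a hypothetical \(2\times 2\) instance of such a gridworld (the layout and state numbering are assumptions for illustration, not the lecture's grid):

```python
import numpy as np

# Hypothetical 2x2 gridworld: states laid out as [[0, 1], [2, 3]],
# battery = terminal state 3, gamma = 0.8, uniform random policy.
n, gamma = 4, 0.8

def step(s, a):
    """Deterministic transition for actions 0-3 = up, down, left, right.
    Moves leaving the grid cause no movement; state 3 is absorbing."""
    if s == 3:
        return s, 0.0
    row, col = divmod(s, 2)
    row += (-1 if a == 0 else 1 if a == 1 else 0)
    col += (-1 if a == 2 else 1 if a == 3 else 0)
    if not (0 <= row < 2 and 0 <= col < 2):
        return s, 0.0                      # invalid move => no movement
    s2 = 2 * row + col
    return s2, (1.0 if s2 == 3 else 0.0)   # r = 1 only on reaching the battery

# Random policy pi(a|s) = 0.25: expected reward r_pi, transition matrix P_pi
P_pi, r_pi = np.zeros((n, n)), np.zeros(n)
for s in range(n):
    for a in range(4):
        s2, r = step(s, a)
        P_pi[s, s2] += 0.25
        r_pi[s] += 0.25 * r

# For a fixed policy the Bellman equation is linear: (I - gamma P_pi) V = r_pi
V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```

The direct solve gives the same \(V^\pi\) that iterative policy evaluation would approach in the limit; the states adjacent to the battery receive the highest values, as expected.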
\[\begin{align*} Q^\pi(s,a) &= \ExpCsub{r + \gamma V^\pi(s')}{s,a}{\pi} \\ &= \sum_{s'\in\Sc} \psprimesa \left[r + \gamma V^\pi(s')\right] . \end{align*}\]
\(\Rightarrow\) If \(Q^\pi(s,a)>V^\pi(s)\), then choosing \(a\) over \(\pi(s)\) in state \(s\) yields a better policy.
\(\Rightarrow\) This is a special instance of the policy improvement theorem.
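A small numerical sketch of this improvement step (the two-state MDP and all array names are illustrative assumptions): compute \(Q^\pi\) from \(V^\pi\) and act greedily wherever \(Q^\pi(s,a) > V^\pi(s)\).

```python
import numpy as np

# Illustrative 2-state MDP (an assumption, not the lecture's gridworld):
# action 0 stays put, action 1 switches state; switching 0 -> 1 pays reward 1.
gamma = 0.5
P = np.zeros((2, 2, 2))
R = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
R[0, 1, 1] = 1.0

# V^pi of the uniform random policy (obtainable via iterative policy evaluation)
V_pi = np.array([0.75, 0.25])

# Q^pi(s,a) = sum_s' p(s'|s,a) [r + gamma V^pi(s')]
Q_pi = np.einsum("sap,sap->sa", P, R + gamma * V_pi)

# Greedy improvement: wherever max_a Q^pi(s,a) > V^pi(s), switching to that
# action yields a strictly better policy
pi_greedy = Q_pi.argmax(axis=1)
improves = Q_pi.max(axis=1) > V_pi
```

Here the greedy action in both states is "switch", and \(Q^\pi(s, a_{\mathsf{greedy}}) > V^\pi(s)\) holds in both states, so the greedy policy strictly improves on the random one.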
Here we use arguments for deterministic policies (i.e., \(a=\pi(s)\)), but they also hold in the stochastic setting.
Theorem: Let \(\pi\) and \(\pi'\) be any pair of deterministic policies such that, for all \(s \in \Sc\), \[ \sum_{a\in\Ac} \pi'\agivenb{a}{s} Q^\pi(s,a) \geq V^\pi(s). \] Then \(V^{\pi'}(s) \geq V^\pi(s)\) for all \(s \in \Sc\).
Proof: \[ \begin{equation} V^\pi(s_t) \leq \sum_{a_t\in\Ac} \pi'\agivenb{a_t}{s_t} Q^\pi(s_t,a_t) \fragment{= \sum_{a_t\in\Ac} \pi'\agivenb{a_t}{s_t}\sum_{s_{t+1}\in\Sc} \pstplusstat \left[r_{t+1} + \gamma V^\pi(s_{t+1})\right] } \fragment{= \ExpCsub{r_{t+1}+\gamma V^\pi(s_{t+1})}{s_t}{\pi'}}\label{eq:DP_policy_improvement_theorem_1} \end{equation} \]
Since this holds for all \(s\in\Sc\), we can apply the same procedure at \(s_{t+1}\) \(\Rightarrow\) \(V^\pi(s_{t+1}) \leq \ExpCsub{r_{t+2}+\gamma V^\pi(s_{t+2})}{s_{t+1}}{\pi'}.\)
\[ \text{Substitute into \eqref{eq:DP_policy_improvement_theorem_1}:}\qquad V^\pi(s_t) \leq \ExpCsub{r_{t+1}+\gamma \ExpCsub{r_{t+2}+\gamma V^\pi(s_{t+2})}{s_{t+1}}{\pi'}}{s_t}{\pi'}\qquad\qquad\qquad~~ \]
Using the linearity of expectations and the law of total expectation (\(\Exp{\ExpC{x}{y}} = \Exp{x}\)): \[ V^\pi(s_t) \leq \ExpCsub{r_{t+1}+\gamma r_{t+2}+\gamma^2 V^\pi(s_{t+2})}{s_t}{\pi'}. \]
\[ \text{Repeat infinitely often:}\qquad V^\pi(s_t) \leq \ExpCsub{r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \ldots}{s_t}{\pi'} \fragment{=V^{\pi'}(s_t) \qquad \square} \]
When do we stop?
Conclusion: Policy improvement yields a strictly better policy, unless \(\pi\) is already optimal!
💡 In the stochastic setting, each maximizing action can be given a non-zero probability in \(\pias\). Any apportioning scheme is allowed as long as all submaximal actions are given zero probability.
This way of finding an optimal policy is called policy iteration.
💡 Since each policy evaluation is itself an iterative computation, we start it with the value function for the previous policy. This typically results in a great increase in the speed of convergence of policy evaluation (presumably because the value function changes little from one policy to the next).
Policy evaluation:
while (True):
\(\quad\) \(\Delta = 0\)
\(\quad\) for \(s \in \Sc\):
\(\quad\quad\) \(V_{\mathsf{old}} = V(s)\)
\(\quad\quad\) \(V(s) = \sum_{a\in\Ac} \pias \sum_{s'\in\Sc} \psprimesa \left[ r + \gamma V(s') \right]\)
\(\quad\quad\) \(\Delta = \max(\Delta, \abs{V_{\mathsf{old}}-V(s)})\)
\(\quad\) if \(\Delta < \theta\) then break
Policy improvement:
\(\mathsf{flag}_{\mathsf{stable}} = true\)
for \(s \in \Sc\):
\(\quad\) \(\pi_{\mathsf{old}}(s) = \pi(s)\)
\(\quad\) \(\pi(s) = \arg\max_{a\in\Ac}\sum_{s'\in\Sc} \psprimesa \left[r + \gamma V(s')\right]\)
\(\quad\) if \(\pi(s) \neq \pi_{\mathsf{old}}(s)\) then \(\mathsf{flag}_{\mathsf{stable}} = false\)
if \(\mathsf{flag}_{\mathsf{stable}} = true\) then return \(V\approx V^*\) and \(\pi \approx \pi^*\) else go back to 2. Policy evaluation
Let’s look at our earlier derivation of the policy evaluation procedure:
\[\begin{equation} V_{k+1}(s) = \ExpCsub{r+\gamma V_{k}(s')}{s}{\textcolor{blue}{\pi}} = \textcolor{blue}{\sum_{a\in\Ac} \pias} \sum_{s'\in\Sc} \psprimesa \left[ r + \gamma V_{k}(s') \right] \tag{\ref{eq:DP_IterativePolicyEvaluation}} \end{equation}\]
Now, we change this to directly update the value function using the maximizing action (policy improvement step): \[\begin{equation} V_{k+1}(s) = \textcolor{red}{\max_{a\in\Ac}}\ExpC{r+\gamma V_{k}(s')}{s,a} = \textcolor{red}{\max_{a\in\Ac}}\sum_{s'\in\Sc} \psprimesa \left[ r + \gamma V_{k}(s') \right] \end{equation}\]
Let’s compare this to the Bellman equation: \[\begin{align} V^*(s) = \max_{a\in\Ac} \ExpC{r+\gamma V^*(s')}{s,a} = \max_{a\in\Ac} \sum_{s'\in\Sc} \psprimesa\left[ r + \gamma V^*(s') \right]. \tag{\ref{eq:DP_BellmanVstar}} \end{align}\]
\(\Rightarrow\) We have simply turned \(\eqref{eq:DP_BellmanVstar}\) into an iterative procedure!
\(\Rightarrow\) Value iteration can be shown to converge for arbitrary \(V_0\)
Parameters: a small threshold \(\theta > 0\) determining accuracy of estimation
Initialize: \(V(s)\) arbitrarily for \(s \in \Sc\), \(V(\mathsf{terminal}) = 0\)
while (True):
\(\quad\) \(\Delta=0\)
\(\quad\) for \(s \in \Sc\):
\(\quad\quad\) \(V_{\mathsf{old}}(s) = V(s)\)
\(\quad\quad\) \(V(s) = \max_{a\in\Ac} \sum_{s'\in\Sc} \psprimesa \left[ r + \gamma V(s') \right]\)
\(\quad\quad\) \(\Delta = \max(\Delta, \abs{V_{\mathsf{old}}(s)-V(s)})\)
\(\quad\) if \(\Delta < \theta\) then break
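A matching NumPy sketch of value iteration (the same illustrative two-state MDP as an assumption; a greedy policy is extracted once the values have converged):

```python
import numpy as np

def value_iteration(P, R, gamma, theta=1e-8):
    """Turn the Bellman optimality equation into an in-place update rule."""
    n_s, n_a = P.shape[0], P.shape[1]
    V = np.zeros(n_s)
    while True:
        delta = 0.0
        for s in range(n_s):
            v_old = V[s]
            # V(s) = max_a sum_s' p(s'|s,a) [r + gamma V(s')]
            V[s] = max(np.sum(P[s, a] * (R[s, a] + gamma * V))
                       for a in range(n_a))
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    # Extract a greedy policy from the (approximately) optimal values
    pi = np.einsum("sap,sap->sa", P, R + gamma * V).argmax(axis=1)
    return V, pi

# Illustrative 2-state MDP: action 0 stays put, action 1 switches state;
# switching from state 0 to 1 pays reward 1.
P = np.zeros((2, 2, 2))
R = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
R[0, 1, 1] = 1.0
V, pi = value_iteration(P, R, gamma=0.5)
```

Value iteration reaches the same \(V^*\) and greedy policy as policy iteration on this MDP, but without ever evaluating an intermediate policy to convergence.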