Prof. Dr. Sebastian Peitz
Chair of Safe Autonomous Systems, TU Dortmund
We begin by learning \(V^\pi\) for a given policy \(\pi\).
initialize
for \(k = 1, 2, \ldots, K\) episodes:
\(\quad\) \(g = 0\)
\(\quad\) Generate a sequence following \(\pi\): \[((s_0,a_0,r_0),(s_1,a_1,r_1),\ldots,(s_{T_k-1},a_{T_k-1},r_{T_k-1}))\] \(\quad\) for \(t \in \{T_k-1,T_k-2,T_k-3,\ldots,0\}\):
\(\quad\quad\) \(g = \gamma g+ r_t\)
\(\quad\quad\) if \(s_t \notin \{s_0,\ldots,s_{t-1}\}\): \(\qquad\) (that’s the first-visit condition)
\(\quad\quad\quad\) Append \(g\) to \(\ell(s_t)\)
\(\quad\quad\quad\) \(V(s_t) = \mathsf{average}(\ell(s_t))\)
Convergence? Yes, for every state \(s\in\Sc\) that is visited infinitely often (\(\rightarrow\) exploration!)
initialize
for \(k = 1, 2, \ldots, K\) episodes:
\(\quad\) \(g = 0\)
\(\quad\) Generate a sequence following \(\pi\): \[((s_0,a_0,r_0),(s_1,a_1,r_1),\ldots,(s_{T_k-1},a_{T_k-1},r_{T_k-1}))\] \(\quad\) for \(t \in \{T_k-1,T_k-2,T_k-3,\ldots,0\}\):
\(\quad\quad\) \(g = \gamma g+ r_t\)
\(\quad\quad\) \(\cancel{\textbf{if}~ s_t \notin \{s_0,\ldots,s_{t-1}\}:}\) \(\qquad\) (only difference to first-visit)
\(\quad\quad\) Append \(g\) to \(\ell(s_t)\)
\(\quad\quad\) \(V(s_t) = \mathsf{average}(\ell(s_t))\)
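The two pseudocode variants above differ only in the first-visit check. A minimal Python sketch, assuming a hypothetical `sample_episode(policy)` helper that returns a list of \((s_t, a_t, r_t)\) tuples:

```python
from collections import defaultdict

def mc_prediction(sample_episode, policy, gamma, num_episodes, first_visit=True):
    """MC prediction of V^pi; episodes are processed backwards, so the
    return g is accumulated incrementally as g = gamma * g + r_t."""
    returns = defaultdict(list)           # ell(s): observed returns per state
    V = {}
    for _ in range(num_episodes):
        episode = sample_episode(policy)  # [(s_0, a_0, r_0), ..., (s_{T-1}, a_{T-1}, r_{T-1})]
        states = [s for s, _, _ in episode]
        g = 0.0
        for t in reversed(range(len(episode))):
            s, _, r = episode[t]
            g = gamma * g + r
            if first_visit and s in states[:t]:   # skip if s appeared earlier in the episode
                continue
            returns[s].append(g)
            V[s] = sum(returns[s]) / len(returns[s])
    return V
```

Setting `first_visit=False` yields the every-visit variant; both converge to \(V^\pi\) for episodic tasks.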
We have a small robot in a gridworld that wants to recharge.
If we know the model (i.e., \(p\)), we can use the iterative policy evaluation algorithm:
Now: Every-visit MC prediction of \(V^\pi\), for \(\pi\agivenb{\cdot}{s} = [0.25, 0.25, 0.25, 0.25]^\top ~\forall~ s\in\Sc\).
Is a model available (i.e., the MDP \((\Sc, \Ac, p, r, \gamma)\) is known)?
Estimating \(Q^\pi(s,a)\) using MC:
Possible challenges?
We learned about the Generalized policy iteration (GPI) in the last lecture.
Let’s use the same procedure here: \[ \pi_0 \stackrel{E}{\longrightarrow} Q^{\pi_0} \stackrel{I}{\longrightarrow} \pi_1 \stackrel{E}{\longrightarrow} Q^{\pi_1} \ldots \stackrel{I}{\longrightarrow} \pi^* \stackrel{E}{\longrightarrow} Q^*=Q^{\pi^*}. \]
Policy improvement: Make the policy greedy w.r.t. the current value function:
Since we have the action-value function \(Q\), we don’t need a model to construct the policy: \[ \pi(s) = \arg\max_{a\in\Ac} Q(s,a). \]
Even if we’re operating in an unknown MDP, the policy improvement theorem remains valid for MC-based control due to the underlying MDP structure:
Policy improvement theorem for MC-based control \[ \begin{align*} Q^{\pi_k}(s,\pi_{k+1}(s)) &= Q^{\pi_k}\left(s,\arg\max_{a\in\Ac} Q^{\pi_k}(s,a)\right) \\ &=\max_{a\in\Ac} Q^{\pi_k}(s,a) \\ &\geq Q^{\pi_k}(s,\pi_k(s)) \\ &\geq V^{\pi_k}(s). \end{align*} \]
initialize
for \(k = 1, 2, \ldots, K\) episodes (or until \(\pi\) converges):
\(\quad\) \(g = 0\)
\(\quad\) Choose \(s_0\in\Sc\) and \(a_0\in\Ac\) randomly such that all pairs have probability \(>0\)
\(\quad\) Generate a sequence starting at \((s_0, a_0)\) and following \(\pi\): \[((s_0,a_0,r_0),(s_1,a_1,r_1),\ldots,(s_{T_k-1},a_{T_k-1},r_{T_k-1}))\] \(\quad\) for \(t \in \{T_k-1,T_k-2,T_k-3,\ldots,0\}\):
\(\quad\quad\) \(g = \gamma g+ r_t\)
\(\quad\quad\) if \((s_t,a_t) \notin \{(s_0,a_0),\ldots,(s_{t-1},a_{t-1})\}\):
\(\quad\quad\quad\) Append \(g\) to \(\ell(s_t,a_t)\)
\(\quad\quad\quad\) \(Q(s_t,a_t) = \mathsf{average}(\ell(s_t,a_t))\)
\(\quad\quad\quad\) \(\pi(s_t) = \arg\max_{a\in\Ac}Q(s_t, a)\)
đź’ˇ if a problem is non-stationary, we can use a step size \(\alpha\in(0,1]\) as in the multi-armed bandit setting: \(Q(s_t,a_t) = Q(s_t,a_t) + \alpha \left[g - Q(s_t,a_t)\right]\).
initialize
for \(k = 1, 2, \ldots, K\) episodes (or until \(\pi\) converges):
\(\quad\) \(g = 0\)
\(\quad\) Choose \(s_0\in\Sc\) and \(a_0\in\Ac\) randomly such that all pairs have probability \(>0\)
\(\quad\) Generate a sequence starting at \((s_0, a_0)\) and following \(\pi\): \[((s_0,a_0,r_0),(s_1,a_1,r_1),\ldots,(s_{T_k-1},a_{T_k-1},r_{T_k-1}))\] \(\quad\) for \(t \in \{T_k-1,T_k-2,T_k-3,\ldots,0\}\):
\(\quad\quad\) \(g = \gamma g + r_t\)
\(\quad\quad\) if \((s_t,a_t) \notin \{(s_0,a_0),\ldots,(s_{t-1},a_{t-1})\}\):
\(\quad\quad\quad\) \(n(s_t,a_t) = n(s_t,a_t) + 1\) \(\qquad\qquad\qquad\qquad\qquad\qquad\) (replaces appending \(g\) to the list of returns)
\(\quad\quad\quad\) \(Q(s_t,a_t) = Q(s_t,a_t) + \frac{1}{n(s_t,a_t)} \left[g - Q(s_t,a_t)\right]\)\(\qquad\quad\) (averaging over the list of returns)
\(\quad\quad\quad\) \(\pi(s_t) = \arg\max_{a\in\Ac}Q(s_t, a)\)
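The exploring-starts control loop can be sketched as follows; the `sample_episode_from(s0, a0, policy)` helper, which forces the first state-action pair and then follows \(\pi\), is an assumption:

```python
import random
from collections import defaultdict

def mc_exploring_starts(states, actions, sample_episode_from, gamma, num_episodes):
    """Monte Carlo control with exploring starts and incremental averaging."""
    Q = defaultdict(float)                    # Q(s, a), initialized to zero
    n = defaultdict(int)                      # visit counts per (s, a)
    policy = {s: actions[0] for s in states}
    for _ in range(num_episodes):
        s0 = random.choice(states)            # exploring start: every pair
        a0 = random.choice(actions)           # has positive probability
        episode = sample_episode_from(s0, a0, policy)
        pairs = [(s, a) for s, a, _ in episode]
        g = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = gamma * g + r
            if (s, a) in pairs[:t]:           # first-visit condition on (s, a)
                continue
            n[(s, a)] += 1
            Q[(s, a)] += (g - Q[(s, a)]) / n[(s, a)]            # incremental average
            policy[s] = max(actions, key=lambda x: Q[(s, x)])   # greedy improvement
    return Q, policy
```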
Let’s apply the Monte Carlo Exploring Starts algorithm to two gridworlds of different sizes:
On-Policy: “Learning by Doing”
Off-Policy: “Learning by Observing”
Question: Which class does the Monte Carlo ES algorithm belong to?
Answer: It is on-policy! In every episode \(k\), the returns are collected anew by following the current policy \(\pi\), i.e., the very policy that is being updated.
We will cover the topic on-policy versus off-policy in more detail later in this course.
| Feature | On-policy | Off-policy |
|---|---|---|
| Learning Source | Learns from its current actions only. | Can learn from any data (past, random, or expert). |
| Exploration | Often more stable but can get stuck in “safe” habits. | More aggressive at finding the absolute best strategy. |
| Efficiency | Lower. Throws away data once the policy changes. | Higher. Can “re-use” old experiences (Experience Replay). |
Motivating question: How do we get rid of the restrictive (and often not achievable) requirement of exploring starts (ES)?
Where have we seen something like this before? \(\Rightarrow\) In the multi-armed bandits lecture!
\(\epsilon\)-greedy on-policy learning
initialize
for \(k = 1, 2, \ldots, K\) episodes (or until \(\pi\) converges):
\(\quad\) \(g = 0\)
\(\quad\) Choose \(s_0\in\Sc\) randomly (exploring starts are no longer needed, since \(\pi\) is \(\epsilon\)-soft)
\(\quad\) Generate a sequence \(\set{(s_t,a_t,r_t)}_{t=0}^{T_k-1}\) starting at \(s_0\) and following \(\pi\)
\(\quad\) for \(t \in \{T_k-1,T_k-2,T_k-3,\ldots,0\}\):
\(\quad\quad\) \(g = \gamma g + r_t\)
\(\quad\quad\) if \((s_t,a_t) \notin \{(s_0,a_0),\ldots,(s_{t-1},a_{t-1})\}\):
\(\quad\quad\quad\) \(n(s_t,a_t) = n(s_t,a_t) + 1\)
\(\quad\quad\quad\) \(Q(s_t,a_t) = Q(s_t,a_t) + \frac{1}{n(s_t,a_t)} \left[g - Q(s_t,a_t)\right]\)
\(\quad\quad\quad\) \(\tilde a = \arg\max_{a\in\Ac}Q(s_t, a)\)
\(\quad\quad\quad\) \(\pi\agivenb{a}{s_t} = \begin{cases} 1-\epsilon+\epsilon/\abs{\Ac}, & a = \tilde{a} \\ \epsilon/\abs{\Ac}, & a \neq \tilde{a} \end{cases}\)
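The \(\epsilon\)-greedy policy construction in the last two lines, in isolation (a minimal sketch; `Q` is assumed to be a dictionary over state-action pairs):

```python
def epsilon_greedy_probs(Q, s, actions, eps):
    """Return pi(a|s) for an epsilon-greedy policy w.r.t. Q."""
    a_star = max(actions, key=lambda a: Q[(s, a)])    # greedy action (ties: first max)
    probs = {a: eps / len(actions) for a in actions}  # eps/|A| for every action
    probs[a_star] += 1.0 - eps                        # extra mass on the greedy action
    return probs
```

For \(\epsilon = 0.2\) and two actions, the greedy action gets probability \(1-\epsilon+\epsilon/\abs{\Ac} = 0.9\) and the other one \(\epsilon/\abs{\Ac} = 0.1\).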
Given an MDP, an \(\epsilon\)-greedy policy \(\pi'\) w.r.t. \(Q^\pi\) is an improvement over the \(\epsilon\)-soft policy \(\pi\), i.e., \(V^{\pi^\prime}(s) \geq V^\pi(s)\) for all \(s\in\Sc\).
Small proof: \[ \begin{align*} V^{\pi^\prime}(s)=Q^\pi(s,\pi'(s)) &= \sum_{a\in\Ac} \pi'\agivenb{a}{s} Q^\pi(s,a) \\ &= \frac{\epsilon}{\abs{\Ac}}\left(\sum_{a\in\Ac} Q^\pi(s,a)\right) + (1-\epsilon) \max_{a\in\Ac} Q^\pi(s,a)\\ &\geq \frac{\epsilon}{\abs{\Ac}}\left(\sum_{a\in\Ac} Q^\pi(s,a)\right) + (1-\epsilon) \left( \sum_{a\in\Ac} \frac{\pias-\frac{\epsilon}{\abs{\Ac}}}{1-\epsilon} Q^\pi(s,a)\right)\\ \textit{(Last term: weighted }&\text{average over all actions. It thus has to be smaller than the max operation.)}\\ &=\frac{\epsilon}{\abs{\Ac}}\left(\sum_{a\in\Ac} Q^\pi(s,a)\right) - \frac{\epsilon}{\abs{\Ac}}\left(\sum_{a\in\Ac} Q^\pi(s,a)\right) + \sum_{a\in\Ac} \pias Q^\pi(s,a)\\ &= V^\pi(s). \end{align*} \]
A learning policy \(\pi\) is called GLIE if it satisfies the following two properties:
MC-based control using \(\epsilon\)-greedy exploration is GLIE, if \(\epsilon\) is decreased at rate \[ \epsilon_k=\frac{1}{k}, \] with \(k\) being the increasing episode index. In this case, \[ Q^\pi(s,a)=Q^*(s,a). \]
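The schedule \(\epsilon_k = 1/k\) can be illustrated numerically: the per-episode probability of a non-greedy action vanishes, while the expected total amount of exploration grows without bound (harmonic series):

```python
def glie_epsilon_stats(num_episodes, num_actions):
    """For eps_k = 1/k: return the total expected exploration (a diverging
    harmonic sum) and the final per-episode non-greedy probability."""
    eps = [1.0 / k for k in range(1, num_episodes + 1)]
    # probability of a non-greedy action in episode k under eps_k-greedy
    p_explore = [e * (num_actions - 1) / num_actions for e in eps]
    return sum(eps), p_explore[-1]
```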
Remarks:
TBD
All learning control methods face a dilemma:
The off-policy learning idea:
Requirement:
The central concept: re-weighting
Caveat: if \(\rho\) is large (distinctly different policies) the estimate’s variance is large (i.e., uncertain for small numbers of samples).
Given a starting state \(s_t\), the probability of the subsequent state–action trajectory \((a_t, s_{t+1},a_{t+1}, \ldots ,s_T)\) under a policy \(\pi\) is \[ \begin{align*} p\agivenb{a_t, s_{t+1},a_{t+1}, \ldots ,s_T}{s_t,\pi} &= \pi\agivenb{a_t}{s_t}p\agivenb{s_{t+1}}{s_t,a_t}\fragment{\pi\agivenb{a_{t+1}}{s_{t+1}}p\agivenb{s_{t+2}}{s_{t+1},a_{t+1}}}\fragment{\ldots p\agivenb{s_{T}}{s_{T-1},a_{T-1}}}\\ &= \prod_{k=t}^{T-1} \pi\agivenb{a_k}{s_k}p\agivenb{s_{k+1}}{s_{k},a_{k}}. \end{align*} \]
The relative probability of a trajectory under the target and behavior policy, the importance sampling ratio, from sample step \(t\) to \(T\) is \[ \rho_{t:T} = \frac{\prod_{k=t}^{T-1} \pi\agivenb{a_k}{s_k}p\agivenb{s_{k+1}}{s_{k},a_{k}}}{\prod_{k=t}^{T-1} b\agivenb{a_k}{s_k}p\agivenb{s_{k+1}}{s_{k},a_{k}}} \fragment{= \frac{\prod_{k=t}^{T-1} \pi\agivenb{a_k}{s_k}\cancel{p\agivenb{s_{k+1}}{s_{k},a_{k}}}}{\prod_{k=t}^{T-1} b\agivenb{a_k}{s_k}\cancel{p\agivenb{s_{k+1}}{s_{k},a_{k}}}}} \fragment{= \frac{\prod_{k=t}^{T-1} \pi\agivenb{a_k}{s_k}}{\prod_{k=t}^{T-1} b\agivenb{a_k}{s_k}}.} \]
💡 Although the trajectory probabilities depend on the MDP’s transition probabilities \(p\) (which are generally unknown), the importance sampling ratio ends up depending only on the two policies and the sequence, not on the MDP.
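Because the transition probabilities cancel, \(\rho\) can be computed from the two policies and the observed trajectory alone; a minimal sketch, with policies represented as dictionaries mapping state-action pairs to probabilities:

```python
def importance_ratio(trajectory, pi, b):
    """rho: product of pi(a|s) / b(a|s) over the (s, a) pairs of a trajectory;
    the unknown transition probabilities p have cancelled out."""
    rho = 1.0
    for s, a in trajectory:
        rho *= pi[(s, a)] / b[(s, a)]
    return rho
```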
Estimating the state value \(V^\pi\) from episodes generated by a behavior policy \(b\) using ordinary importance sampling (OIS) scales each sampled return by the importance sampling ratio of its episode and averages over the number of visits: \[ \begin{equation} V^\pi(s) = \frac{\sum_{t\in\mathcal{T}(s)}\rho_{t:T(t)}\, g_t}{\abs{\mathcal{T}(s)}}. \tag{OIS} \label{eq:MC_OIS} \end{equation} \] Here, \(\mathcal{T}(s)\) is the set of all time steps at which \(s\) is visited, and \(T(t)\) is the termination time of the episode containing step \(t\).
Estimating the state value \(V^\pi\) following \(b\) using weighted importance sampling (WIS) results in a slightly different scaling: \[ \begin{equation} V^\pi(s) = \frac{\sum_{t\in\mathcal{T}(s)}\rho_{t:T(t)}\, g_t}{\sum_{t\in\mathcal{T}(s)}\rho_{t:T(t)}}. \tag{WIS} \label{eq:MC_WIS} \end{equation} \]
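Given the returns \(g_t\) and ratios \(\rho\) collected for one state, the two estimators differ only in the denominator; a minimal sketch:

```python
def ois_estimate(rhos, returns):
    """Ordinary importance sampling: average over the number of samples."""
    return sum(r * g for r, g in zip(rhos, returns)) / len(returns)

def wis_estimate(rhos, returns):
    """Weighted importance sampling: normalize by the sum of the ratios."""
    total = sum(rhos)
    return sum(r * g for r, g in zip(rhos, returns)) / total if total > 0 else 0.0
```

OIS is unbiased but its variance can be very large; WIS is biased (the bias vanishes asymptotically) at much lower variance.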
Main differences between the two (first-visit) versions:
If we want to perform weighted importance sampling in an incremental fashion, we need to keep track of all the weights in \(\eqref{eq:MC_WIS}\). Let us do this for the value estimate \(V_n\) formed from the first \(n-1\) returns: \[ V_n = \frac{\sum_{t=1}^{n-1} w_t g_t}{\sum_{t=1}^{n-1} w_t}, \qquad \text{where}\quad w_t=\rho_{t:T(t)}. \] In addition to keeping track of \(V_n\), we must maintain for each state the cumulative sum \(c_n\) of the weights given to the first \(n\) returns. The update rule for \(V_n\) is then: \[ V_{n+1} = V_n + \frac{w_n}{c_n}\left[g_n - V_n\right] \qquad \fragment{\text{and} \qquad c_{n+1} = c_n + w_{n+1},} \] with \(c_0=0\).
initialize (for all \(s \in \Sc\), \(a\in\Ac\))
for \(k = 1, 2, \ldots, K\) episodes (or until \(\pi\) converges):
\(\quad\) \(g = 0\), \(w=1\)
\(\quad\) Generate \(\set{(s_t,a_t,r_t)}_{t=0}^{T_k-1}\) following a soft policy \(b\)
\(\quad\) for \(t \in \{T_k-1,T_k-2,T_k-3,\ldots,0\}\):
\(\quad\quad\) \(g = \gamma g + r_t\)
\(\quad\quad\) \(c(s_t,a_t) = c(s_t,a_t) + w\)
\(\quad\quad\) \(Q(s_t,a_t) = Q(s_t,a_t) + \frac{w}{c(s_t,a_t)} \left[g - Q(s_t,a_t)\right]\)
\(\quad\quad\) \(w = w \frac{\pi\agivenb{a_t}{s_t}}{b\agivenb{a_t}{s_t}}\)
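The prediction loop above, sketched in Python; episodes are assumed to be lists of \((s_t,a_t,r_t)\) tuples generated under the behavior policy \(b\), and the early exit once the weight hits zero is an optional shortcut not shown in the pseudocode:

```python
from collections import defaultdict

def offpolicy_mc_prediction(episodes, pi, b, gamma):
    """Off-policy MC prediction of Q^pi from episodes generated under b,
    via incremental weighted importance sampling."""
    Q = defaultdict(float)   # action-value estimates
    c = defaultdict(float)   # cumulative sum of weights per (s, a)
    for episode in episodes:
        g, w = 0.0, 1.0
        for s, a, r in reversed(episode):
            g = gamma * g + r
            c[(s, a)] += w
            Q[(s, a)] += (w / c[(s, a)]) * (g - Q[(s, a)])
            w *= pi[(s, a)] / b[(s, a)]
            if w == 0.0:     # optional shortcut: earlier steps get zero weight
                break
    return Q
```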
Variation: On-policy
We now have our final algorithm, which combines everything we have learned so far:
initialize (for all \(s \in \Sc\), \(a\in\Ac\))
for \(k = 1, 2, \ldots, K\) episodes (or until \(\pi\) converges):
\(\quad\) \(g = 0\), \(w=1\)
\(\quad\) Generate \(\set{(s_t,a_t,r_t)}_{t=0}^{T_k-1}\) following a soft policy \(b\)
\(\quad\) for \(t \in \{T_k-1,T_k-2,T_k-3,\ldots,0\}\):
\(\quad\quad\) \(g = \gamma g + r_t\)
\(\quad\quad\) \(c(s_t,a_t) = c(s_t,a_t) + w\)
\(\quad\quad\) \(Q(s_t,a_t) = Q(s_t,a_t) + \frac{w}{c(s_t,a_t)} \left[g - Q(s_t,a_t)\right]\)
\(\quad\quad\) \(\pi(s_t) = \arg\max_{a\in\Ac}Q(s_t,a)\) (with ties broken consistently)
\(\quad\quad\) if \(\pi(s_t)\neq a_t\):
\(\quad\quad\quad\) Exit inner loop and proceed to next episode
\(\quad\quad\) \(w = w \frac{1}{b\agivenb{a_t}{s_t}}\)
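A sketch of the full control loop, assuming a hypothetical `sample_episode()` helper that generates episodes under a uniform-random behavior policy \(b\):

```python
import random
from collections import defaultdict

def offpolicy_mc_control(states, actions, sample_episode, gamma, num_episodes):
    """Off-policy MC control: a greedy target policy pi is learned from
    episodes generated by a uniform-random behavior policy b."""
    Q = defaultdict(float)
    c = defaultdict(float)                  # cumulative weights per (s, a)
    pi = {s: actions[0] for s in states}    # greedy target policy
    b_prob = 1.0 / len(actions)             # b(a|s): uniform soft behavior policy
    for _ in range(num_episodes):
        g, w = 0.0, 1.0
        for s, a, r in reversed(sample_episode()):
            g = gamma * g + r
            c[(s, a)] += w
            Q[(s, a)] += (w / c[(s, a)]) * (g - Q[(s, a)])
            pi[s] = max(actions, key=lambda x: Q[(s, x)])  # ties broken consistently
            if pi[s] != a:      # pi is deterministic, so the ratio is zero
                break           # exit inner loop, proceed to next episode
            w /= b_prob         # pi(a_t|s_t) = 1 for the greedy action
    return Q, pi
```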