Prof. Dr. Sebastian Peitz
Chair of Safe Autonomous Systems, TU Dortmund
\(^*\)We will have a discussion on the notation and random variables in the next lecture (MDPs)
Question: How do we pick the action \(a_t\)? \(\Rightarrow\) just take the max!
\[a_t=\arg\max_a Q_t(a)\]
(with ties broken arbitrarily)
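The greedy choice with random tie-breaking can be sketched as follows (a minimal illustration using NumPy, not the lecture's own code):

```python
import numpy as np

def greedy(Q, rng=np.random.default_rng(0)):
    """Return an argmax of Q, breaking ties uniformly at random."""
    best = np.flatnonzero(Q == Q.max())  # indices of all maximal entries
    return int(rng.choice(best))

Q = np.array([0.0, 1.0, 1.0])
a = greedy(Q)  # either 1 or 2, never 0
```

Plain `np.argmax` always returns the first maximal index; the `flatnonzero` step is what realizes "ties broken arbitrarily".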
What is \(\epsilon\)’s job? \(\Rightarrow\) exploration
# Initialization
for a in range(k):
    Q(a) = 0
    N(a) = 0
# Run forever
while True:
    # exploration versus exploitation
    if rand() > epsilon:
        a = argmax(Q)  # exploitation
    else:
        a = randint(k)  # exploration
    r = bandit(a)
    # Error correction towards target r
    N(a) = N(a) + 1
    Q(a) = Q(a) + (1/N(a)) * (r - Q(a))

Calculating \(Q(a)\) anew every time appears to be expensive. For a single action, suppose we have received the rewards \(r_1,\ldots,r_{t-1}\) so far: \[ Q_t = \frac{r_1 + r_2 + \ldots + r_{t-1}}{t-1}\] Wouldn’t it be easier if we could just update incrementally? \(\Rightarrow\) let’s try
\[ \begin{eqnarray*} Q_{t+1} &=& \frac{1}{t}\sum_{i=1}^t r_i = \frac{1}{t}\left(r_t + \sum_{i=1}^{t-1} r_i \right) = \frac{1}{t}\left(r_t + (t-1)\frac{1}{t-1} \sum_{i=1}^{t-1} r_i \right)\\ &=& \frac{1}{t}\left(r_t + (t-1)Q_t \right) = \frac{1}{t}\left(r_t + t Q_t - Q_t \right)\\ &=& Q_t + \frac{1}{t}\left(r_t - Q_t \right) \end{eqnarray*} \]
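A quick numerical check (with arbitrary random rewards, chosen only for illustration) confirms that the incremental rule reproduces the full sample average exactly:

```python
import numpy as np

rng = np.random.default_rng(42)
rewards = rng.normal(size=100)     # r_1, ..., r_100

Q = 0.0
for t, r in enumerate(rewards, start=1):
    Q += (r - Q) / t               # Q_{t+1} = Q_t + (1/t)(r_t - Q_t)

# identical (up to float rounding) to averaging all rewards at once
assert np.isclose(Q, rewards.mean())
```

The incremental form needs only \(O(1)\) memory and time per step, instead of storing and re-summing all past rewards.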
The expression \([\, \mathsf{Target} - \mathsf{OldEstimate} \,]\) is an error in the estimate; the update steps toward the target to reduce it
We will see formulas of the type \[ \mathsf{NewEstimate} \leftarrow \mathsf{OldEstimate} + \mathsf{StepSize} \; [ \mathsf{Target} - \mathsf{OldEstimate} ] \] frequently from now on!
\[ \begin{eqnarray*} Q_{t+1} = Q_t + \alpha [r_t - Q_t] &=& \alpha r_t + (1-\alpha) Q_t \\ &=& \alpha r_t + (1-\alpha) [\alpha r_{t-1} + (1-\alpha) Q_{t-1}] = \alpha r_t + (1-\alpha) \alpha r_{t-1} + (1-\alpha)^2 Q_{t-1} \\ &=& \alpha r_t + (1-\alpha) \alpha r_{t-1} + (1-\alpha)^2 \alpha r_{t-2} + \ldots + (1-\alpha)^{t-1} \alpha r_{1} + (1-\alpha)^t Q_{1} \\ &=& (1-\alpha)^t Q_{1} + \sum_{i=1}^t \alpha (1-\alpha)^{t-i}r_i \end{eqnarray*} \]
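The same kind of numerical check works for a constant step size \(\alpha\) (all values below are assumptions for illustration): the recursive update matches the exponentially weighted closed form, and the weights \((1-\alpha)^t\) plus \(\sum_i \alpha(1-\alpha)^{t-i}\) sum to one, so this is a weighted average that forgets old rewards geometrically:

```python
import numpy as np

alpha = 0.1
rng = np.random.default_rng(1)
r = rng.normal(size=50)            # r_1, ..., r_t
Q1 = 0.5                           # arbitrary initial estimate
t = len(r)

# recursive form: Q_{t+1} = Q_t + alpha * (r_t - Q_t)
Q = Q1
for r_i in r:
    Q += alpha * (r_i - Q)

# closed form: (1-alpha)^t Q_1 + sum_i alpha (1-alpha)^(t-i) r_i
weights = alpha * (1 - alpha) ** (t - np.arange(1, t + 1))
Q_closed = (1 - alpha) ** t * Q1 + weights @ r

assert np.isclose(Q, Q_closed)
# the weights form a convex combination
assert np.isclose((1 - alpha) ** t + weights.sum(), 1.0)
```

Unlike the sample average with step size \(1/t\), the constant-\(\alpha\) estimate never converges exactly, but it tracks nonstationary reward distributions, which is usually the point of choosing it.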
Two topics (among many more) that we have not discussed:
For details, see (Sutton and Barto 1998)