Prof. Dr. Sebastian Peitz
Chair of Safe Autonomous Systems, TU Dortmund
Content:
| Chapter | Topic | Content |
|---|---|---|
| Basics & tabular methods | ||
| 1-5 | Bandits, MDPs, Dynamic Programming, Monte Carlo, TD Learning | RL basics in finite dimensions |
| Deep-learning-based methods | ||
| 6 | Brief introduction to deep learning | The basics for what comes next |
| 7 | Value function approximation | |
| 8 | Deep \(Q\)-learning | |
| 9 | Policy gradients | |
| 10 | Actor-critic algorithms | |
| 11 | Advanced algorithms (Part I): From policy gradient to PPO | |
| 12 | Advanced algorithms (Part II): From \(Q\)-learning to Soft Actor-Critic | |
| Model-Based Control | ||
| Advanced Topics | ||
Given a labeled (and probably noisy) dataset \(\Dtrain=\{(x_1,y_1),\ldots,(x_N,y_N)\}\), approximate the unknown mapping \(f : \Xc \to \Yc\) by a parametrizable ML model \(f_\theta: \Xc \to \Yc\), such that \[ f_\theta(x_k) = \hat{y}_k \approx y_k \quad \forall ~ (x_k,y_k)\in\Dtest .\]
\(\bullet\) In the ML context, bias denotes the error of the average model \(\overline{f}_{\theta}\) when repeating the training with different datasets \(\Dc_{\mathsf{train},1},\Dc_{\mathsf{train},2},\ldots\): \[ \mathsf{bias} = \Expsub{\left(\overline{f}_{\theta}(x) - f(x)\right)^2}{x\sim\Dtest}. \]
\(\bullet\) Variance denotes the variability between the individual training runs: \[ \mathsf{variance} = \Expsub{\Expsub{\left(f^{(\Dc)}_{\theta}(x) - \overline{f}_{\theta}(x)\right)^2}{\Dc\sim\{\Dc_{\mathsf{train,\ell}}\}_{\ell=1}^\infty}}{x\sim\Dtest}. \]
\(\Rightarrow\) Often a matter of model complexity.
Low bias, low variance
Large bias, low variance
Low bias, large variance
Large bias, large variance [Wikipedia]
Example (see Wikipedia): Fitting a model of several radial basis functions to noisy training data: \[ f_\theta(x) = \sum_{k=1}^d \theta_k \exp\left(-\frac{1}{2}\frac{(x-c_k)^2}{\sigma_k^2}\right). \]
\(\bullet\) For a wide spread (i.e., large \(\sigma_k\)), the bias is high: the RBFs cannot fully approximate the function (especially the central dip), but the variance between different trials is low.
\(\bullet\) As the spread decreases (images 3 and 4), the bias decreases: the blue curves more closely approximate the red…
\(\bullet\) … but the variance between trials (\(\Dc_{\mathsf{train,1}},\Dc_{\mathsf{train,2}},\ldots\)) increases.
Wide spread RBFs.
Medium spread RBFs.
Small spread RBFs.
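The bias and variance definitions above can be estimated numerically by repeating the fit on many resampled training sets. The following is a minimal sketch and not the setup behind the figures: the ground truth `sin(2πx)`, the six equally spaced centers, the shared width `sigma`, the noise level, and the least-squares fit are all assumptions made for illustration.

```python
import numpy as np

def rbf_features(x, centers, sigma):
    # Design matrix: Phi[i, k] = exp(-(x_i - c_k)^2 / (2 sigma^2))
    return np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2 / sigma ** 2)

def bias_variance(sigma, n_runs=200, n_train=20, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(2.0 * np.pi * x)    # assumed ground truth f
    centers = np.linspace(0.0, 1.0, 6)       # assumed RBF centers c_k
    x_test = np.linspace(0.0, 1.0, 100)      # noise-free test grid
    phi_test = rbf_features(x_test, centers, sigma)

    preds = np.empty((n_runs, x_test.size))
    for run in range(n_runs):
        # Draw a fresh noisy training set D_train,l and fit theta by least squares.
        x_tr = rng.uniform(0.0, 1.0, n_train)
        y_tr = f(x_tr) + noise * rng.normal(size=n_train)
        phi_tr = rbf_features(x_tr, centers, sigma)
        theta, *_ = np.linalg.lstsq(phi_tr, y_tr, rcond=None)
        preds[run] = phi_test @ theta

    mean_pred = preds.mean(axis=0)                    # average model
    bias = np.mean((mean_pred - f(x_test)) ** 2)      # error of the average model
    variance = np.mean((preds - mean_pred) ** 2)      # spread around the average model
    mse = np.mean((preds - f(x_test)) ** 2)           # total error w.r.t. noise-free f
    return bias, variance, mse
```

Calling `bias_variance(sigma)` for several widths shows how the two terms trade off in this particular setup. By construction, `mse` equals `bias + variance` up to rounding.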
Artificial neural networks (ANNs) are nonlinear function approximators \(\hat{y}=f_\theta(x)\) that
Standard model of supervised learning: multi-layer perceptron or feed-forward ANN.
\(^*\) Without nonlinear activation functions, every ANN collapses to a single matrix-vector multiplication \(y=\hat\Theta x\): \[\iterate{x}{\ell+2}=\iterate{\Theta}{\ell+2}\iterate{x}{\ell+1}= \iterate{\Theta}{\ell+2}\left(\iterate{\Theta}{\ell+1}\iterate{x}{\ell}\right) = \left(\iterate{\Theta}{\ell+2}\iterate{\Theta}{\ell+1}\right)\iterate{x}{\ell}= \hat\Theta \iterate{x}{\ell}.\] For non-zero biases, we obtain an affine transformation instead.
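The collapse can be checked numerically. The sketch below uses two layers with arbitrarily chosen sizes and random weights (all assumptions for illustration) and compares the layer-by-layer result with a single matrix \(\hat\Theta\), with and without biases.

```python
import numpy as np

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(4, 3))   # layer l+1 weights (sizes are arbitrary)
Theta2 = rng.normal(size=(2, 4))   # layer l+2 weights
b1 = rng.normal(size=4)
b2 = rng.normal(size=2)
x = rng.normal(size=3)

# Two linear layers without biases collapse to one matrix.
y_layers = Theta2 @ (Theta1 @ x)
Theta_hat = Theta2 @ Theta1
y_single = Theta_hat @ x

# With biases, the composition is a single affine map.
y_affine_layers = Theta2 @ (Theta1 @ x + b1) + b2
b_hat = Theta2 @ b1 + b2
y_affine_single = Theta_hat @ x + b_hat

# With a nonlinear activation in between, the single-matrix shortcut no longer applies.
y_nonlinear = Theta2 @ np.tanh(Theta1 @ x)
```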
This is called the backpropagation algorithm. We require one forward pass and one backward pass for every data tuple \((x_k,y_k)\). Taking the average gives us the average steepest-descent improvement over the dataset \(\Dtrain\) for the current \(\theta\).
Let’s consider a very simple network with \(x,y\in\R\) and two hidden layers with a single neuron each.
Symbolic
\[ \stackrel{x}{\bigcirc} \underbrace{ \underbrace{\stackrel{\theta_1}{\longrightarrow} \stackrel{\iterate{z}{1}}{\bigcirc} \stackrel{\sigma}{\longrightarrow}}_{\text{Hidden\\ Layer}~1} \stackrel{\iterate{x}{1}}{\bigcirc} \underbrace{\stackrel{\theta_2}{\longrightarrow} \stackrel{\iterate{z}{2}}{\bigcirc} \stackrel{\sigma}{\longrightarrow}}_{\text{Hidden\\ Layer}~2} \stackrel{\iterate{x}{2}}{\bigcirc} \underbrace{\stackrel{\theta_3}{\longrightarrow}}_{\text{Output\\ layer}} }_{\hat{y}=f_\theta(x)} \stackrel{\hat{y}}{\bigcirc} \stackrel{L}{\longrightarrow} \stackrel{\text{Loss}}{\bigcirc} \]
\(\quad\)In mathematical terms
\[\begin{align*} \hat{y} &= \theta_3 \big( \iterate{x}{2} \big) \fragment{= \theta_3 \big( \sigma \big( \iterate{z}{2} \big) \big)} \fragment{= \theta_3 \big( \sigma \big( \theta_2 \big( \iterate{x}{1} \big) \big) \big)} \\ &= \theta_3 \big( \sigma \big( \theta_2 \big( \sigma\big(\iterate{z}{1}\big) \big) \big) \big) \fragment{= \theta_3 \underbrace{\sigma\big(\theta_2 \overbrace{\sigma\big( \theta_1 x \big)}^{\text{Hidden\\ Layer}~1} \big)}_{\text{Hidden\\ Layer}~2}} \end{align*}\]
Let’s assume a single sample \((x,y)\) and the MSE loss: \(L(\theta)=(\hat{y}-y)^2\). Using the chain rule, we get:
\(1.\) Gradient w.r.t. \(\theta_3\): \[ \pdiff{L}{\theta_3} = \textcolor{red}{\underbrace{\pdiff{L}{\hat{y}}}_{=2(\hat{y}-y)}}\pdiff{\hat{y}}{\theta_3}. \]
\(2.\) Gradient w.r.t. \(\theta_2\): \[ \pdiff{L}{\theta_2} = \textcolor{red}{\pdiff{L}{\hat{y}}}\textcolor{blue}{\pdiff{\hat{y}}{\iterate{x}{2}}\underbrace{\pdiff{\iterate{x}{2}}{\iterate{z}{2}}}_{=\sigma'}}\pdiff{\iterate{z}{2}}{\theta_2}. \]
\(3.\) Gradient w.r.t. \(\theta_1\): \[ \pdiff{L}{\theta_1} = \textcolor{red}{\pdiff{L}{\hat{y}}}\textcolor{blue}{\pdiff{\hat{y}}{\iterate{x}{2}}\underbrace{\pdiff{\iterate{x}{2}}{\iterate{z}{2}}}_{=\sigma'}}\textcolor{green}{\pdiff{\iterate{z}{2}}{\iterate{x}{1}}\underbrace{\pdiff{\iterate{x}{1}}{\iterate{z}{1}}}_{=\sigma'}}\pdiff{\iterate{z}{1}}{\theta_1}. \]
\(\Rightarrow\) we propagate the loss back through the network, reusing most of the previous calculations!
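The three chain-rule expressions can be written out directly for this scalar network. The sketch below assumes \(\sigma=\tanh\) and arbitrary example values for \(\theta\), \(x\) and \(y\); it reuses the intermediate products in the backward pass and compares the result against central finite differences.

```python
import numpy as np

def sigma(z):
    return np.tanh(z)               # assumed activation function

def sigma_prime(z):
    return 1.0 - np.tanh(z) ** 2

def loss(theta, x, y):
    t1, t2, t3 = theta
    return (t3 * sigma(t2 * sigma(t1 * x)) - y) ** 2

def backprop(theta, x, y):
    t1, t2, t3 = theta
    # Forward pass: store all intermediate quantities.
    z1 = t1 * x
    x1 = sigma(z1)
    z2 = t2 * x1
    x2 = sigma(z2)
    y_hat = t3 * x2
    # Backward pass: each line reuses the product computed before it.
    dL_dyhat = 2.0 * (y_hat - y)                # red term
    g3 = dL_dyhat * x2                          # d y_hat / d theta_3 = x2
    dL_dz2 = dL_dyhat * t3 * sigma_prime(z2)    # red * blue terms
    g2 = dL_dz2 * x1                            # d z2 / d theta_2 = x1
    dL_dz1 = dL_dz2 * t2 * sigma_prime(z1)      # ... * green terms
    g1 = dL_dz1 * x                             # d z1 / d theta_1 = x
    return np.array([g1, g2, g3])

def finite_differences(theta, x, y, eps=1e-6):
    # Central differences as an independent check of the analytic gradient.
    grad = np.zeros(3)
    for i in range(3):
        e = np.zeros(3)
        e[i] = eps
        grad[i] = (loss(theta + e, x, y) - loss(theta - e, x, y)) / (2.0 * eps)
    return grad

theta = np.array([0.5, -1.2, 0.8])   # example parameters (arbitrary)
x, y = 0.7, 0.3                      # single sample (arbitrary)
grad_bp = backprop(theta, x, y)
grad_fd = finite_differences(theta, x, y)
```

Note that the finite-difference check needs two loss evaluations per parameter, whereas backpropagation obtains all three derivatives from one forward and one backward pass.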
In order to mitigate overfitting (i.e., good performance on \(\Dtrain\), poor generalization), neural networks can be regularized by
In machine learning (as well as in nature) many tasks are independent of absolute position.
A function is invariant under some transformation if its output \(y\) remains unchanged under transformations of the input \(x\).
A function is equivariant under some transformation if its output \(y\) is transformed “in the same way” as the input \(x\).
💡 These ideas can be formalized and extended under the framework of group theory, see also geometric deep learning (Bronstein et al. 2021).
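As a small illustration of both notions for translations, consider a 1D signal with periodic boundary. In the sketch below (kernel, signal length and shift are arbitrary assumptions), a circular convolution is equivariant under shifts, while global max pooling on top of it is invariant.

```python
import numpy as np

def circular_conv(x, kernel):
    # Circular cross-correlation: out[i] = sum_j kernel[j] * x[(i + j) mod n]
    n = x.size
    return np.array([
        sum(kernel[j] * x[(i + j) % n] for j in range(kernel.size))
        for i in range(n)
    ])

rng = np.random.default_rng(0)
x = rng.normal(size=8)                  # example signal
kernel = np.array([0.25, 0.5, 0.25])    # example filter
shift = 3
x_shifted = np.roll(x, shift)           # translated input

# Equivariance: convolving the shifted input equals shifting the convolved output.
conv_of_shifted = circular_conv(x_shifted, kernel)
shifted_conv = np.roll(circular_conv(x, kernel), shift)

# Invariance: global max pooling discards the position entirely.
pooled = circular_conv(x, kernel).max()
pooled_shifted = conv_of_shifted.max()
```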
numpy.linalg.pinv).
💡 The layers \(1\) to \(L-1\) of a neural network can also be seen as a feature transform, learned automatically from data.
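To make the feature-transform view concrete, the following sketch keeps a hidden layer fixed and fits only the linear output layer by least squares via `numpy.linalg.pinv`. Everything specific here is an assumption for illustration: the toy data, the single `tanh` hidden layer, and in particular the random (not learned) hidden weights, which merely stand in for the layers \(1\) to \(L-1\) of a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D regression data (assumed for illustration).
x = np.linspace(-1.0, 1.0, 50)
y = np.sin(3.0 * x) + 0.05 * rng.normal(size=x.size)

# Stand-in for layers 1..L-1: one hidden tanh layer with fixed random weights.
n_features = 5
W = rng.normal(size=(1, n_features))
b = rng.normal(size=n_features)

def features(x):
    # Feature transform phi(x), shape (N, n_features)
    return np.tanh(x[:, None] @ W + b)

# Output layer: linear least squares on the features via the pseudoinverse.
Phi = features(x)
theta_out = np.linalg.pinv(Phi) @ y
y_hat = Phi @ theta_out
mse = np.mean((y_hat - y) ** 2)
```

In a trained network the feature weights `W` and `b` would be optimized jointly with `theta_out` by gradient descent rather than drawn at random.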