- Why is the initialisation of a network so important?
- Deeper networks generally seem to work better, provided they can be trained.
Consider a randomly initialised neural network.
For each layer \(l\)
\[h^l_i = \sum_j W^l_{ij} y^l_j + b_i^l\] \[ y^{l+1}_i = \phi\left(h^l_i\right) \]
where \(W^{l}_{ij} \sim N\left(0, \frac{\sigma_w^2}{N_{l-1}}\right)\) and \(b^l_i \sim N(0, \sigma^2_b)\), with \(N_{l-1}\) the width of layer \(l-1\).
The signal input is \(y^0_i = x_i.\) The numbers \(\sigma_w^2\) and \(\sigma_b^2\) are constants for the whole network.
Each \(h^l_i\) is approximately Gaussian, since for wide layers it is a sum of many independent random terms (central limit theorem).
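As a sanity check, one can propagate a signal through such a random network numerically. The following is a minimal sketch (not from the notes): it builds a wide, randomly initialised \(\tanh\) network with illustrative width, depth, \(\sigma_w^2\) and \(\sigma_b^2\), and prints the empirical variance of the pre-activations \(h^l\) layer by layer.

```python
# Minimal sketch: forward-propagate one input through a random tanh network
# and record the empirical variance of the pre-activations per layer.
# Width, depth, sigma_w^2 and sigma_b^2 are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

width, depth = 1000, 30          # all layers have the same width here (assumption)
sigma_w2, sigma_b2 = 1.5, 0.05   # sigma_w^2 and sigma_b^2

y = rng.standard_normal(width)   # input signal y^0 = x
for l in range(depth):
    W = rng.normal(0.0, np.sqrt(sigma_w2 / width), size=(width, width))
    b = rng.normal(0.0, np.sqrt(sigma_b2), size=width)
    h = W @ y + b                # pre-activations h^l
    print(f"layer {l:2d}: empirical var(h) = {h.var():.4f}")
    y = np.tanh(h)               # activations y^{l+1} = phi(h^l)
```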
Goal: find recursion relations for the statistics of the signal (its variance and covariance) from layer to layer.
But the pre-activations in the previous layer are also identically Gaussian distributed, with variance \(q^{l-1} = \left<\left(h^{l-1}_i\right)^2\right>\), so their average squared activation becomes a Gaussian integral
\[\frac{1}{N_{l-1}} \sum_{i=1}^{N_{l-1}}\left<\left(\phi\left(h_i^{l-1}\right)\right)^2\right> = \int_{-\infty}^{\infty} \frac{dz}{\sqrt{2\pi}} \mathrm{e}^{-\frac{z^2}{2}} \phi^2\left(\sqrt{q^{l-1}}z\right)\]
Together with the weight and bias statistics, this gives a recursion for the variance of the pre-activations,
\[q^l = f\left(q^{l-1} \,\middle|\, \sigma^2_w, \sigma^2_b\right) = \sigma_w^2 \int_{-\infty}^{\infty} \frac{dz}{\sqrt{2\pi}} \mathrm{e}^{-\frac{z^2}{2}} \phi^2\left(\sqrt{q^{l-1}}z\right) + \sigma^2_b\]
At a fixed point \(q^*\), the variance maps to itself
\[q^* = f\left(q^* | \sigma^2_w, \sigma^2_b\right)\]
The stability of the fixed point depends on the slope of \(f\) at \(q^*\): the fixed point is stable if \(\left|f'(q^*)\right| < 1\).
Depending on the values of \(\sigma^2_w\) and \(\sigma^2_b\), the network has a nonzero fixed point \(q^*\); for the \(\tanh\) activation it is always stable.
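The variance map and its fixed point can be evaluated numerically. Below is a minimal sketch (assumed, not part of the notes) that computes \(f(q \mid \sigma_w^2, \sigma_b^2)\) for \(\phi = \tanh\) using Gauss–Hermite quadrature, iterates it to the fixed point \(q^*\), and estimates the slope \(f'(q^*)\); the values of \(\sigma_w^2\), \(\sigma_b^2\) and the quadrature order are illustrative.

```python
# Minimal sketch of the variance map q^l = f(q^{l-1}) for phi = tanh,
# iterated to its fixed point q*, plus a finite-difference stability check.
import numpy as np

nodes, weights = np.polynomial.hermite.hermgauss(60)  # for integrals against e^{-x^2}

def gauss_mean(g):
    """E_{z ~ N(0,1)}[g(z)] via Gauss-Hermite quadrature."""
    return np.sum(weights * g(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)

def f(q, sigma_w2, sigma_b2, phi=np.tanh):
    # q^l = sigma_w^2 * E_z[phi(sqrt(q^{l-1}) z)^2] + sigma_b^2
    return sigma_w2 * gauss_mean(lambda z: phi(np.sqrt(q) * z) ** 2) + sigma_b2

sigma_w2, sigma_b2 = 1.5, 0.05   # illustrative values
q = 1.0
for _ in range(200):             # iterate the map until it settles at q*
    q = f(q, sigma_w2, sigma_b2)
print("q* ≈", q)

# Stability: slope of f at q*; |f'(q*)| < 1 means the fixed point is stable.
eps = 1e-5
slope = (f(q + eps, sigma_w2, sigma_b2) - f(q - eps, sigma_w2, sigma_b2)) / (2 * eps)
print("f'(q*) ≈", slope)
```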
Covariance measures how similar two signals are, and we would expect similar inputs to produce similar outputs. Consider two inputs \(x_a^0\) and \(x_b^0\) propagated through the same network.
\[q^l_{ab} = \frac{1}{N_l}\sum_{i=1}^{N_l} h^l_i\!\left(x_a^0\right) h^l_i\!\left(x_b^0\right)\]
The corresponding correlation coefficient is \(c^l_{12} = q^l_{12}/\sqrt{q^l_{11}\, q^l_{22}}\). One can show that \(c_{12} = 1\) is always a fixed point, but it is not always stable; its stability is governed by
\[\chi_1 = \frac{\partial c^l_{12}}{\partial c^{l-1}_{12}} = \sigma_w^2 \int D z_1\, Dz_2\, \phi'(u_1)\phi'(u_2)\]
where \(Dz\) denotes the standard Gaussian measure, \(u_1 = \sqrt{q^*}\, z_1\), \(u_2 = \sqrt{q^*}\left(c^* z_1 + \sqrt{1-(c^*)^2}\, z_2\right)\), and the derivative is evaluated at the fixed point. At \(c^* = 1\) this reduces to \(\chi_1 = \sigma_w^2 \int Dz\, \left[\phi'\!\left(\sqrt{q^*} z\right)\right]^2\).
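This is easy to evaluate numerically at the \(c^* = 1\) fixed point, where the double integral collapses to a single Gaussian average. A minimal sketch (assumed, not from the notes; the value of \(q^*\) is only a placeholder and would come from iterating the variance map above):

```python
# Minimal sketch: chi_1 at the c = 1 fixed point for phi = tanh,
# chi_1 = sigma_w^2 * E_z[phi'(sqrt(q*) z)^2].
import numpy as np

nodes, weights = np.polynomial.hermite.hermgauss(60)

def gauss_mean(g):
    """E_{z ~ N(0,1)}[g(z)] via Gauss-Hermite quadrature."""
    return np.sum(weights * g(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)

def dtanh(x):                      # phi'(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

def chi1(q_star, sigma_w2):
    return sigma_w2 * gauss_mean(lambda z: dtanh(np.sqrt(q_star) * z) ** 2)

q_star, sigma_w2 = 0.68, 1.5       # placeholder q*; illustrative sigma_w^2
c1 = chi1(q_star, sigma_w2)
print("chi_1 ≈", c1, "->", "c = 1 stable" if c1 < 1 else "c = 1 unstable")
```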
Both the variance and the correlation approach their fixed points exponentially with depth, which defines the depth scales \(\xi_q\) and \(\xi_c\):
\[\left|q^l - q^*\right|\sim \mathrm{e}^{-l/\xi_q} \hspace{1cm} \left|c^l - c^*\right|\sim \mathrm{e}^{-l/\xi_c}\]
For the \(\tanh\) activation, \(\xi_q\) always stays finite, but can become large close to the transition. At the transition, \(\xi_c\) always diverges.
Even a small amount of dropout makes the \(c = 1\) fixed point unstable.
Applying similar reasoning to backpropagation, one finds that the variance \(\tilde q^l_{aa}\) of the backpropagated errors satisfies
\[\tilde q_{aa}^l = \tilde q_{aa}^L\, \mathrm{e}^{-\frac{L-l}{\xi_\nabla}}, \hspace{1cm} \xi^{-1}_\nabla = -\log \chi_1\]
Observation: the covariance between gradients can be shown to still follow the \(\xi_c\) length scale.
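One can check the predicted scaling empirically by backpropagating a random error through a randomly initialised \(\tanh\) network and recording the variance of the per-layer errors \(\delta^l\), which should scale roughly like \(\chi_1^{L-l} = \mathrm{e}^{-(L-l)/\xi_\nabla}\). A rough sketch (assumed; width, depth and the \(\sigma^2\) values are illustrative):

```python
# Rough sketch: backpropagate a random error through a random tanh network
# and watch how the variance of the per-layer errors delta^l changes with depth.
import numpy as np

rng = np.random.default_rng(1)
width, depth = 1000, 30
sigma_w2, sigma_b2 = 1.5, 0.05

# Forward pass, storing weights and pre-activations for the backward pass.
Ws, hs = [], []
y = rng.standard_normal(width)
for l in range(depth):
    W = rng.normal(0.0, np.sqrt(sigma_w2 / width), size=(width, width))
    b = rng.normal(0.0, np.sqrt(sigma_b2), size=width)
    h = W @ y + b
    Ws.append(W)
    hs.append(h)
    y = np.tanh(h)

# Backward pass: delta^l = phi'(h^l) * (W^{l+1})^T delta^{l+1}.
delta = rng.standard_normal(width)           # random error injected at the top
for l in reversed(range(depth)):
    delta = delta * (1.0 - np.tanh(hs[l]) ** 2)   # multiply by phi'(h^l)
    print(f"layer {l:2d}: var(delta) = {delta.var():.3e}")
    delta = Ws[l].T @ delta                       # pass the error down to layer l-1
```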
[Figure: colour indicates the training accuracy on MNIST; red is good.]