


Initialisation

The weights of a neural network must be initialised to some starting values. A sound initialisation is essential for good gradient flow and for the convergence of the network.

Conversely, a bad initialisation leads to exploding or vanishing gradients.

In the following, \(n_l\) denotes the number of neurons in layer \(l\).


LeCun initialisation

LeCun initialisation was the first scheme designed to keep the layers' weights well scaled and to preserve the gradient flow. There exist two variants of LeCun initialisation, uniform and normal.

Uniform
\[W^l \sim \mathcal{U} \left[-\frac{\sqrt{3}}{\sqrt{n_l}}, \frac{\sqrt{3}}{\sqrt{n_l}}\right]\]
Normal
\[W^l \sim \mathcal{N}\left(0, \frac{1}{n_l}\right)\]
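
As a rough illustration only, here is a minimal NumPy sketch of these two draws. The function names, the use of \(n_l\) as the fan-in, and the \((n_l, n_{out})\) shape convention are assumptions made for the example, not part of the original definition.

```python
import numpy as np

def lecun_uniform(n_l, n_out, rng=None):
    """Draw W^l from U[-sqrt(3)/sqrt(n_l), +sqrt(3)/sqrt(n_l)]."""
    if rng is None:
        rng = np.random.default_rng()
    limit = np.sqrt(3.0 / n_l)
    return rng.uniform(-limit, limit, size=(n_l, n_out))

def lecun_normal(n_l, n_out, rng=None):
    """Draw W^l from N(0, 1/n_l), i.e. standard deviation 1/sqrt(n_l)."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.normal(0.0, np.sqrt(1.0 / n_l), size=(n_l, n_out))
```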


Xavier (or Glorot) initialisation

Xavier initialisation sets the weights of the network so that the values remain approximately standard normal after the matrix product \(W X\) and a \(\tanh\) activation. The initialisation depends on the number of neurons in the current layer, \(n_l\), and in the next layer, \(n_{l+1}\).

There exist two variants of Xavier initialisation, uniform and normal.

Uniform
\[W^l \sim \mathcal{U} \left[-\frac{\sqrt{6}}{\sqrt{n_l + n_{l+1}}}, \frac{\sqrt{6}}{\sqrt{n_l + n_{l+1}}}\right]\]
Normal
\[W^l \sim \mathcal{N}\left(0, \frac{2}{n_l + n_{l+1}}\right)\]
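
A similar sketch for the two Xavier draws, under the same assumptions (hypothetical function names, \(n_l\) as fan-in and \(n_{l+1}\) as fan-out of the weight matrix). The check at the end only illustrates the intended property: with unit-variance inputs, the pre-activations \(W X\) keep a variance close to 1.

```python
import numpy as np

def xavier_uniform(n_l, n_lp1, rng=None):
    """Draw W^l from U[-sqrt(6/(n_l + n_{l+1})), +sqrt(6/(n_l + n_{l+1}))]."""
    if rng is None:
        rng = np.random.default_rng()
    limit = np.sqrt(6.0 / (n_l + n_lp1))
    return rng.uniform(-limit, limit, size=(n_l, n_lp1))

def xavier_normal(n_l, n_lp1, rng=None):
    """Draw W^l from N(0, 2/(n_l + n_{l+1}))."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.normal(0.0, np.sqrt(2.0 / (n_l + n_lp1)), size=(n_l, n_lp1))

# quick sanity check: with unit-variance inputs and n_l == n_{l+1},
# the pre-activation variance stays close to 1 under the Xavier scheme
rng = np.random.default_rng(0)
x = rng.normal(size=(10_000, 256))
W = xavier_normal(256, 256, rng)
print(np.var(x @ W))  # ~1.0
```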

He initialisation

He initialisation is a more recent (but simple, and very similar to LeCun initialisation) way of initialising the weights.

It has been observed that Xavier initialisation does not work very well with the ReLU activation (its unit-variance property is derived for the \(\tanh\) activation function). He initialisation is an alternative designed to work better with the ReLU activation function.

Uniform
\[W^l \sim \mathcal{U}\left[-\frac{\sqrt{6}}{\sqrt{n_l}}, \frac{\sqrt{6}}{\sqrt{n_l}}\right]\]
Normal
\[W^l \sim \mathcal{N}\left(0, \frac{2}{n_l}\right)\]
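
A sketch of the two He draws under the same assumptions. The check at the end illustrates why the factor 2 appears: since ReLU zeroes roughly half of the pre-activations, the larger variance keeps the second moment of the outputs close to 1.

```python
import numpy as np

def he_uniform(n_l, n_out, rng=None):
    """Draw W^l from U[-sqrt(6/n_l), +sqrt(6/n_l)]."""
    if rng is None:
        rng = np.random.default_rng()
    limit = np.sqrt(6.0 / n_l)
    return rng.uniform(-limit, limit, size=(n_l, n_out))

def he_normal(n_l, n_out, rng=None):
    """Draw W^l from N(0, 2/n_l), i.e. standard deviation sqrt(2/n_l)."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.normal(0.0, np.sqrt(2.0 / n_l), size=(n_l, n_out))

# quick sanity check: with unit-variance inputs, the extra factor of 2
# compensates for ReLU zeroing half of the pre-activations, so the
# second moment of the ReLU outputs stays close to 1
rng = np.random.default_rng(0)
x = rng.normal(size=(10_000, 256))
h = np.maximum(0.0, x @ he_normal(256, 256, rng))
print(np.mean(h ** 2))  # ~1.0
```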


Resources

See: