Revision


Batch Normalization

Batch Normalization is a method to improve the stability of a neural network during training.


Internal covariate shift

Each layer of a neural network has inputs with a corresponding distribution, which is affected during training by the randomness of the parameter initialization and of the input data. The effect of these sources of randomness on the distribution of the inputs to internal layers during training is described as internal covariate shift. Although a clear-cut, precise definition seems to be missing, the phenomenon observed in experiments is a change in the means and variances of the inputs to internal layers during training.

Batch normalization was initially proposed to mitigate internal covariate shift. During training, as the parameters of the preceding layers change, the distribution of inputs to the current layer changes accordingly, so that the current layer needs to constantly readjust to new distributions. This problem is especially severe for deep networks, because small changes in shallower hidden layers are amplified as they propagate through the network, resulting in a significant shift in deeper hidden layers. Batch normalization was therefore proposed to reduce these unwanted shifts, speeding up training and producing more reliable models.


Other benefits

Besides reducing internal covariate shift, batch normalization is believed to bring several other benefits. With this additional operation, the network can use a higher learning rate without vanishing or exploding gradients. Furthermore, batch normalization seems to have a regularizing effect that improves the network's generalization, reducing the need for dropout to mitigate overfitting. It has also been observed that with batch normalization the network becomes more robust to different initialization schemes and learning rates.


Formula

Training

Let \(B\) be a batch (a sub-sample of the data) of size \(m\) during the training of the neural network. The mean of the output of layer \(l\) over batch \(B\) is \(\mu_l^B\) and the variance is \(\left(\sigma^2\right)_l^B\):

\[\mu_l^B = \frac{1}{m} \sum_{j=1}^{m} x_l^{(j)}\] \[\left(\sigma^2\right)_l^B = \frac{1}{m}\sum_{j=1}^{m} (x_l^{(j)} - \mu_l^B)^2\]

Now, for a given output \(x_l^{(j)}\) of layer \(l\) for batch \(B\), its post-Batch-Normalization value \(\hat{x}_l^{(j)}\) is:

\[\hat{x}_l^{(j)} = \frac{x_l^{(j)} - \mu_l^B}{\sqrt{\left(\sigma^2\right)_l^B + \varepsilon}}\]

Where \(\varepsilon\) is a small constant added to the variance for numerical stability.
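To make this concrete, here is a minimal NumPy sketch of the training-time normalization for a batch of layer outputs (the function and variable names are illustrative, and the learnable scale and shift parameters used by full implementations are left out since they do not appear in the formulas above):

```python
import numpy as np

def batch_norm_train(x, eps=1e-5):
    """Normalize a batch x of shape (m, n_units) of outputs of a layer."""
    mu = x.mean(axis=0)                    # per-unit batch mean
    var = x.var(axis=0)                    # per-unit batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized outputs
    return x_hat, mu, var

# Example: a batch of m = 4 outputs of a layer with 3 units
x = np.random.randn(4, 3)
x_hat, mu, var = batch_norm_train(x)
```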


Inference

During inference no batch is available: predictions are made input by input, so the mean and variance cannot be computed on the fly.

The idea is to save the mean and variance of the whole training set, estimated from the \(n_B\) training batches, and reuse them:

\[\mu_l = \frac{1}{n_B} \sum_{i=1}^{n_B} \mu_l^{B_i}\] \[\sigma_l^2 = \frac{1}{n_B-1} \sum_{i=1}^{n_B} \left(\sigma^2\right)_l^{B_i}\]

Then for a layer’s input \(x_l\):

\[\hat{x}_l = \frac{x_l - \mu_l}{\sqrt{\sigma_l^2 + \varepsilon}}\]
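The following sketch, under the same assumptions as the training example above (illustrative names, no learnable scale and shift), aggregates the saved per-batch statistics as in the formulas above and applies them to a single input. Note that, in practice, many frameworks instead keep an exponential moving average of the batch statistics during training.

```python
import numpy as np

def aggregate_statistics(batch_means, batch_vars):
    """Combine the per-batch means and variances saved during training."""
    n_b = len(batch_means)
    mu = np.mean(batch_means, axis=0)             # average of the batch means
    var = np.sum(batch_vars, axis=0) / (n_b - 1)  # as in the variance formula above
    return mu, var

def batch_norm_inference(x, mu, var, eps=1e-5):
    """Normalize a single input with the stored statistics."""
    return (x - mu) / np.sqrt(var + eps)
```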


Practical consideration

Where to apply Batch Normalization is a subject of debate. The authors of the original Batch Norm paper suggest using it before the activation function, but François Chollet explained on the Keras GitHub that it is better to use it after the activation function.
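To make the two orderings concrete, here is a small Keras sketch of both variants (the layer sizes and input shape are arbitrary placeholders):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Variant 1: Batch Normalization before the activation (as in the original paper)
bn_before = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    layers.Dense(64, use_bias=False),  # the bias is redundant when BN follows
    layers.BatchNormalization(),
    layers.Activation("relu"),
])

# Variant 2: Batch Normalization after the activation
bn_after = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    layers.Dense(64),
    layers.Activation("relu"),
    layers.BatchNormalization(),
])
```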


Resources

See: