kaishen, 2 Mar, 2018
The original paper can be found here.
This is one of my favourite papers in the Deep Learning area. I will keep updating this post as I gain a deeper understanding of this field.
Bear in mind that BN (Batch Normalization) is a technique that can significantly accelerate the training of deep networks. If the many knowledge points become confusing, just remember: it accelerates training. Accept this first, then we will learn how to use the technique, and finally why it works.
The first simplification is that instead of whitening the features of the layer inputs and outputs jointly, we normalize each scalar feature independently, making it have zero mean and unit variance.
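Concretely, for the $k$-th dimension, this is the per-dimension normalization from the paper, where the mean and variance are computed over the mini-batch:

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$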
This simplification is shown in the code snippet below. Note that the operation is column-wise (each dimension is normalized independently).
import numpy as np

# x: (N, D) mini-batch; each feature (column) is normalized independently
mean = np.mean(x, axis=0)
var = np.var(x, axis=0)
x_hat = (x - mean) / np.sqrt(var + eps)
out = gamma * x_hat + beta                     # scale and shift with learnable gamma, beta
cache = (mean, var, x_hat, eps, gamma, beta)   # saved for the backward pass
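As a quick sanity check (a minimal sketch; the batch x and the values of gamma, beta and eps are made up here, not from the paper), running the snippet above on a random mini-batch should give columns with roughly zero mean and unit variance:

import numpy as np

np.random.seed(0)
x = np.random.randn(64, 10) * 5.0 + 3.0        # fake mini-batch: 64 examples, 10 features
gamma, beta, eps = np.ones(10), np.zeros(10), 1e-5

mean = np.mean(x, axis=0)
var = np.var(x, axis=0)
x_hat = (x - mean) / np.sqrt(var + eps)
out = gamma * x_hat + beta

print(np.allclose(out.mean(axis=0), 0, atol=1e-7))   # True: per-feature mean ~ 0
print(np.allclose(out.var(axis=0), 1, atol=1e-3))    # True: per-feature variance ~ 1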
The original algorithm for BN is shown below.
During training, we need to backpropagate the gradient of loss
$L$ through this transformation, as well as compute the gradients with respect to the parameters of the BN transform.
If you are not familiar with the chain rule and don't understand how they are computed, never mind. First check this equation in this wiki, i.e.
$$\frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}$$
Then you can check this great post. In case that post goes down, I have my own derivation.
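For reference, these are the gradients as written in the original paper, for a mini-batch of size $m$ with batch mean $\mu$ and variance $\sigma^2$:

$$\frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i}\cdot\gamma, \qquad \frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m}\frac{\partial L}{\partial y_i}\cdot\hat{x}_i, \qquad \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m}\frac{\partial L}{\partial y_i}$$

$$\frac{\partial L}{\partial \sigma^2} = \sum_{i=1}^{m}\frac{\partial L}{\partial \hat{x}_i}\cdot(x_i-\mu)\cdot\left(-\tfrac{1}{2}\right)(\sigma^2+\epsilon)^{-3/2}, \qquad \frac{\partial L}{\partial \mu} = \sum_{i=1}^{m}\frac{\partial L}{\partial \hat{x}_i}\cdot\frac{-1}{\sqrt{\sigma^2+\epsilon}} + \frac{\partial L}{\partial \sigma^2}\cdot\frac{\sum_{i=1}^{m}-2(x_i-\mu)}{m}$$

$$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial \hat{x}_i}\cdot\frac{1}{\sqrt{\sigma^2+\epsilon}} + \frac{\partial L}{\partial \sigma^2}\cdot\frac{2(x_i-\mu)}{m} + \frac{\partial L}{\partial \mu}\cdot\frac{1}{m}$$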
The code snippet for BN backpropagation is shown below.
mean, var, x_hat, eps, gamma, beta = cache
dgamma = np.sum(dout * x_hat, axis=0)   # gradient w.r.t. the scale parameter
dbeta = np.sum(dout, axis=0)            # gradient w.r.t. the shift parameter
N = dout.shape[0]                       # mini-batch size
dxhat = dout * gamma
# this simplified expression comes from the derivation; refer to https://kevinzakka.github.io/2016/09/14/batch_normalization/
dx = (N * dxhat - np.sum(dxhat, axis=0) - x_hat * np.sum(x_hat * dxhat, axis=0)) / (N * np.sqrt(var + eps))

All of the above code can be found here.
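A cheap way to gain confidence in the backward pass is a finite-difference check. Below is a minimal sketch (x, gamma, beta, dout and the helper forward are all made-up names for illustration) comparing the analytic dgamma against a numerical gradient of a scalar loss:

import numpy as np

np.random.seed(1)
x = np.random.randn(32, 5)
gamma, beta, eps = np.random.randn(5), np.random.randn(5), 1e-5
dout = np.random.randn(32, 5)                      # upstream gradient

def forward(g, b):
    mu, v = x.mean(axis=0), x.var(axis=0)
    return g * (x - mu) / np.sqrt(v + eps) + b

# analytic gradients (same formulas as in the snippet above)
x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
dgamma = np.sum(dout * x_hat, axis=0)

# numerical gradient of the scalar loss sum(forward * dout) w.r.t. gamma[0]
h = 1e-6
g_plus, g_minus = gamma.copy(), gamma.copy()
g_plus[0] += h
g_minus[0] -= h
num_dgamma0 = (np.sum(forward(g_plus, beta) * dout) - np.sum(forward(g_minus, beta) * dout)) / (2 * h)
print(np.isclose(num_dgamma0, dgamma[0], rtol=1e-4))   # True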
To be honest, understanding why BN works is not easy. There are many knowledge points, which are hard to keep in mind all at once and connect together. I write this part down in case I forget some of the points, and I will update it when I gain new understanding.
If you want to check how Ian Goodfellow describes BN, please check here.
If you want to check how CS231n describes BN, please check here.
In the following discussion I will stick to the sigmoid activation function, because it shows saturation more clearly. Consider such a forward pass in a DNN (Deep Neural Network):
affine1 -> BN1 -> sigmoid1 -> affine2 -> BN2 -> sigmoid2
As shown in the following picture, the output of the affine1 layer may follow this distribution (just for illustration).
If this goes directly into sigmoid1, most of the values will fall into the saturation region of the sigmoid. Later, in the backward pass, those gradients will be close to zero and result in slow training (a.k.a. gradient vanishing).
On the other hand, with the help of BN1, the output is mapped to a distribution with zero mean and unit variance (initially, with $\gamma = 1$ and $\beta = 0$), so most of the values fall into the roughly linear region of the sigmoid.
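A quick numerical illustration of this (a sketch with made-up numbers, not from the paper): a wide pre-activation distribution pushes most sigmoid derivatives toward zero, while the normalized version keeps them well away from zero.

import numpy as np

np.random.seed(2)
z = np.random.randn(10000) * 20.0          # wide pre-activation distribution (e.g. affine1 output)
z_bn = (z - z.mean()) / z.std()            # what BN1 would feed into sigmoid1 (gamma=1, beta=0)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
grad = lambda t: sigmoid(t) * (1.0 - sigmoid(t))   # local derivative of the sigmoid

# fraction of units whose local gradient is almost zero (saturated)
print((grad(z) < 0.01).mean())      # roughly 0.8: most units are saturated
print((grad(z_bn) < 0.01).mean())   # ~0.0: almost nothing is saturated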
Think of it this way: BN provides a relatively "fixed" distribution for the layers behind it. For example, no matter how the parameters of affine1 change during training, sigmoid1 always receives inputs with roughly the same mean and variance. Here, BN acts like a barrier that isolates each layer from distribution changes in the layers before it:
changes in affine1 -> Barrier -> sigmoid1 -> changes in affine2 -> Barrier -> sigmoid2
In the backward pass, since most of the sigmoid activations are inside the roughly linear region, the number of near-zero gradients is reduced. So BN alleviates the vanishing-gradient problem.
Batch Normalization also makes training more resilient to the parameter scale. With Batch Normalization, back-propagation through a layer is unaffected by the scale of its parameters. Indeed, for a scalar $a$,

$$\mathrm{BN}(Wu) = \mathrm{BN}((aW)u).$$

This is because the normalized distribution is determined by $Wu$ up to scale: multiplying $W$ by $a$ scales the mean and standard deviation by the same factor, which the normalization cancels out, and then we can show that

$$\frac{\partial\,\mathrm{BN}((aW)u)}{\partial u} = \frac{\partial\,\mathrm{BN}(Wu)}{\partial u}, \qquad \frac{\partial\,\mathrm{BN}((aW)u)}{\partial (aW)} = \frac{1}{a}\cdot\frac{\partial\,\mathrm{BN}(Wu)}{\partial W}.$$
This enables higher learning rates.
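A small numerical check of this scale-invariance claim (a sketch; W, u and a are arbitrary made-up values):

import numpy as np

np.random.seed(3)
u = np.random.randn(128, 4)        # layer input (mini-batch)
W = np.random.randn(4, 6)          # weight matrix
a = 7.3                            # arbitrary scalar

def bn(x, eps=1e-5):               # plain normalization, gamma=1, beta=0
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# BN(Wu) and BN((aW)u) are numerically the same (up to the eps term)
print(np.allclose(bn(u @ W), bn(u @ (a * W)), atol=1e-4))   # True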
$\gamma$ and $\beta$ are learnable parameters that scale and shift the normalized value, so that the transform inserted into the network can represent the identity. With this setting, by choosing $\gamma = \sqrt{\mathrm{Var}[x]}$ and $\beta = \mathrm{E}[x]$, we could recover the original activations, if that were the optimal thing to do.
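A sanity check of that statement (a sketch; x is a made-up batch): with $\gamma = \sqrt{\mathrm{Var}[x]}$ and $\beta = \mathrm{E}[x]$, the normalization is undone.

import numpy as np

np.random.seed(4)
x = np.random.randn(256, 3) * 2.5 + 1.0
eps = 1e-5

mean, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mean) / np.sqrt(var + eps)
gamma, beta = np.sqrt(var), mean            # the "identity" setting
out = gamma * x_hat + beta

print(np.allclose(out, x, atol=1e-3))       # True: the original activations are recovered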
There is a lot of debate on whether BN should be placed before or after the activation layer (ReLU). The authors of the original paper suggest normalizing $x = Wu + b$ rather than the nonlinearity output $u$, because:
u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is "more Gaussian"; normalizing it is likely to produce activations with a stable distribution.
Some people find that with the BN layer placed after the ReLU, performance improves. However, there is no solid evidence, and why this happens remains to be studied further.



