Batch Normalization

kaishen, 2 Mar, 2018

The original paper can be found here.

This is one of my favourite papers in the Deep Learning area. I will keep updating this post as I gain a deeper understanding of the field.

Important thing first

Bear in mind that BN (Batch Normalization) is a technique that can significantly accelerate deep network training. If the many knowledge points get confusing, just remember: it accelerates training. Accept this, and then we will learn how to use the technique, and then why it works.

Implementation of BN

The first simplification is that, instead of whitening the features in the layer inputs and outputs jointly, we normalize each scalar feature independently, making it have zero mean and unit variance.

This simplification is shown in the code snippet below; note that the operation is column-based (each dimension is normalized independently).

    # x: (N, D) mini-batch; gamma, beta: (D,) learned scale and shift
    mean = np.mean(x, axis=0)                     # per-feature mean over the batch
    var = np.var(x, axis=0)                       # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)       # normalize each column to zero mean, unit variance
    out = gamma * x_hat + beta                    # scale and shift
    cache = (mean, var, x_hat, eps, gamma, beta)  # saved for the backward pass

The $\gamma$ and $\beta$ are learned parameters, and they will be updated during the training procedure.
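To see the forward pass end to end, here is a minimal, self-contained sketch; the wrapper name batchnorm_forward and the toy shapes are my own and not part of the original code.

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        # column-wise statistics: one mean/variance per feature dimension
        mean = np.mean(x, axis=0)
        var = np.var(x, axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)
        out = gamma * x_hat + beta
        cache = (mean, var, x_hat, eps, gamma, beta)
        return out, cache

    # toy mini-batch: N = 4 samples, D = 3 features, badly scaled on purpose
    x = np.random.randn(4, 3) * 10 + 5
    gamma, beta = np.ones(3), np.zeros(3)   # the usual initialization (see point 1 below)
    out, _ = batchnorm_forward(x, gamma, beta)
    print(out.mean(axis=0))  # ~0 for every column
    print(out.var(axis=0))   # ~1 for every column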

The original algorithm for the BN forward pass is shown below.

[Figure: BN forward update]

During training, we need to backpropagate the gradient of the loss $L$ through this transformation, as well as compute the gradients with respect to the parameters of the BN transform.

[Figure: BN backpropagation]

If you are not familiar with the chain rule and don't see how these gradients are computed, never mind. First check this equation on the wiki.

$\frac{\partial y}{\partial x_i}=\sum\limits_{l=1}^{m} \frac{\partial y}{\partial u_l} \cdot \frac{\partial u_l}{\partial x_i}$

For example, with two intermediate variables $x$ and $y$ that both depend on $r$:

$\frac{\partial u}{\partial r}=\frac{\partial u}{\partial x} \cdot \frac{\partial x}{\partial r} + \frac{\partial u}{\partial y} \cdot \frac{\partial y}{\partial r} $
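As a concrete check of this formula (a toy example of my own, not from the post), take $u = x^2 + y$ with $x = 3r$ and $y = r^2$, so the chain rule gives $\frac{\partial u}{\partial r} = 2x \cdot 3 + 1 \cdot 2r = 20r$:

    import numpy as np

    def u_of_r(r):
        x, y = 3.0 * r, r ** 2    # intermediate variables
        return x ** 2 + y         # u depends on r only through x and y

    r = 1.5
    analytic = 20.0 * r           # from the chain rule above
    h = 1e-6
    numeric = (u_of_r(r + h) - u_of_r(r - h)) / (2 * h)
    print(analytic, numeric)      # both ~30.0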

Then you can check this great post. In case that post goes down, I have my own derivation here.

The code snippet for BN backpropagation is shown as follows:

    # unpack the values saved during the forward pass
    mean, var, x_hat, eps, gamma, beta = cache
    N = dout.shape[0]
    dgamma = np.sum(dout * x_hat, axis=0)   # gradient w.r.t. the scale parameter
    dbeta = np.sum(dout, axis=0)            # gradient w.r.t. the shift parameter
    dxhat = dout * gamma
    # this simplified expression for dx comes from the derivation, refer to
    # https://kevinzakka.github.io/2016/09/14/batch_normalization/
    dx = (N * dxhat - np.sum(dxhat, axis=0)
          - x_hat * np.sum(x_hat * dxhat, axis=0)) / (N * np.sqrt(var + eps))

All of the above code can be found here.
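As a quick sanity check of the backward pass, dx can be compared against a numerical gradient. This is only a sketch, assuming the two snippets above are wrapped as batchnorm_forward(x, gamma, beta, eps) and batchnorm_backward(dout, cache) returning (dx, dgamma, dbeta); the helper numerical_grad is hypothetical.

    import numpy as np

    def numerical_grad(f, x, h=1e-5):
        # centered finite differences, one element at a time
        grad = np.zeros_like(x)
        it = np.nditer(x, flags=['multi_index'])
        while not it.finished:
            idx = it.multi_index
            orig = x[idx]
            x[idx] = orig + h; fp = f(x)
            x[idx] = orig - h; fm = f(x)
            x[idx] = orig
            grad[idx] = (fp - fm) / (2 * h)
            it.iternext()
        return grad

    np.random.seed(0)
    x = np.random.randn(5, 4)
    gamma, beta = np.random.randn(4), np.random.randn(4)
    dout = np.random.randn(5, 4)

    out, cache = batchnorm_forward(x, gamma, beta, 1e-5)
    dx, dgamma, dbeta = batchnorm_backward(dout, cache)

    # scalar loss sum(out * dout), whose gradient w.r.t. out is exactly dout
    f = lambda x_: np.sum(batchnorm_forward(x_, gamma, beta, 1e-5)[0] * dout)
    dx_num = numerical_grad(f, x)
    print(np.max(np.abs(dx - dx_num)))  # should be tiny (~1e-8 or smaller)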

Understanding the BN

To be honest, understanding why BN works is not easy. There are many knowledge points which are difficult to keep in mind at the same time and connect together. I write this post in case I forget some of the points, and I will update this part when I obtain new understanding.

If you want to check how Ian Goodfellow describes BN, please check here.

If you want to check how CS231n describes BN, please check here.


In the following discussion, I want to stick to the sigmoid activation function because it shows the saturation more clearly. Consider such a forward pass in a DNN (Deep Neural Network):

affine1 -> BN1 -> sigmoid1 -> affine2 -> BN2 -> sigmoid2

point 1

As shown in the following picture, the output of the affine1 layer may follow a distribution like this (just for illustration).

[Figure: arbitrary distribution]

If this goes directly into sigmoid1, most of the values will obviously fall into the saturation region of the sigmoid. Later, in the backward pass, those gradients will be close to zero, resulting in slow training (aka gradient vanishing).

On the other hand, with the help of BN1, the output is mapped to the distribution below. Initially, $\gamma$ is set to 1 while $\beta$ is set to 0, which gives just a unit Gaussian distribution.

[Figure: fixed distribution]

Think of it this way: BN provides a relatively "fixed" distribution for the layers behind it. For example, $W_2$ and $b_2$ in the affine2 layer will take the following as input:

$y_i = \gamma \hat x_i + \beta$

$sigmoid(y_i)$

Here $\hat{x_i} \sim \mathcal{N}(0,1)$. If $\gamma$ and $\beta$ change gradually and slowly, the distribution of $y_i$ is stable, and in turn the output of $sigmoid(y_i)$ is stable. That means $W_2, b_2$ don't have to readjust much to compensate for changes in $W_1, b_1$: although changes in $W_1, b_1$ will change the distribution of $x_i$, $\hat{x_i}$ remains $\mathcal{N}(0,1)$, and the distribution of $y_i$ is controlled by $\gamma$ and $\beta$. So sigmoid1's input and output stay quite stable, and $W_2, b_2$ need not constantly readjust themselves to adapt to a new distribution. With many layers, this saves a lot of effort. It is somewhat as if a barrier stopped the changes in one layer from affecting the next.

changes in affine1 -> Barrier -> sigmoid1 -> changes in affine2 -> Barrier -> sigmoid2
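A small numerical illustration of this point (toy numbers of my own, not from the post): count how many pre-activations fall into the sigmoid's saturation region before and after normalization.

    import numpy as np

    np.random.seed(0)
    # pretend affine1 produced badly scaled and shifted outputs
    pre_act = np.random.randn(1000, 50) * 4.0 + 6.0

    # normalize each feature, with gamma = 1 and beta = 0 as at initialization
    normed = (pre_act - pre_act.mean(axis=0)) / np.sqrt(pre_act.var(axis=0) + 1e-5)

    saturated = lambda z: np.mean(np.abs(z) > 4.0)   # |z| > 4 => sigmoid is nearly flat
    print(saturated(pre_act))   # a large fraction (roughly 0.7) is saturated
    print(saturated(normed))    # almost nothing is saturated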

point 2

In the backward pass, since most of the activations (sigmoid) are inside the near-linear region, the cases where the gradient is (close to) zero are reduced. So BN can alleviate the gradient vanishing problem.
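The same thing can be read off the sigmoid derivative $\sigma'(z) = \sigma(z)(1-\sigma(z))$; a quick check with toy values of my own:

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

    print(dsigmoid(np.array([0.0, 2.0, 6.0, 10.0])))
    # ~[0.25, 0.105, 0.0025, 0.000045]: deep in the saturation region the
    # local gradient is practically zero, killing the backpropagated signal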

point 3

Batch Normalization also makes training more resilient to the parameter scale. With Batch Normalization, back-propagation through a layer is unaffected by the scale of its parameters. Indeed, for a scalar $a$,

$BN(Wu)=BN((aW)u)$

This is because the scale $a$ cancels out in the normalization (both $(aW)u$ and its batch mean and standard deviation are scaled by $a$), so the output distribution is determined only by $\gamma$ and $\beta$.

and then we can show that

$\frac{\partial BN((aW)u)}{\partial u} = \frac{\partial BN(Wu)}{\partial u}$

$\frac{\partial BN((aW)u)}{\partial(aW)}= \frac{\partial BN(Wu)}{\partial (aW)}= \frac{1}{a} \cdot \frac{\partial BN(Wu)}{\partial W}$

This enables higher learning rates.
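This invariance is easy to verify numerically. A sketch with made-up shapes, reusing the hypothetical batchnorm_forward wrapper from above:

    import numpy as np

    np.random.seed(0)
    u = np.random.randn(8, 10)          # layer input
    W = np.random.randn(10, 5)          # layer weights
    gamma, beta = np.ones(5), np.zeros(5)
    a = 100.0                           # arbitrary rescaling of the weights

    out1, _ = batchnorm_forward(u @ W, gamma, beta)
    out2, _ = batchnorm_forward(u @ (a * W), gamma, beta)
    print(np.max(np.abs(out1 - out2)))  # ~0 (up to the eps term): BN(Wu) == BN((aW)u)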

point 4

$\gamma$ and $\beta$ can be learned; this increases the model's flexibility and representational power. The network can learn the $\gamma$ and $\beta$ that best minimize the loss, i.e. the distribution that the network prefers. When the network decides that the original distribution is the best, then

$\gamma = \sqrt{Var[x]}$

$\beta = E[x]$

With this setting, we could recover the original activations, if that were the optimal thing to do.
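A tiny numerical check (my own toy example, again reusing the hypothetical batchnorm_forward) that this choice of $\gamma$ and $\beta$ recovers the original activations:

    import numpy as np

    np.random.seed(0)
    x = np.random.randn(6, 3) * 2.0 + 7.0
    eps = 1e-5

    gamma = np.sqrt(x.var(axis=0) + eps)    # sqrt(Var[x]) (eps included so it cancels exactly)
    beta = x.mean(axis=0)                   # E[x]

    out, _ = batchnorm_forward(x, gamma, beta, eps)
    print(np.max(np.abs(out - x)))          # ~0: the BN layer acts as the identity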

point 5

There is a lot of debate on whether BN should be applied before or after the activation layer (ReLU). The authors of the original paper suggest that

since x = Wu + b, where u is likely the output of another non-linearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is "more Gaussian"; normalizing it is likely to produce activations with a stable distribution.

Some people find that placing the BN layer after the ReLU increases performance. However, there is no solid evidence, and why this is the case remains to be studied further.
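For concreteness, the two orderings under debate look like this in a forward pass; this is only a sketch (reusing the hypothetical batchnorm_forward), and which ordering works better is, as said above, still open.

    import numpy as np

    relu = lambda z: np.maximum(z, 0.0)

    def forward_bn_before_relu(u, W, b, gamma, beta):
        # the ordering suggested by the original paper: affine -> BN -> ReLU
        return relu(batchnorm_forward(u @ W + b, gamma, beta)[0])

    def forward_bn_after_relu(u, W, b, gamma, beta):
        # the alternative ordering some people report works better: affine -> ReLU -> BN
        return batchnorm_forward(relu(u @ W + b), gamma, beta)[0]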