Past Review

Past articles in this tutorial series are as follows:

[1] Implement a deep learning framework from scratch | Lay the groundwork

[2] Implement a deep learning framework from scratch | Create a basic learning framework

[3] Implement a deep learning framework from scratch | Activation function, loss function with convolutional layer

[4] Implement a deep learning framework from scratch | Introduction and implementation of common optimization algorithms

[5] Implement a deep learning framework from scratch | When a neural network meets a genetic algorithm

The full source code is available here (stars are welcome):

https://github.com/CharlesPikachu/pytoydl

Note that this series assumes readers have some background in advanced mathematics and machine learning, and can use Python comfortably as a programming tool.

Lead

In the last issue, we trained a program that automatically plays a mini-game using a neural network combined with a genetic algorithm:

Implement a deep learning framework from scratch | When a neural network meets a genetic algorithm

That detour was mainly meant to keep this series from getting too dry, so some fun neural network applications will be interspersed along the way. In this issue, we return to the main topic and take a deeper look at what normalization in neural networks is.

Without further ado, let’s get started.

The origin of normalization

As we all know, deep learning “alchemists” tend to prefer data that satisfies the following condition:

independent and identically distributed (i.i.d.)

The “i.i.d.” that we often see in papers means that the data are independent and identically distributed. For example, the popular naive Bayes algorithm requires the data to be independent and identically distributed. Neural networks, of course, do not strictly require i.i.d. input data, but the experience of countless alchemists tells us that i.i.d. data usually makes model training easier. An intuitive explanation is that if the input distribution of each layer of the network varies greatly, the network must keep adjusting itself to adapt to the changing distributions, which makes learning harder. The problem that the input data of each layer of a neural network is not i.i.d. is commonly called:

Internal Covariate Shift (ICS)

Put more formally, statistical machine learning generally assumes that “the data distributions of the source space and the target space are consistent”. Covariate shift refers to the case where the conditional probabilities of the source space and the target space are the same, that is, for all $x \in \mathcal{X}$:

$$P_s(Y \mid X = x) = P_t(Y \mid X = x)$$

But their marginal probabilities are different:

$$P_s(X) \neq P_t(X)$$

Specifically, a neural network has no i.i.d. constraint, and because of the operations inside each layer, the distribution of a layer’s input signal differs from the distribution of its output signal, while the sample labels they correspond to stay the same. Incidentally, it is called internal covariate shift because we are analyzing the signals between layers inside the network, hence the word “internal” in front of covariate shift.

Therefore, before feeding the data to the neural network, we tend to do some preprocessing of the data:

Independent: remove the correlations between features;

Identically distributed: give all features the same mean and variance.

This preprocessing is generally called “whitening”. The most classic method is PCA whitening. In short, it performs the same operation as PCA dimensionality reduction, except that the number of eigenvectors kept equals the data dimension, and each resulting component is then rescaled to unit variance. Readers who do not know what PCA dimensionality reduction is can refer to the following link: https://zhuanlan.zhihu.com/p/77151308
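As a rough illustration of the PCA whitening described above, here is a minimal numpy sketch. The function name `pca_whiten`, the `eps` stabilizer, and the toy data are assumptions made for this example; they do not come from pytoydl.

```python
import numpy as np

def pca_whiten(X, eps=1e-8):
    """PCA-whiten X of shape (n_samples, n_features), keeping all components."""
    Xc = X - X.mean(axis=0)                  # zero-center every feature
    cov = Xc.T @ Xc / Xc.shape[0]            # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition of a symmetric matrix
    X_rot = Xc @ eigvecs                     # rotate onto the principal axes (decorrelate)
    return X_rot / np.sqrt(eigvals + eps)    # rescale every axis to unit variance

rng = np.random.default_rng(0)
# Toy data: independent Gaussians mixed by a fixed matrix, so the features are correlated.
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 0.5]])
X = rng.normal(size=(1000, 3)) @ A + 5.0

X_white = pca_whiten(X)
cov_white = X_white.T @ X_white / X_white.shape[0]
print(np.round(cov_white, 3))                # close to the identity matrix
```

After whitening, the features have zero mean and a covariance matrix close to the identity, which matches the “independent” and “identically distributed” goals listed above.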

References

[1]. https://zhuanlan.zhihu.com/p/33173246

[2]. http://ufldl.stanford.edu/tutorial/unsupervised/PCAWhitening/

[3]. https://www.zhihu.com/question/38102762/answer/85238569

Normalization algorithm explained in detail

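As mentioned earlier, consider a single neuron in a neural network, which receives an input vector $\mathbf{x} = (x_1, x_2, \dots, x_d)$ and outputs a scalar:

$$y = f(\mathbf{x})$$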

Because of the ICS problem, the distribution of the input data may vary greatly, which is not conducive to learning. The most “standard” solution is to whiten the input data of every layer, but standard whitening is computationally expensive and is not easily made differentiable for backpropagation. Normalization methods, represented by BN, have therefore become the mainstream solution to the ICS problem.

Specifically, the basic idea of the scheme is to shift and scale the input data before feeding it to the neuron. The general formula is as follows:

$$h = f\left(g \cdot \frac{\mathbf{x} - \mu}{\sigma} + b\right)$$

In this formula, $\mu$ is the shift parameter and $\sigma$ is the scale parameter. After shifting and scaling with these two parameters, the data follows a standard distribution with a mean of 0 and a variance of 1:

$$\hat{\mathbf{x}} = \frac{\mathbf{x} - \mu}{\sigma}$$

Next, $b$ is the re-shift parameter and $g$ is the re-scale parameter. The final output of the normalization layer follows a distribution with mean $b$ and variance $g^2$:

$$\mathbf{y} = g \cdot \hat{\mathbf{x}} + b$$

The first step is easy to understand: it makes the input data follow the same distribution. So what is the second step for? It is generally believed to ensure that the expressive power of the model does not drop because of the normalization operation. Readers who know self-attention (which we will cover in a later chapter) may recall that, in the most widely used form of self-attention, the key, value, and query are each obtained through a separate learnable transformation. In my opinion, these transformations all play a similar role in a sense: they enhance the representational power of the model.

At this point, many readers may wonder: with the re-shift and re-scale parameters added, doesn’t everything just go back to the way it was? Not really. Before normalization is introduced, the mean of the input is determined by the parameters of all the layers that precede the current layer. After normalization is introduced, the mean is determined only by the re-shift parameter $b$. This removes the coupling between the input distribution and the parameters of the preceding layers, which makes the model easier to train.
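The shift-and-scale step followed by the re-shift and re-scale step can be sketched in numpy as follows. This is only a sketch under stated assumptions: the names `normalize`, `g`, `b`, and the small `eps` added under the square root are choices made for this example, and the activation function is left out.

```python
import numpy as np

def normalize(x, g=1.0, b=0.0, eps=1e-5):
    """Shift and scale x to zero mean and unit variance, then re-scale by g and re-shift by b."""
    mu = x.mean()                      # shift parameter
    sigma = np.sqrt(x.var() + eps)     # scale parameter (eps is an assumed stabilizer)
    x_hat = (x - mu) / sigma           # standardized data: mean 0, variance close to 1
    return g * x_hat + b               # output: mean b, variance close to g**2

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=4.0, size=10000)   # input with an arbitrary distribution
y = normalize(x, g=2.0, b=3.0)
print(y.mean(), y.var())
```

Whatever the mean and variance of `x` are, the output mean depends only on `b` and the output variance only on `g`, which is the decoupling described above.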

We can give a few simple examples to see what benefits the introduction of normalization brings to network training:

1. Weight scaling invariance

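That is, when the network weights $W$ are scaled by a constant $\lambda$, the resulting normalized value remains unchanged:

$$\mathrm{Norm}(W'\mathbf{x}) = \mathrm{Norm}(W\mathbf{x}), \quad W' = \lambda W$$

The derivation is as follows. Let $\mu$ and $\sigma$ be the mean and standard deviation used for normalization, and let $g$ and $b$ be the re-scale and re-shift parameters. When the weights are scaled, the mean and standard deviation are scaled by the same factor, $\mu' = \lambda\mu$ and $\sigma' = \lambda\sigma$ (taking $\lambda > 0$), so the factor cancels:

$$\mathrm{Norm}(W'\mathbf{x}) = g \cdot \frac{W'\mathbf{x} - \mu'}{\sigma'} + b = g \cdot \frac{\lambda W\mathbf{x} - \lambda\mu}{\lambda\sigma} + b = g \cdot \frac{W\mathbf{x} - \mu}{\sigma} + b = \mathrm{Norm}(W\mathbf{x})$$

Therefore, we have:

$$\frac{\partial\,\mathrm{Norm}(W'\mathbf{x})}{\partial \mathbf{x}} = \frac{\partial\,\mathrm{Norm}(W\mathbf{x})}{\partial \mathbf{x}}$$

That is, scaling the weights does not affect the Jacobian matrix of the backpropagated gradient, which avoids the vanishing and exploding gradients caused by weights that are too large or too small during backpropagation.

In addition, since:

$$\frac{\partial\,\mathrm{Norm}(W'\mathbf{x})}{\partial W'} = \frac{1}{\lambda} \cdot \frac{\partial\,\mathrm{Norm}(W\mathbf{x})}{\partial W}$$

the larger the weights of the network, the smaller the corresponding gradients. This acts as a regularizer on the parameters and avoids large oscillations of the network.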
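Weight scaling invariance can be checked numerically. The sketch below makes several assumptions: it normalizes the pre-activations of one layer with their own mean and standard deviation (no epsilon, so the invariance is exact), uses a made-up scalar loss `c @ norm(W @ x)`, and estimates gradients with central finite differences. The names `norm`, `num_grad`, `loss`, and `lam` are hypothetical.

```python
import numpy as np

def norm(z):
    """Normalize a vector of pre-activations with its own mean and standard deviation."""
    return (z - z.mean()) / z.std()

def num_grad(fn, W, h=1e-6):
    """Central finite-difference gradient of the scalar function fn with respect to W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += h
        Wm[idx] -= h
        grad[idx] = (fn(Wp) - fn(Wm)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))    # weights: 8 neurons, 4 inputs
x = rng.normal(size=4)         # one input sample
c = rng.normal(size=8)         # fixed coefficients that turn the output into a scalar loss
lam = 10.0                     # positive scaling constant

def loss(W_):
    return float(c @ norm(W_ @ x))

# The normalized output is unchanged when the weights are scaled by lam.
output_unchanged = np.allclose(norm(W @ x), norm((lam * W) @ x))

# The gradient with respect to the scaled weights shrinks by a factor of 1 / lam.
grad_w = num_grad(loss, W)
grad_w_scaled = num_grad(loss, lam * W)
grad_shrinks = np.allclose(grad_w_scaled, grad_w / lam, rtol=1e-4, atol=1e-6)
print(output_unchanged, grad_shrinks)
```

With an epsilon under the square root, as most practical implementations use, the invariance holds only approximately.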

2. Data scaling invariance

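That is, when the input data $\mathbf{x}$ is scaled by a constant $\lambda$, the resulting normalized value remains unchanged:

$$\mathrm{Norm}(W\mathbf{x}') = \mathrm{Norm}(W\mathbf{x}), \quad \mathbf{x}' = \lambda\mathbf{x}$$

Therefore, we have:

$$\frac{\partial\,\mathrm{Norm}(W\mathbf{x}')}{\partial W} = \frac{\partial\,\mathrm{Norm}(W\mathbf{x})}{\partial W}$$

In other words, scaling the data does not affect the update of this layer’s weight parameters, which helps keep training stable.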
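Data scaling invariance can be checked the same way. As before, this is a sketch under assumptions: normalization uses the mean and standard deviation of one layer’s pre-activations without an epsilon, the scalar loss is made up for the example, and `norm`, `num_grad`, `loss`, and `lam` are hypothetical names.

```python
import numpy as np

def norm(z):
    """Normalize a vector of pre-activations with its own mean and standard deviation."""
    return (z - z.mean()) / z.std()

def num_grad(fn, W, h=1e-6):
    """Central finite-difference gradient of the scalar function fn with respect to W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += h
        Wm[idx] -= h
        grad[idx] = (fn(Wp) - fn(Wm)) / (2 * h)
    return grad

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 4))    # weights: 8 neurons, 4 inputs
x = rng.normal(size=4)         # one input sample
c = rng.normal(size=8)         # fixed coefficients that turn the output into a scalar loss
lam = 10.0                     # positive scaling constant

def loss(W_, x_):
    return float(c @ norm(W_ @ x_))

# The normalized output is unchanged when the input is scaled by lam.
output_unchanged = np.allclose(norm(W @ x), norm(W @ (lam * x)))

# The gradient with respect to the weights is the same for x and for lam * x.
grad_plain = num_grad(lambda W_: loss(W_, x), W)
grad_scaled_input = num_grad(lambda W_: loss(W_, lam * x), W)
grad_unchanged = np.allclose(grad_plain, grad_scaled_input, rtol=1e-4, atol=1e-6)
print(output_unchanged, grad_unchanged)
```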

Next, let’s learn some of the existing normalization methods.

1. Batch Normalization

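BN can be seen as a “vertical” normalization method, and it fits the normalization formula we mentioned earlier:

$$h = f\left(g \cdot \frac{x - \mu}{\sigma} + b\right)$$

In BN, the parameters $\mu$ and $\sigma$ are calculated as follows:

$$\mu = \frac{1}{M}\sum_{i=1}^{M} x_i, \qquad \sigma = \sqrt{\frac{1}{M}\sum_{i=1}^{M}(x_i - \mu)^2 + \epsilon}$$

where $x_i$ is the input of the neuron for the $i$-th sample, $M$ is the amount of data fed to the network for one gradient update (i.e., the batch size), and $\epsilon$ is a small constant for numerical stability.

[Figure: illustration of how BN is computed]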
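A minimal numpy sketch of the BN forward computation on a mini-batch of shape `(M, D)` might look like this. It covers only the training-time statistics (no running averages for inference), and the names `batch_norm`, `g`, `b`, and `eps` are assumptions for this example rather than the pytoydl API.

```python
import numpy as np

def batch_norm(x, g, b, eps=1e-5):
    """x: mini-batch of shape (M, D). Statistics are computed per feature, over the batch axis."""
    mu = x.mean(axis=0)                       # one mean per feature, shape (D,)
    sigma = np.sqrt(x.var(axis=0) + eps)      # one standard deviation per feature, shape (D,)
    return g * (x - mu) / sigma + b           # re-scale by g and re-shift by b

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))     # batch size M = 32, D = 4 features
y = batch_norm(x, g=np.ones(4), b=np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))                 # per-feature mean and variance
```

Each column (feature) of `y` has a mean close to 0 and a variance close to 1 across the batch, which is why BN is described as normalizing “vertically”.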

2. Layer Normalization

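LN can be seen as a “horizontal” normalization method, and it also fits the normalization formula we mentioned earlier:

$$h = f\left(g \cdot \frac{x - \mu}{\sigma} + b\right)$$

In LN, the parameters $\mu$ and $\sigma$ are calculated as follows:

$$\mu = \frac{1}{D}\sum_{i=1}^{D} x_i, \qquad \sigma = \sqrt{\frac{1}{D}\sum_{i=1}^{D}(x_i - \mu)^2 + \epsilon}$$

where $x_i$ is the input of the $i$-th neuron in the layer, $D$ is the number of neurons in the layer, and $\epsilon$ is a small constant for numerical stability. That is, LN is performed on a single training sample.

[Figure: illustration of how LN is computed]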
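For comparison with BN, a minimal numpy sketch of the LN forward computation is shown below. The only change is the axis over which the statistics are taken. The names `layer_norm`, `g`, `b`, and `eps` are assumptions for this example.

```python
import numpy as np

def layer_norm(x, g, b, eps=1e-5):
    """x: shape (M, D). Statistics are computed per sample, over the feature axis."""
    mu = x.mean(axis=1, keepdims=True)                    # one mean per sample, shape (M, 1)
    sigma = np.sqrt(x.var(axis=1, keepdims=True) + eps)   # one standard deviation per sample
    return g * (x - mu) / sigma + b

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(8, 16))     # 8 samples, 16 features each
y = layer_norm(x, g=np.ones(16), b=np.zeros(16))
print(y.mean(axis=1), y.var(axis=1))                 # per-sample mean and variance
```

Because every sample is normalized on its own, the result does not depend on the batch size or on the other samples in the batch.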

3. Instance Normalization

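IN also fits the normalization formula we mentioned earlier:

$$h = f\left(g \cdot \frac{x - \mu}{\sigma} + b\right)$$

In IN, the parameters $\mu$ and $\sigma$ are calculated as follows:

$$\mu_c = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W} x_{c,h,w}, \qquad \sigma_c = \sqrt{\frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W}(x_{c,h,w} - \mu_c)^2 + \epsilon}$$

where $x_{c,h,w}$ is the value at position $(h, w)$ of the $c$-th feature map of one sample, $H$ and $W$ are the height and width of the feature map, and $\epsilon$ is a small constant for numerical stability. That is, IN is performed on a single feature map of a single training sample.

[Figure: illustration of how IN is computed]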
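A minimal numpy sketch of IN on a batch of feature maps follows. It assumes the `(N, C, H, W)` layout and leaves out the learnable re-scale and re-shift parameters for brevity. The names `instance_norm` and `eps` are chosen for this example.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """x: shape (N, C, H, W). Each feature map of each sample is normalized on its own."""
    mu = x.mean(axis=(2, 3), keepdims=True)                    # shape (N, C, 1, 1)
    sigma = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)   # shape (N, C, 1, 1)
    return (x - mu) / sigma

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(2, 3, 8, 8))   # 2 samples, 3 feature maps of 8x8
y = instance_norm(x)
print(y.mean(axis=(2, 3)))                              # one value per (sample, feature map)
```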

4. Group Normalization

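GN is in effect a compromise between IN and LN, and it also fits the normalization formula we mentioned earlier:

$$h = f\left(g \cdot \frac{x - \mu}{\sigma} + b\right)$$

In GN, the parameters $\mu$ and $\sigma$ are calculated as follows:

$$\mu_k = \frac{1}{GHW}\sum_{c \in \mathcal{S}_k}\sum_{h=1}^{H}\sum_{w=1}^{W} x_{c,h,w}, \qquad \sigma_k = \sqrt{\frac{1}{GHW}\sum_{c \in \mathcal{S}_k}\sum_{h=1}^{H}\sum_{w=1}^{W}(x_{c,h,w} - \mu_k)^2 + \epsilon}$$

where $\mathcal{S}_k$ is the set of feature maps in the $k$-th group, $H$ and $W$ are the height and width of a feature map, $\epsilon$ is a small constant for numerical stability, and $G$ represents the number of feature maps in a single group. That is, GN is performed on a group of feature maps of a single training sample.

[Figure: illustration of how GN is computed]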
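A minimal numpy sketch of GN follows, using the convention of this article that the group size is the number of feature maps in a single group. It assumes the `(N, C, H, W)` layout and omits the learnable re-scale and re-shift parameters. The names `group_norm`, `channels_per_group`, and `eps` are chosen for this example.

```python
import numpy as np

def group_norm(x, channels_per_group, eps=1e-5):
    """x: shape (N, C, H, W). Normalize each group of feature maps of each sample."""
    n, c, h, w = x.shape
    assert c % channels_per_group == 0, "C must be divisible by the group size"
    # Split the channel axis into (number of groups, feature maps per group).
    xg = x.reshape(n, c // channels_per_group, channels_per_group, h, w)
    mu = xg.mean(axis=(2, 3, 4), keepdims=True)
    sigma = np.sqrt(xg.var(axis=(2, 3, 4), keepdims=True) + eps)
    return ((xg - mu) / sigma).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(2, 6, 4, 4))   # 2 samples, 6 feature maps of 4x4
y = group_norm(x, channels_per_group=3)                 # 2 groups of 3 feature maps

# Limiting cases: one feature map per group behaves like IN,
# and all feature maps in one group behaves like LN over (C, H, W).
y_like_in = group_norm(x, channels_per_group=1)
y_like_ln = group_norm(x, channels_per_group=6)
print(y.shape, y_like_in.shape, y_like_ln.shape)
```

The two limiting cases show why GN can be viewed as sitting between IN and LN.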

5. Weight Normalization

The first four normalization methods all apply normalization to the input feature data, while WN takes a different approach and applies normalization to network weights.

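Specifically, WN breaks down the network weight vector $\mathbf{w}$ into two parts, the vector direction $\hat{\mathbf{v}}$ and the vector modulus $g$:

$$\mathbf{w} = g \cdot \hat{\mathbf{v}} = g \cdot \frac{\mathbf{v}}{\lVert\mathbf{v}\rVert}$$

where $\mathbf{v}$ is a vector with the same dimension as $\mathbf{w}$ and $\lVert\mathbf{v}\rVert$ is its Euclidean norm. Since $\lVert\mathbf{w}\rVert = |g|$, the norm of the weight vector is pinned to $|g|$, which is the effect we want. We then have:

$$\mathbf{w} \cdot \mathbf{x} = g \cdot \frac{\mathbf{v}}{\lVert\mathbf{v}\rVert} \cdot \mathbf{x} = \mathbf{v} \cdot \left(g \cdot \frac{\mathbf{x}}{\lVert\mathbf{v}\rVert}\right)$$

Comparing this with the formula we summarized earlier:

$$h = f\left(g \cdot \frac{\mathbf{x} - \mu}{\sigma} + b\right)$$

It can be seen that in WN, $\mu$ and $b$ are fixed to 0, and $\sigma = \lVert\mathbf{v}\rVert$. In other words, WN is equivalent to scaling the input data by the Euclidean norm of the neuron’s weights.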
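The WN decomposition and its equivalence to scaling the input can be checked with a few lines of numpy. The variable names `v`, `g`, `w`, and `x` are chosen for this sketch: a direction vector, a scalar length, the resulting weight vector, and one input sample.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=5)    # direction vector, same dimension as the weight vector
g = 2.5                   # scalar that sets the length of the weight vector
x = rng.normal(size=5)    # one input sample

v_norm = np.linalg.norm(v)        # Euclidean norm of v
w = g * v / v_norm                # reparameterized weight: length g, direction of v

lhs = w @ x                       # pre-activation computed with the weight w
rhs = v @ (g * x / v_norm)        # same value, seen as scaling the input by g / ||v||

norm_is_g = np.isclose(np.linalg.norm(w), abs(g))
same_output = np.isclose(lhs, rhs)
print(norm_is_g, same_output)
```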

6. Cosine Normalization

After normalizing the data and the weights, researchers set their sights on the operator between the data and the weights, and CN was born. Specifically, they changed the original linear transformation $f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}$ to the cosine of the angle $\theta$ between the weights and the input:

$$f_{\mathbf{w}}(\mathbf{x}) = \cos\theta = \frac{\mathbf{w} \cdot \mathbf{x}}{\lVert\mathbf{w}\rVert \cdot \lVert\mathbf{x}\rVert}$$

So all the data is normalized to the interval $[-1, 1]$. Simply put, CN is essentially built on top of WN, and it can likewise be fitted into the normalization formula we summarized at the beginning.
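A minimal numpy sketch of the cosine computation for a single neuron is shown below. The function name `cosine_norm` and the small `eps` guarding against division by zero are assumptions made for this example.

```python
import numpy as np

def cosine_norm(w, x, eps=1e-12):
    """Replace the dot product w . x by the cosine of the angle between w and x."""
    return (w @ x) / (np.linalg.norm(w) * np.linalg.norm(x) + eps)

rng = np.random.default_rng(0)
w = rng.normal(size=6)    # weight vector of one neuron
x = rng.normal(size=6)    # one input sample

c = cosine_norm(w, x)
c_rescaled = cosine_norm(100.0 * w, 0.01 * x)   # rescale both the weights and the data
print(c, c_rescaled)
```

The output always lies in [-1, 1] and does not change when the weights or the data are rescaled by positive constants.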

All in all, researchers have proposed many kinds of normalization so far, and their applicable scenarios differ. For example, LN is generally used in the field of NLP, while BN is suitable for scenarios where the distributions of different mini-batches, and of each mini-batch and the overall data, are approximately the same. Even so, they all share the same root and can be fitted into the same unified formula.

References

[1]. https://arxiv.org/abs/1803.08494

[2]. https://zhuanlan.zhihu.com/p/33173246

Next Issue Notice:

In the next article, we’ll use numpy to implement several common normalization methods in our own deep learning framework, pytoydl:

https://github.com/CharlesPikachu/pytoydl

We will also describe their advantages and disadvantages in detail.