[Transfer Learning] What is the purpose of normalization?

Hi Everybody,

I have been having an intense discussion with my colleagues about dataset normalization in transfer learning.
In particular, we understand why z-score normalization is applied to the dataset when a network is trained from scratch.
However, we are not sure about the purpose of normalization in transfer learning.

Let’s suppose that we want to apply transfer learning to a network that was previously trained on a scalar (1-D) dataset. This dataset, named dataset0, has mean u0 and standard deviation std0. The network was trained from scratch on dataset0 z-score-normalized with u0 and std0; that is, u0 was subtracted from each data point and the result was divided by std0.
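As a quick illustration of the z-score normalization described above, here is a minimal NumPy sketch with a toy stand-in for dataset0 (the values are made up for illustration):

```python
import numpy as np

# Toy stand-in for dataset0 (illustrative values only).
dataset0 = np.array([2.0, 4.0, 6.0, 8.0])

u0 = dataset0.mean()    # mean of dataset0
std0 = dataset0.std()   # (population) standard deviation of dataset0

# z-score normalization: subtract u0, divide by std0.
normalized = (dataset0 - u0) / std0

print(normalized.mean(), normalized.std())  # ~0.0 and ~1.0
```

By construction, the normalized dataset has zero mean and unit standard deviation — this is exactly the property that, as discussed below, is lost when foreign statistics are used.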

When we want to apply transfer learning with the pretrained network, the common data-transformation routine that precedes training on the new dataset, named dataset1, is the following:

  1. apply data augmentation techniques to the dataset
  2. convert to a tensor, which also rescales the values to the range [0, 1]
  3. normalize with the mean u0 and standard deviation std0 of dataset0, the dataset used to train the pre-trained model
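Steps 2 and 3 can be sketched in plain NumPy. The raw values, the [0, 255] input range, and the statistics u0 and std0 are all assumptions for illustration; note that after the to-tensor rescaling, u0 and std0 must be expressed on the same [0, 1] scale (this is why torchvision recipes pass, e.g., 0.485 rather than 124):

```python
import numpy as np

# Hypothetical statistics of dataset0 (raw scale, values in [0, 255]).
u0, std0 = 120.0, 40.0

# A toy batch from dataset1 (raw values, made up for illustration).
dataset1 = np.array([100.0, 130.0, 200.0])

# Step 2: ToTensor-style rescaling to [0, 1].
scaled = dataset1 / 255.0

# Step 3: normalize with dataset0's statistics, rescaled to the same
# [0, 1] range so they match the scaled data.
normalized = (scaled - u0 / 255.0) / (std0 / 255.0)

print(normalized)  # [-0.5, 0.25, 2.0]
```

Because the factor 255 cancels, this is equivalent to (x - u0) / std0 applied on the raw scale — the combined pipeline is still a z-score normalization with dataset0's statistics.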

The question is: “Why are u0 and std0 used in step 3?”
One might think that the mean and standard deviation of dataset1 should be used instead. If u0 and std0 are applied to dataset1, the resulting normalized dataset will not necessarily have zero mean or unit standard deviation.

Our hypothesis is that using u0 and std0 keeps a statistical consistency between the two datasets. In particular, a “1” in dataset0 is mapped to “(1-u0)/std0” by the normalization in the training-from-scratch phase. A “1” in dataset1 is then mapped to the same value “(1-u0)/std0” in the transfer-learning phase if u0 and std0 are used to normalize dataset1. If u1 and std1 were used as normalization parameters instead, two equal values in dataset0 and dataset1 would be mapped to two different values after normalization.
This explanation is what we think lies behind torchvision.transforms.Normalize(u0, std0) in transfer learning.
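The consistency argument can be checked with a few lines of arithmetic, using made-up statistics for the two datasets:

```python
# Hypothetical statistics (made up for illustration).
u0, std0 = 5.0, 2.0   # dataset0
u1, std1 = 7.0, 4.0   # dataset1

x = 1.0  # the same raw value appearing in both datasets

# Normalizing both phases with dataset0's statistics maps equal
# raw values to equal normalized values.
from_scratch = (x - u0) / std0       # -2.0
transfer_with_u0 = (x - u0) / std0   # -2.0, identical

# Normalizing dataset1 with its own statistics maps the same raw
# value somewhere else, breaking the consistency.
transfer_with_u1 = (x - u1) / std1   # -1.5

print(from_scratch, transfer_with_u0, transfer_with_u1)
```

The pretrained weights only ever saw inputs on the (x - u0) / std0 scale, so keeping dataset1 on that same scale presents the network with inputs it can interpret.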

What is your idea about that?

Thank you in advance.

Borelg & matRazor


This is my understanding (it may well be wrong).

Normalization is done to bring the data towards the standard Gaussian distribution N(0, 1). When we use the model trained on dataset0 for transfer learning (to dataset1), we expect the data in dataset1 to be in N(0, 1) as well. If dataset1 is large enough, I would say it is wise to use the mean and std of dataset1 to normalize the data in transfer learning. But transfer learning is mostly done for datasets with considerably less data, so the calculated statistics (mean and std) might not represent the distribution of dataset1. In that case, we can instead use the statistics (mean and std) of dataset0 for normalization.

The practice of using the mean and std of dataset0 for normalization is prevalent when using ImageNet pre-trained models. As mentioned above, the ImageNet statistics are calculated from over 1 million natural images, so they are likely to hold for other (considerably smaller) datasets of natural images as well.
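The small-sample argument can be illustrated with synthetic data: a large sample from a distribution recovers its mean and std closely, while a tiny sample (the typical transfer-learning regime) gives noisy estimates. All numbers here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the true data distribution is N(5, 2).
true_mean, true_std = 5.0, 2.0

# A large dataset (ImageNet-like) yields stable statistics...
large = rng.normal(true_mean, true_std, size=100_000)

# ...while a small transfer-learning dataset yields noisy ones.
small = rng.normal(true_mean, true_std, size=10)

print(large.mean(), large.std())  # close to 5.0 and 2.0
print(small.mean(), small.std())  # can be noticeably off
```

This is why reusing the large-dataset statistics can be safer than estimating new ones from a handful of samples, provided the two datasets come from a similar distribution.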