I learned from the forum that these are used because it has been computed on imagenet database. My doubt is why can’t we use mean and std per image for nomalization?

You can really use what you want. Normalizing wrt dataset makes all the samples to belong to a distribution with zero mean and unitary variance. The other way around your input follows several distributions.

Thanks for the answer. I wan’t thinking like that. I though an images pixels distribution is contingent to the source from where it is obtained and thus uniquely could have high contrast, low saturation etc. Therefore, it is better to normalize individually than all together. The main purpose of normalization was to have faster learning and better convergence. For classification task objective would be learn distribution over images to discriminate between classes.

If you normilize your pixels wrt images, you are losing perspective of differences between images. You map everything into the same normal distribution like fig 2.
If you normalize wrt dataset, each image still has unique distribution with its own mean and std.

You can probably yet classify learning distributions over images but there is no gain on loosing information about differences between images.

In addition we are simplifying ideal classification that would be having per pixel estimators rather than per-channel estimators.

Anyway normalization was dataset-wise because ideally we are supposed to iterate over the whole dataset instead of using mini-batch strategy. There are arguments about using batch-wise normalization instead of dataset.

Thank you again for your answer. I understand that while I will be still able to learn the classifying distribution but I will be loosing the perspective of differences between images which is further emphasized by the plots. However, I do get the intuition why would we loose the value of difference between images?

I currently do the following

# img_tensor is of shape C x H x W
mean_vals= img_tensor.view(img_tensor.shape[0], -1).mean(dim=1)
std_vals = img_tensor.view(img_tensor.shape[0], -1).std(dim=1)
normalizer = transforms.Normalize(mean=mean_vals, std=std_vals)
img_tensor = normalizer(img_tensor) -- (1)

instead of

# Consider the images are from ImageNet
mean_vals = [0.485, 0.456, 0.406]
std_vals = [0.229, 0.224, 0.225]
normalizer = transforms.Normalize(mean=mean_vals, std=std_vals)
img_tensor = normalizer(img_tensor) -- (2)

You are saying the (1) would give me middle plot and (2) would give me the first or third plot.

In the end, when you normalize you are getting zero mean and unitary variance.
This is roughly a circle in a plot.

Suppose you have an image big (and variate) enough. If you sample pixels from that image and plot them, you would end up filling the whole circle, namely, having points for all the possible situations.

As you image-wise normalized , each image follows same distribution, 0 mean unitary variance. If you sample pixels from different images and you plot them together you would get 2nd plot. All the points fall in same range.

On contrary, if you dataset-wise normalize, each image would have a different distribution with a unique mean and unique variance. If you sample pixels from each image, you would get different distributions like plot 3.

Explaining it the other way around.
If you apply per-image normalization and then get a single image. You would observe mean of its pixels is zero, and std 1.

If you apply dataset normalization and then you get a single image you would observe mean of its pixels is not zero and std is not one.

Thanks Juan. You make it more clear now. Now, I have to think about when in past I have normalized 2D data per subject and still got results comparable to SOTA. Maybe I will rerun the experiments someday.

How did we get these numbers. Any ideas? Maybe if we glue all imagenet images into a single big image and then calculate the meant and std? Is this the way?

The mathematical way is just summing up everything channel-wise for the whole dataset and dividing by amount of pixels.

As imagenet is huge they may have chosen a subset of images or maybe mean per-image means (if images are same size it’s equivalent to the first operation).

Would be great if someone can point to the archive paper on why it is used, and how it is calculated exactly. Good for the reference. Also, I would like to say there is an ideal set of transforms for the whole dataset, that would help the machine learning, but this may not be true until some other constraints are set, for instance it may be paired with ideal learning rates, or some other hp. Still, the lr may change over time so looks complicated.

In tensorflow official resnet model, they use per image normalization.
The 2D example is right only for low dimension data.
I think for high dimension data, normalization does not mix different classes because they have different directions.

You can normalize your image wrt itself. This would be taking an image and for each channel: summing up all its pixels, compute mean and std and normalizing .

Yoy can normalize an image wrt the dataset it comes from. This would be taking the whole dataset and for each channel of each image: summing up all its pixels. Once you have the total sum for the whole dataset, just divide by the amount of pixels (of the whole dataset). Obviously this may be very slow or not feasible for a very big dataset in which case you can take a representative subset or to use some maths