Shouldn't mean & std be calculated across both the train & validation datasets?

[Context]
Book: Deep Learning with PyTorch by Eli Stevens, Luca Antiga, and Thomas Viehmann @lantiga
Jupyter Notebook: Part 1 Chapter 7

As shown below, the means and standard deviations for the RGB channels used in normalization are the same for both the CIFAR10 training and validation datasets.

And they were calculated from the training set only, as shown in a previous notebook in the same chapter:
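Roughly, that computation looks like this (a minimal sketch assuming the standard `torchvision` CIFAR-10 loader; `data_path` is a placeholder):

```python
import torch
from torchvision import datasets, transforms

data_path = 'data/'  # placeholder; adjust to your setup
cifar10_train = datasets.CIFAR10(data_path, train=True, download=True,
                                 transform=transforms.ToTensor())

# Stack all 50,000 training images into one (3, 32, 32, N) tensor,
# then reduce over every pixel of each channel.
imgs = torch.stack([img for img, _ in cifar10_train], dim=3)
mean = imgs.view(3, -1).mean(dim=1)  # per-channel means
std = imgs.view(3, -1).std(dim=1)    # per-channel stds
print(mean, std)  # roughly (0.4915, 0.4823, 0.4468) and (0.2470, 0.2435, 0.2616)
```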

[Question]
Shouldn’t we calculate these means and standard deviations across the entire dataset, i.e., across both the training and validation datasets, by stacking them up, as shown below?
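That is, something along these lines (a hypothetical sketch of what I mean, not code from the book):

```python
import torch
from torchvision import datasets, transforms

data_path = 'data/'  # placeholder
train_set = datasets.CIFAR10(data_path, train=True, download=True,
                             transform=transforms.ToTensor())
val_set = datasets.CIFAR10(data_path, train=False, download=True,
                           transform=transforms.ToTensor())

# Stack the 50,000 training and 10,000 validation images together
# and compute the statistics over the combined 60,000 images.
imgs = torch.stack([img for img, _ in train_set] +
                   [img for img, _ in val_set], dim=3)
mean = imgs.view(3, -1).mean(dim=1)
std = imgs.view(3, -1).std(dim=1)
```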

Or is the book’s way of calculating the means and stds for normalization from the training dataset only correct, because it prevents information in the validation dataset from “contaminating” the normalization and, consequently, affecting the training?

Validation and test sets are supposed to be unknown. If you use any information from them, you are cheating the results.

Thanks. So it’s to prevent information from the validation/test datasets from “contaminating” the training, right?

Yes. In fact, the reason both validation and test sets exist is that we still use information from the validation set. When you save a checkpoint or weights based on validation metrics, you are taking the best model for that subset, but that doesn’t mean it’s the best model for the test set.

Simply put: when you validate or test, you cannot use prior information from those sets.
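Concretely, for normalization that means computing the statistics once, on the training split, and reusing them for every split; a minimal sketch (the mean/std values are the ones the book derives from the training set):

```python
from torchvision import transforms

# Statistics computed from the training split only, then applied to
# BOTH the training and the validation transforms.
normalize = transforms.Normalize((0.4915, 0.4823, 0.4468),
                                 (0.2470, 0.2435, 0.2616))
train_tf = transforms.Compose([transforms.ToTensor(), normalize])
val_tf = transforms.Compose([transforms.ToTensor(), normalize])
```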
