Shouldn't mean & std be calculated across both the train & validation datasets?

[Context]
Book: Deep Learning with PyTorch by Eli Stevens, Luca Antiga, and Thomas Viehmann @lantiga
Jupyter Notebook: Part 1 Chapter 7

As shown below, the means and standard deviations for the RGB channels used in normalization are the same for both the CIFAR10 training and validation datasets.

And they were calculated from the training set only, as shown in a previous notebook in the same chapter:
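Roughly, that computation looks like this (a minimal sketch assuming the standard `torchvision` CIFAR-10 loader; `data_path` is a placeholder):

```python
import torch
from torchvision import datasets, transforms

data_path = 'data/'  # placeholder; adjust to your setup
cifar10_train = datasets.CIFAR10(data_path, train=True, download=True,
                                 transform=transforms.ToTensor())

# Stack all 50,000 training images into one (3, 32, 32, N) tensor,
# then reduce over every pixel of each channel.
imgs = torch.stack([img for img, _ in cifar10_train], dim=3)
mean = imgs.view(3, -1).mean(dim=1)  # per-channel means
std = imgs.view(3, -1).std(dim=1)    # per-channel stds
print(mean, std)  # roughly (0.4915, 0.4823, 0.4468) and (0.2470, 0.2435, 0.2616)
```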

[Question]
Shouldn’t we calculate these means and standard deviations across the entire dataset, i.e., across both the training and validation datasets, by stacking them up, as shown below?
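That is, something along these lines (a hypothetical sketch of what I mean, not code from the book):

```python
import torch
from torchvision import datasets, transforms

data_path = 'data/'  # placeholder
train_set = datasets.CIFAR10(data_path, train=True, download=True,
                             transform=transforms.ToTensor())
val_set = datasets.CIFAR10(data_path, train=False, download=True,
                           transform=transforms.ToTensor())

# Stack the 50,000 training and 10,000 validation images together
# and compute the statistics over the combined 60,000 images.
imgs = torch.stack([img for img, _ in train_set] +
                   [img for img, _ in val_set], dim=3)
mean = imgs.view(3, -1).mean(dim=1)
std = imgs.view(3, -1).std(dim=1)
```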

Or is the book’s way of calculating the means and stds for normalization from the training dataset only correct, because it prevents information in the validation dataset from “contaminating” the normalization and, consequently, affecting the training?

Validation and test sets are supposed to be unknown. If you use any information from them, you are cheating the results.

Thanks. So it’s to prevent information from the validation/test datasets from “contaminating” the training, right?

Yes. In fact, the reason both validation and test sets exist is that we still use information from the validation set. When you save a checkpoint or weights based on validation metrics, you are taking the best model for that subset, but that doesn’t mean it’s the best model for the test set.

Simply put: when you validate or test, you cannot use prior information from those sets.
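Concretely, for normalization that means computing the statistics once, on the training split, and reusing them for every split; a minimal sketch (the mean/std values are the ones the book derives from the training set):

```python
from torchvision import transforms

# Statistics computed from the training split only, then applied to
# BOTH the training and the validation transforms.
normalize = transforms.Normalize((0.4915, 0.4823, 0.4468),
                                 (0.2470, 0.2435, 0.2616))
train_tf = transforms.Compose([transforms.ToTensor(), normalize])
val_tf = transforms.Compose([transforms.ToTensor(), normalize])
```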
