Should I calculate dataset mean over just training data, or testing data as well?

This might be somewhat splitting hair, but I am wondering about this:

When you calculate the dataset mean and std, do you compute it over just the training data or do you compute it over the entire dataset (train + test)?

What are the underlying mathematical considerations when making this decision?

Thank you!

Ideally, we will not have access to test data except the assumption that test data will come from same distribution. Thus, we have to calculate mean and std only over train data.

1 Like