Mean is not 0 after subtracting mean

autoencoder · April 13, 2021, 2:08pm

I have a (178741, 304) dataset where the first dimension indexes the samples and the second dimension holds the features of each sample. The means of the features range between -0.75 and 0.99, and I want to normalize the dataset so that the mean is 0.0 afterwards.

x = ...  # the dataset: a torch.float32 tensor of shape (178741, 304)
x_mean = x.mean(axis=0)
print("Before", x_mean.min(), x_mean.max())

x_centered = x - x_mean
x_centered_mean = x_centered.mean(axis=0)
print("After", x_centered_mean.min(), x_centered_mean.max())

However, after the normalization the mean is still not 0. It got much smaller, but is still relatively high:

Before tensor(-0.748372793198) tensor(0.999999940395)
After tensor(-0.001484304899) tensor(0.000192862455)

The same happens when I try to normalize the std. The resulting std is not exactly 1. (All I do is divide x_centered by the std. I have no constant features that could cause std=0)

What could be the reason for these instabilities? I am using torch.float32 values. In the original dataset the smallest value is -0.999 and the largest is 1.0. The mean over all dimensions is 0.11.

I read that very large values can cause bugs like this, but I have only small values. Am I not using the right data type?

EDIT - Better results with double instead of float
I realized that there is also a double data type. When I convert the dataset to double before centering the mean, I get a much better result. The mean reaches exactly 0, but as soon as I divide by the STD, the mean deviates again a little bit from 0.0 (range -0.00001 to 0.000001). But still better than before.
Would you recommend centering it again after dividing by the std? I saw that sklearn does that (https://github.com/scikit-learn/scikit-learn/blob/7389dbac82d362f296dc2746f10e43ffa1615660/sklearn/preprocessing/data.py#L158).

albanD · April 13, 2021, 3:21pm

Hi,

I’m afraid it is indeed a problem of numerical precision of floating point numbers. The problem here is mostly that the dimension you reduce over (178741) is big enough that the very small errors made by each individual op build up to something that is visible at the end. Also keep in mind that single-precision (float32) numbers have only about 6 to 7 significant decimal digits, so anything smaller than about 1e-6 here would be “noise”. With double precision, you can expect 1e-11.
If you do need more precision, double precision numbers are the way to go indeed.
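
To make the difference concrete, here is a minimal sketch that repeats the centering experiment in both precisions. The data is a synthetic stand-in (uniform values in [-1, 1) with only 16 features), and `max_abs_centered_mean` is a hypothetical helper name, so the exact residuals will differ from the ones in the question and will vary by device and PyTorch version.

```python
import torch

# Machine epsilon: the relative spacing between adjacent representable numbers.
print(torch.finfo(torch.float32).eps)  # about 1.19e-07
print(torch.finfo(torch.float64).eps)  # about 2.22e-16

# Synthetic stand-in for the dataset (assumption): uniform values in [-1, 1).
torch.manual_seed(0)
x32 = torch.rand(178741, 16, dtype=torch.float32) * 2 - 1
x64 = x32.double()  # same values, stored as float64


def max_abs_centered_mean(x):
    """Largest absolute per-feature mean left after subtracting the mean."""
    centered = x - x.mean(dim=0)
    return centered.mean(dim=0).abs().max().item()


res32 = max_abs_centered_mean(x32)
res64 = max_abs_centered_mean(x64)
print("float32 residual:", res32)
print("float64 residual:", res64)
```

The float64 residual should come out many orders of magnitude smaller than the float32 one, which is the behaviour described in the EDIT above.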

Note though that it will be significantly slower to perform computation in double precision on GPU.
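
If converting the whole dataset to double is too expensive, one possible middle ground (a suggestion of mine, not something tested on your data) is to keep the tensor in float32 and only accumulate the reduction in float64 through the `dtype` argument of `mean`. The sketch below again uses synthetic data as a stand-in. The result cannot be better than what float32 can represent, so expect a residual around 1e-7 rather than exactly 0.

```python
import torch

# Synthetic stand-in for the dataset (assumption): uniform values in [-1, 1).
torch.manual_seed(0)
x = torch.rand(178741, 16, dtype=torch.float32) * 2 - 1

# Accumulate the mean in float64, then round the result back to float32.
mean64 = x.mean(dim=0, dtype=torch.float64)
x_centered = x - mean64.to(torch.float32)  # result stays float32

# Measure the leftover mean in float64 so the check itself adds little error.
residual = x_centered.mean(dim=0, dtype=torch.float64).abs().max().item()
print("dtype:", x_centered.dtype)
print("residual:", residual)
```

Whether this is actually faster than doing everything in double depends on the device and the size of the data, so it is worth timing both variants.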