I have a (178741, 304) dataset where the first dimension indexes the samples and the second dimension holds the features of each sample. The means of the features range between -0.75 and 0.99, and I want to normalize the dataset so that the mean of every feature is 0.0 afterwards.
    x = ...  # the dataset
    x_mean = x.mean(axis=0)
    print("Before", x_mean.min(), x_mean.max())
    x_centered = x - x_mean
    x_centered_mean = x_centered.mean(axis=0)
    print("After", x_centered_mean.min(), x_centered_mean.max())
However, after the normalization the mean is still not 0. It got much smaller, but is still relatively high:
Before tensor(-0.748372793198) tensor(0.999999940395)
After tensor(-0.001484304899) tensor(0.000192862455)
The same happens when I try to normalize the std: the resulting std is not exactly 1. (All I do is divide x_centered by the std. I have no constant features that could cause std=0.)
What could be the reason for these instabilities? I am using torch.float32 values. In the original dataset the smallest value is -0.999 and the largest is 1.0. The mean over all dimensions is 0.11.
I read that very large values can cause bugs like this, but I have only small values. Am I not using the right data type?
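For context on the data type question, here is a small sketch that prints the machine epsilon of float32 and float64 and a rough worst-case bound for summing 178741 values one after another. The bound is an illustration only: it assumes a naive left-to-right sum, and real reductions are often considerably more accurate than that.

```python
import torch

# Machine epsilon: the gap between 1.0 and the next representable value.
eps32 = torch.finfo(torch.float32).eps  # 2**-23, about 1.19e-07
eps64 = torch.finfo(torch.float64).eps  # 2**-52, about 2.22e-16

n_samples = 178741  # number of rows in the dataset described above

# Rough worst-case error of a naive sequential sum of n values,
# relative to the sum of their absolute values (an assumption, not
# a statement about how torch actually reduces).
bound32 = n_samples * eps32
bound64 = n_samples * eps64

print(f"float32: eps={eps32:.3e}, naive-sum bound={bound32:.3e}")
print(f"float64: eps={eps64:.3e}, naive-sum bound={bound64:.3e}")
```

So even with values no larger than 1.0 in magnitude, float32 only carries about 7 significant decimal digits, and errors can add up across many samples.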
EDIT - Better results with double instead of float
I realized that there is also a double data type. When I convert the dataset to double before centering, I get a much better result: the mean reaches exactly 0. But as soon as I divide by the std, the mean deviates slightly from 0.0 again (range -0.00001 to 0.000001). That is still better than before.
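The double-precision variant described above can be sketched roughly like this. The random tensor is only a stand-in for the real dataset (an assumption, and smaller than the original), so the printed numbers will not match the ones quoted in this post.

```python
import torch

torch.manual_seed(0)
# Stand-in for the real dataset (assumption): uniform values in [-1, 1).
x = torch.rand(10000, 304, dtype=torch.float32) * 2 - 1

x64 = x.double()          # convert before computing any statistics
mean = x64.mean(dim=0)    # per-feature mean
std = x64.std(dim=0)      # per-feature std (unbiased estimator by default)

x_centered = x64 - mean
x_scaled = x_centered / std

print("centered mean:", x_centered.mean(dim=0).abs().max().item())
print("scaled mean:  ", x_scaled.mean(dim=0).abs().max().item())
print("scaled std:   ", x_scaled.std(dim=0).min().item(),
      x_scaled.std(dim=0).max().item())
```

If float32 is needed downstream, the result could be converted back with `x_scaled.float()` after the statistics have been applied.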
Would you recommend centering it again after dividing by the std? I saw that sklearn does that (scikit-learn/data.py at 7389dbac82d362f296dc2746f10e43ffa1615660 · scikit-learn/scikit-learn · GitHub)
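To make the question concrete, this is a minimal sketch of the "center again after scaling" idea. The helper name `standardize_with_recenter` is hypothetical and the data is a random stand-in; it illustrates the approach only and is not a copy of the scikit-learn code linked above.

```python
import torch

def standardize_with_recenter(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: center, scale, then center once more."""
    x = x - x.mean(dim=0)
    x = x / x.std(dim=0)
    # Second pass removes the small residual mean left by rounding error.
    # Shifting by a constant leaves the std unchanged up to rounding.
    return x - x.mean(dim=0)

torch.manual_seed(0)
# Stand-in data (assumption): float32, uniform values in [-1, 1).
x = torch.rand(10000, 304, dtype=torch.float32) * 2 - 1

z = standardize_with_recenter(x)
print("max |mean|:", z.mean(dim=0).abs().max().item())
print("std range: ", z.std(dim=0).min().item(), z.std(dim=0).max().item())
```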