MNIST normalization and torchvision's Normalize

I want to normalize the MNIST dataset.
Here is how I calculate the mean and standard deviation:

train_dataset = tv.datasets.MNIST('../data', train=True, download=True, transform=transform)
mean = torch.mean(torch.Tensor.float(train_dataset.data))
std = torch.std(torch.Tensor.float(train_dataset.data))

If I manually normalize the data like this:

train_dataset.data = (train_dataset.data - mean) / std
test_dataset.data = (test_dataset.data - mean) / std

I get decent accuracy (~0.978), though not better than without normalization (~0.9796).

However, if I use the Normalize transform with the same mean and std:

transform=tv.transforms.Compose([tv.transforms.ToTensor(), tv.transforms.Normalize(mean, std)])
train_dataset = tv.datasets.MNIST('../data', train=True, download=True, transform=transform)
test_dataset = tv.datasets.MNIST('../data', train=False, download=True, transform=transform)

I get very low accuracy (0.135). Why is that, and how should I use Normalize instead?

Second question: I also tried (manual) pixel-wise normalization:

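px_mean = torch.mean(torch.Tensor.float(train_dataset.data), dim=0)
px_std = torch.std(torch.Tensor.float(train_dataset.data), dim=0) + 1e-10
train_dataset.data = (train_dataset.data - px_mean) / px_std
test_dataset.data = (test_dataset.data - px_mean) / px_std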

but this, too, gives me very low accuracy (~0.135). Am I doing it wrong, and if so, how do I do it correctly?

The internal .data will store the raw dataset in uint8 with values in the range [0, 255].
The mean of these values (transformed to FloatTensors) would thus be 33.3184.
Normalizing the raw data with these values would thus work.
However, since ToTensor() already scales the tensors to the range [0, 1], the mean and std passed to transforms.Normalize should also be on this scale, i.e. divided by 255.
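
To make the scaling argument concrete, here is a minimal sketch in plain Python (so it runs without PyTorch). The pixel list is a synthetic stand-in, not real MNIST data, and `statistics.stdev` is used because it computes the sample standard deviation, as `torch.std` does by default.

```python
import statistics

# Synthetic stand-in for raw uint8 pixel values (0..255); not real MNIST data.
raw = [0, 0, 0, 12, 128, 255, 254, 0, 37, 0]

mean = statistics.fmean(raw)  # statistics on the raw [0, 255] scale
std = statistics.stdev(raw)   # sample standard deviation

# Route A: normalize the raw values with the raw statistics.
norm_raw = [(x - mean) / std for x in raw]

# Route B: scale to [0, 1] first (what ToTensor() does for uint8 images),
# then normalize with the statistics divided by 255.
scaled = [x / 255 for x in raw]
norm_scaled = [(x - mean / 255) / (std / 255) for x in scaled]

# Both routes agree up to floating-point error.
assert all(abs(a - b) < 1e-9 for a, b in zip(norm_raw, norm_scaled))

# The MNIST statistics quoted in this thread, rescaled to [0, 1]:
print(round(33.3184 / 255, 4), round(78.5675 / 255, 4))  # 0.1307 0.3081
```

Passing the raw-scale values (33.3184, 78.5675) to Normalize after ToTensor() instead subtracts about 33 from inputs that lie in [0, 1], which would explain the very low accuracy.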


I see. I tried

mean = 33.3184
std = 78.5675

transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(mean=(mean / 255,), std=(std / 255,))])

but this gives me lower accuracy (0.88) than ToTensor() on its own (0.98). Is that expected for MNIST?

It depends on your model and overall training, i.e.:

  • Is this result the training, evaluation, or final test accuracy?
  • How reproducible is this result (i.e. do you always see the difference in accuracy when using different seeds)?
  • How much hyper-parameter tuning did you apply (learning rate, model architecture, etc.), or did you use a “static” setup?
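
One way to check the reproducibility point is to repeat each setup over several seeds and compare the mean accuracy and its spread. The helper below is only a sketch: `accuracy_over_seeds` and the `run` callable are hypothetical names, and `run(seed)` is assumed to train a model with that seed and return its test accuracy.

```python
import statistics
from typing import Callable, Iterable, Tuple


def accuracy_over_seeds(run: Callable[[int], float],
                        seeds: Iterable[int]) -> Tuple[float, float]:
    """Call run(seed) for every seed; return (mean accuracy, sample std)."""
    accs = [run(seed) for seed in seeds]
    # The sample standard deviation needs at least two runs.
    spread = statistics.stdev(accs) if len(accs) > 1 else 0.0
    return statistics.fmean(accs), spread


# Usage, assuming train_with_normalize and train_without_normalize are your
# own (hypothetical) training functions taking a seed:
#   mean_a, std_a = accuracy_over_seeds(train_with_normalize, range(5))
#   mean_b, std_b = accuracy_over_seeds(train_without_normalize, range(5))
```

If the gap between the two means is small compared with the spread across seeds, the difference is probably noise rather than an effect of the normalization.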