PyTorch DataLoader pixel color histogram yields different results for shuffle=True/False

FelGC · January 31, 2022, 9:47am

In PyTorch, when computing the pixel color histogram of, e.g., the MNIST dataset through the PyTorch data loader, I get different results for shuffle=True and shuffle=False. In fact, setting shuffle=True returns the correct result, whereas shuffle=False does not. Both settings are supposed to return the same set of data objects (i.e. the entire data set), so the histograms should be identical.

import torch
import torchvision
import torchvision.transforms as transforms
from tqdm.notebook import tqdm

transform = transforms.Compose([transforms.ToTensor()])
data_set = torchvision.datasets.MNIST(root="./", train=True, download=True, transform=transform)

dl_random = torch.utils.data.DataLoader(data_set, batch_size=1, shuffle=True)
dl_fixed = torch.utils.data.DataLoader(data_set, batch_size=1, shuffle=False)

bins = 255

# Reference histogram: iterate over the dataset directly (fixed order, no DataLoader)
hist2 = torch.zeros(bins)
for data, labels in tqdm(data_set):
    hist2 += torch.histc(data, bins=bins, min=0, max=1)

# Same histogram through both loaders, compared against the reference
for name, dl in [("fixed", dl_fixed), ("random", dl_random)]:
    hist = torch.zeros(bins)
    for data, labels in tqdm(dl):
        hist += torch.histc(data, bins=bins, min=0, max=1)
    print(name, (hist - hist2).abs().sum())

Running the above code prints, e.g.:

>>> fixed tensor(0.)
>>> random tensor(40.)

Why does it not yield the same results (“tensor(0.)”)?

ptrblck · January 31, 2022, 9:58am

The difference is caused by the large values in hist[0] and hist2[0].
The default float32 dtype can represent every integer from 0 to 2**24=16777216 exactly, but larger integers (and, symmetrically, negative integers of larger magnitude) are rounded as follows:

Integers between 2**24=16777216 and 2**25=33554432 round to a multiple of 2 (even number)
Integers between 2**25 and 2**26 round to a multiple of 4
…
Integers between 2**n and 2**(n+1) round to a multiple of 2**(n-23)

as given in this Wikipedia article.
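A quick way to check the spacing rule is to store integers just above a power of two in a float32 tensor and read them back. This is only a minimal sketch; the helper name roundtrip is made up for illustration and is not part of PyTorch.

```python
import torch


def roundtrip(i: int) -> int:
    """Store the integer i in a float32 tensor and read it back as an int."""
    return int(torch.tensor(i, dtype=torch.float32).item())


for n in (24, 25, 26):
    base = 2**n
    step = 2**(n - 23)  # expected spacing of float32 values in [2**n, 2**(n+1))
    # base + step is representable, base + 1 is not and gets rounded
    print(
        f"n={n}: step={step}, "
        f"{base + 1} -> {roundtrip(base + 1)}, "
        f"{base + step} -> {roundtrip(base + step)}"
    )
```

Any integer that falls between two representable values is rounded to one of them, which is why adding a small per-image count to an accumulator above 2**24 can lose a few units each time.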

As you can see in your code, hist[0] holds values >2**25, so rounding is necessarily applied during the accumulation.
Since the order of the additions differs, you can’t expect to see the same results with shuffling enabled and disabled. Use float64 to accumulate and compare such large numbers instead (or an integer dtype such as long).
Initializing hist (and likewise hist2) as:

hist = torch.zeros(bins).double()

gives a zero difference.
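The effect can be reproduced without downloading MNIST. The sketch below replaces the per-image histogram counts with synthetic integer counts (an assumption made purely for illustration) and sums them one by one in the original and in a shuffled order, using float32, float64 and int64 accumulators. The helper accumulate is hypothetical, not a PyTorch function.

```python
import torch


def accumulate(counts: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    """Sum the counts sequentially, as the histogram loop does, in the given dtype."""
    total = torch.zeros((), dtype=dtype)
    for c in counts:
        total += c.to(dtype)
    return total


torch.manual_seed(0)
# Synthetic stand-in for one histogram bin: 6000 counts, total well above 2**25
counts = torch.randint(6000, 7000, (6000,))
shuffled = counts[torch.randperm(len(counts))]
exact = int(counts.sum())  # int64 sum, exact

for dtype in (torch.float32, torch.float64, torch.int64):
    fixed_total = int(accumulate(counts, dtype))
    shuffled_total = int(accumulate(shuffled, dtype))
    # Print the error of each accumulation order relative to the exact sum
    print(dtype, fixed_total - exact, shuffled_total - exact)
```

With float64 or int64 the error is zero in both orders, since all intermediate sums are exactly representable. With float32 the two errors depend on the order of the additions and need not agree. Note that torch.histc returns a floating-point tensor, so an integer accumulator would need an explicit conversion such as .long() on each histogram before adding.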

FelGC · January 31, 2022, 10:07am

Thank you very much, that is exactly what happened! I appreciate the rapid help and answer!