Computing the mean and std of dataset

Yes! That is because this code computes a different std. In the other code, the purpose was to compute the mean and std for each batch over the pixels, and then take the average of them. This is specifically applicable to batch normalization.

However, the purpose of this code that you have posted is to compute the mean of the entire data and std over the entire data, which is different from batch-normalization.
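
To make the distinction concrete, here is a minimal plain-Python sketch comparing the average of per-batch standard deviations with the standard deviation over all values at once. The toy numbers and variable names are made up for illustration.

```python
import statistics

# Two toy "batches" of pixel values (hypothetical data for illustration).
batches = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]

# Approach 1: std of each batch, then average the per-batch results.
avg_batch_std = sum(statistics.pstdev(b) for b in batches) / len(batches)

# Approach 2: std over every value in the dataset at once.
all_values = [v for b in batches for v in b]
global_std = statistics.pstdev(all_values)

print(f"average of per-batch stds: {avg_batch_std:.4f}")
print(f"std over the entire data:  {global_std:.4f}")
```

The second number is much larger here because it also captures the spread between batches, which the per-batch average ignores.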


I have tried the method written by @ptrblck (thanks, ptrblck). It works, but the CPU usage is very high, more than 1200%. How can I change the code or settings to make it more efficient?

Thanks in advance. :smiley:


Usually you would only compute it once on your dataset.
Why do you think the CPU utilization is too high?


This version should run much faster and compute the same result as std(), though you need to be careful about overflow:

  #
  # True standard deviation
  #

  loader = torch.utils.data.DataLoader(
      your_dataset,
      batch_size=10,
      num_workers=0,
      shuffle=False
  )

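  # Accumulate running sums over all batches.
  # If the dataset yields (image, label) pairs, iterate with `for data, _ in loader:`.
  total = 0.
  total_sq = 0.
  count = 0
  for data in loader:
      total += data.sum()
      total_sq += (data**2).sum()
      count += data.numel()

  mean = total / count
  std = torch.sqrt(total_sq / count - mean**2)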
  print("mean: " + str(mean))
  print("std: " + str(std))
  print()
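
On the overflow caveat: subtracting a squared mean from a mean of squares can also lose precision on large datasets. One commonly used alternative (not what the post above does) is the pairwise update of Chan et al., which merges per-batch counts, means, and sums of squared deviations. Below is a plain-Python sketch; the function names are hypothetical, and with tensors the same arithmetic would be applied to per-batch statistics.

```python
import math

def batch_stats(values):
    """Return (count, mean, sum of squared deviations) for one batch."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return n, mean, m2

def combine(a, b):
    """Merge two (count, mean, m2) summaries using Chan et al.'s update."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2

def running_mean_std(batches):
    """Population mean and std over all batches, one batch at a time."""
    state = None
    for batch in batches:
        stats = batch_stats(batch)
        state = stats if state is None else combine(state, stats)
    n, mean, m2 = state
    return mean, math.sqrt(m2 / n)

mean, std = running_mean_std([[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]])
print(mean, std)
```

Because the update never forms a large sum of squares, it tends to be better behaved numerically, at the cost of a little more bookkeeping.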

I tried your code, but it raised an error at mean = data.mean():
AttributeError: 'list' object has no attribute 'mean'

I mean, for the two codes you posted, each does different things.

The first is the mean over the means in each image.

The second is the mean over all images.

Why should they be the same?

If all images are of the same size and all batches are of the same size they’re mathematically equal. In OP’s case (three-channel images of different shapes) that’s of course not the case.
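
A small plain-Python sketch of this point, using made-up one-channel "images" stored as flat lists (the helper names are hypothetical): the mean of per-image means equals the mean over all pixels only when every image contributes the same number of pixels.

```python
def mean(values):
    return sum(values) / len(values)

def mean_of_means(images):
    # Every image gets equal weight, regardless of its size.
    return mean([mean(img) for img in images])

def pixel_mean(images):
    # Every pixel gets equal weight.
    total = sum(sum(img) for img in images)
    count = sum(len(img) for img in images)
    return total / count

same_size = [[0.0, 0.0], [1.0, 1.0]]
different_size = [[0.0] * 8, [1.0] * 2]

print(mean_of_means(same_size), pixel_mean(same_size))
print(mean_of_means(different_size), pixel_mean(different_size))
```

With images of different sizes, the small image is over-weighted by the mean of means, so the two results diverge.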

It might be a bit too late, but PIL provides nice functionality for your problem in the ImageStat.Stat class. Its calculations are based on the histogram of the image and therefore only need O(1) memory, but it only considers one image. To deal with more images, I extended the Stat class with an __add__ method that combines the histograms of the two given objects (which is a bit like concatenating two images and generating the Stat object from the result):

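from operator import add
from PIL import ImageStat

class Stats(ImageStat.Stat):
    def __add__(self, other):
        # Sum the two histograms bin by bin and wrap the result in a new Stats.
        return Stats(list(map(add, self.h, other.h)))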

The histogram is stored in h, both histograms (of self and other) are summed up and then a new Stat class is initialized with the new histogram instead of an image.
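
To illustrate why summing histograms is enough, here is a plain-Python sketch of the idea for a single 8-bit channel. It does not use PIL and is not PIL's implementation; the helper names are hypothetical.

```python
import math

def histogram(pixels, bins=256):
    """Count how often each integer pixel value occurs."""
    h = [0] * bins
    for p in pixels:
        h[p] += 1
    return h

def add_histograms(h1, h2):
    """Bin-wise sum, equivalent to the histogram of both images together."""
    return [a + b for a, b in zip(h1, h2)]

def stats_from_histogram(h):
    """Population mean and std recovered from bin counts alone."""
    count = sum(h)
    total = sum(value * n for value, n in enumerate(h))
    total_sq = sum(value * value * n for value, n in enumerate(h))
    mean = total / count
    var = total_sq / count - mean ** 2
    return mean, math.sqrt(var)

combined = add_histograms(histogram([0, 0]), histogram([255, 255]))
hist_mean, hist_std = stats_from_histogram(combined)
print(hist_mean, hist_std)
```

The memory needed is fixed by the number of bins, no matter how many images are folded in.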

Using this new “Stats” class I could do something like:

loader = DataLoader(dataset, batch_size=10, num_workers=5)

statistics = None
for data in loader:
    for b in range(data.shape[0]):
        if statistics is None:
            statistics = Stats(tf.to_pil_image(data[b]))
        else:
            statistics += Stats(tf.to_pil_image(data[b]))

And from there on use normal Stat calls like:

print(f'mean:{statistics.mean}, std:{statistics.stddev}')
# mean:[199.59, 156.30, 170.59], std:[31.30, 31.28, 35.95]

Note that although this is quite a neat solution, it is far from the most efficient one.

import torchvision.transforms.functional as tf
from PIL import ImageStat

This works for me, thanks. It is better and more stable than the other methods I tried.


@ptrblck @vmirly1 I still have a doubt regarding normalization. I want to normalize my data; what approach should I take?
I don't want to resize my data, as the image size varies a lot, from (20,80,3) to (253,80,3).
I also want to try ResNets, which use batch normalization. Do I still need to normalize the input myself?
And again, if I don't resize the images, I get an error saying “invalid argument” when using the code with the data loader.
Your clarification and help would be much appreciated. :grinning:

Could you post the complete error message please?

# Doesn't work unless the batch size is equal to 1.
dataset = datasets.ImageFolder(('train'),transform = transforms.ToTensor())
loader = torch.utils.data.DataLoader(dataset,
                                     batch_size = 10,
                                     num_workers = 0,
                                     shuffle = False)
mean = 0.0
for images,_ in loader:
  print(images.shape)
  batch_samples = images.size(0)
  
  images = images.view(batch_samples,images.size(1),-1)
  print(images.shape)
  mean += images.mean(2).sum(0)
  print(images.mean(2).sum(0))
  break
mean = mean/len(loader.dataset)

Error Message:

I think there is no issue in the mean computation; the problem is in the data loader. You cannot put images of different sizes in one batch. So, you may want to try iterating through the dataset with batch_size=1.
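
As a rough illustration of processing one image at a time, here is a plain-Python sketch that accumulates per-channel sums, sums of squares, and pixel counts over images of different sizes. The nested-list layout (image, then channel, then flat pixel list) and the function name are assumptions made for this example; with a DataLoader and batch_size=1 the same accumulation would be done on tensors.

```python
import math

def channel_stats(images):
    """Per-channel population mean and std over variable-size images."""
    sums = sq_sums = counts = None
    for image in images:  # one image per step, like batch_size=1
        if sums is None:
            n_channels = len(image)
            sums = [0.0] * n_channels
            sq_sums = [0.0] * n_channels
            counts = [0] * n_channels
        for c, channel in enumerate(image):
            sums[c] += sum(channel)
            sq_sums[c] += sum(p * p for p in channel)
            counts[c] += len(channel)
    means = [s / n for s, n in zip(sums, counts)]
    # Clamp at zero to guard against tiny negative values from rounding.
    stds = [math.sqrt(max(q / n - m * m, 0.0))
            for q, n, m in zip(sq_sums, counts, means)]
    return means, stds

# Two toy two-channel images with different pixel counts.
small = [[0.0, 1.0], [0.5, 0.5]]
large = [[1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
means, stds = channel_stats([small, large])
print(means, stds)
```

Weighting by pixel count means larger images contribute proportionally more, which matches computing the statistics over all pixels of the dataset.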


This is my solution:

mean = 0.0
meansq = 0.0
count = 0

for index, data in enumerate(train_loader):
    mean = mean + data.sum()
    meansq = meansq + (data**2).sum()
    count += np.prod(data.shape)

total_mean = mean/count
total_var = (meansq/count) - (total_mean**2)
total_std = torch.sqrt(total_var)
print("mean: " + str(total_mean))
print("std: " + str(total_std))

Gave this a go like this:


class Stats(PIL.ImageStat.Stat):
    def __add__(self, other):
        return Stats(list(map(add, self.h, other.h)))

loader = my_train_dataloader(batch_size=64, balance=False, return_path=False, verbose=False, augment=False, resize=False)

statistics = None
toPIL=transforms.ToPILImage()

print(PIL.__version__)

for data, _ in loader:
    for b in range(data.shape[0]):
        if statistics is None:
            print(type(toPIL(data[b])))
            statistics = Stats(toPIL(data[b]))
        else:
            statistics += Stats(toPIL(data[b]))
print(f'mean:{statistics.mean}, std:{statistics.stddev}')

# PIL version (pillow-simd)
6.2.2.post1
# confirming that I'm passing the right thing
<class 'PIL.Image.Image'>

but got this error:

<ipython-input-22-53e0795c502a> in __add__(self, other)
      2 class Stats(PIL.ImageStat.Stat):
      3     def __add__(self, other):
----> 4         return Stats(list(map(add, self.h, other.h)))
      5 
      6 loader = my_train_dataloader(batch_size=64, balance=False, return_path=False, verbose=False, augment=False, resize=False)

NameError: name 'add' is not defined

Probably your dataloader is returning (tensor, label), so you could modify the loop like:

for data, _ in loader:

import numpy as np
from PIL import ImageStat

class Stats(ImageStat.Stat):
  def __add__(self, other):
    # add self.h and other.h element-wise
    return Stats(list(np.add(self.h, other.h)))

If I am training my model with a batch size of 4, should I compute the mean and std with a batch size of 4? Or is it more accurate to compute the mean and std based on bigger batches (like 8) and then train my model with a batch size of 4?

Thanks.

Finally, do we know a good method to calculate mean and std?

Any batch_size should work. Training batch_size isn’t directly related to the batch_size you use for calculating mean and std.
You could choose 4 for both, or choose 4 and 8.
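
A quick plain-Python check of this, assuming the statistics are accumulated as running sums over all values (the function name and toy data are made up for illustration): the batch size only changes how the data is chunked, not the result.

```python
import math

def mean_std(values, batch_size):
    """Population mean and std, accumulated batch by batch."""
    total = 0.0
    total_sq = 0.0
    count = 0
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        total += sum(batch)
        total_sq += sum(v * v for v in batch)
        count += len(batch)
    mean = total / count
    return mean, math.sqrt(total_sq / count - mean ** 2)

values = [float(v % 7) for v in range(40)]
stats_4 = mean_std(values, batch_size=4)
stats_8 = mean_std(values, batch_size=8)
print(stats_4, stats_8)
```

This holds for sum-based accumulation; averaging per-batch statistics instead can depend on how the batches are formed.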