Computing the mean and std of dataset

Usually you would only compute it once on your dataset.
Why do you think the CPU utilization is too high?

1 Like

This version should run much faster and compute the same result as std(), though you need to be careful about overflow:

  #
  # True standard deviation
  #

  loader = torch.utils.data.DataLoader(
      your_dataset,
      batch_size=10,
      num_workers=0,
      shuffle=False
  )

  mean = 0.
  meansq = 0.
  for data in loader:
      # accumulate the per-batch statistics instead of overwriting them
      mean += data.mean()
      meansq += (data ** 2).mean()

  # average over the number of batches (assumes all batches have the same size)
  mean /= len(loader)
  meansq /= len(loader)

  std = torch.sqrt(meansq - mean ** 2)
  print("mean: " + str(mean))
  print("std: " + str(std))
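If the dataset is large, the running averages above can lose precision or overflow in low-precision dtypes. A minimal sketch of a more robust variant (my own addition, not part of the original post) accumulates pixel-weighted sums in float64, which also handles a smaller last batch correctly:

total_sum = 0.0
total_sumsq = 0.0
total_count = 0
for data in loader:
    data = data.double()                     # accumulate in float64
    total_sum += data.sum().item()
    total_sumsq += (data ** 2).sum().item()
    total_count += data.numel()              # weight by the number of elements

mean = total_sum / total_count
std = (total_sumsq / total_count - mean ** 2) ** 0.5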
4 Likes

I tried your code, but an error was raised at mean += data.mean():
AttributeError: 'list' object has no attribute 'mean'

I mean, the two snippets you posted do different things.

The first is the mean over the means in each image.

The second is the mean over all images.

Why should they be the same?

If all images are of the same size and all batches are of the same size, they're mathematically equal. In the OP's case (three-channel images of different shapes), that's of course not the case.
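For a concrete illustration with hypothetical numbers (two "images" of different sizes), the mean of the per-image means differs from the overall mean:

import torch

a = torch.tensor([0., 0.])            # image with 2 pixels
b = torch.tensor([3., 3., 3., 3.])    # image with 4 pixels

mean_of_means = (a.mean() + b.mean()) / 2         # (0 + 3) / 2 = 1.5
overall_mean = torch.cat([a, b]).mean()           # 12 / 6 = 2.0
print(mean_of_means.item(), overall_mean.item())  # 1.5 2.0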

It might be a bit too late, but PIL provides nice functionality for your problem in the ImageStat.Stat class. Its calculations are based on the histogram of the image and therefore only need O(1) memory, but it only considers a single image. To deal with more images, I extended the Stat class with an __add__ method that combines the histograms of the two objects (so it is a bit like concatenating two images and generating the Stat object from the result):

from operator import add
from PIL import ImageStat

class Stats(ImageStat.Stat):
    def __add__(self, other):
        # sum the bin counts of the two histograms element-wise
        return Stats(list(map(add, self.h, other.h)))

The histogram is stored in h; the histograms of self and other are summed bin by bin, and a new Stats object is initialized with the combined histogram instead of an image.

Using this new Stats class, I could do something like:

import torchvision.transforms.functional as tf
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=10, num_workers=5)

statistics = None
for data in loader:
    for b in range(data.shape[0]):
        # convert each tensor image back to PIL and merge its histogram
        if statistics is None:
            statistics = Stats(tf.to_pil_image(data[b]))
        else:
            statistics += Stats(tf.to_pil_image(data[b]))

And from there on use normal Stat calls like:

print(f'mean:{statistics.mean}, std:{statistics.stddev}')
# mean:[199.59, 156.30, 170.59], std:[31.30, 31.28, 35.95]
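Note that these statistics are on the PIL 0-255 scale. If the goal is to pass them to transforms.Normalize after transforms.ToTensor (which rescales images to [0, 1]), which is an assumption about the use case, they would first need to be divided by 255:

# hedged sketch: rescale the 0-255 PIL statistics to the [0, 1] tensor range
mean = [m / 255 for m in statistics.mean]
std = [s / 255 for s in statistics.stddev]
normalize = transforms.Normalize(mean=mean, std=std)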

Note that although this is quite a neat solution, it is far from the most efficient.

3 Likes
import torchvision.transforms.functional as tf
from PIL import ImageStat

This works for me, thanks. It's better and more stable than the other methods I tried.

1 Like

@ptrblck @vmirly1 I still have a doubt regarding normalization. I want to normalize my data; what approach should I take?
I don't want to resize my data, as the image size varies a lot, from (20, 80, 3) to (253, 80, 3).
I also want to try ResNets, which use batch normalization; do I need batch normalization as well?
And if I don't resize the images, I get an "invalid argument" error when using the code with the data loader.
Your clarification and help would be much appreciated. :grinning:

Could you post the complete error message please?

# doesn't work unless batch_size is 1, since the images have different sizes
dataset = datasets.ImageFolder('train', transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(dataset,
                                     batch_size=10,
                                     num_workers=0,
                                     shuffle=False)
mean = 0.0
for images, _ in loader:
    print(images.shape)
    batch_samples = images.size(0)

    images = images.view(batch_samples, images.size(1), -1)
    print(images.shape)
    mean += images.mean(2).sum(0)
    print(images.mean(2).sum(0))
    break
mean = mean / len(loader.dataset)

Error Message:

I think there is no issue in the mean computation; the problem is in the DataLoader. You cannot put images of different sizes in a batch, so you may want to try iterating through the dataset with batch_size=1.
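A minimal sketch of that approach, assuming the dataset returns (image, label) pairs with three channels and varying spatial sizes:

loader = torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False)

channel_sum = torch.zeros(3, dtype=torch.float64)
channel_sumsq = torch.zeros(3, dtype=torch.float64)
pixel_count = 0

for image, _ in loader:
    # image has shape [1, C, H, W]; flatten the spatial dimensions
    image = image.double().view(image.size(1), -1)
    channel_sum += image.sum(dim=1)
    channel_sumsq += (image ** 2).sum(dim=1)
    pixel_count += image.size(1)          # pixels per channel in this image

mean = channel_sum / pixel_count
std = torch.sqrt(channel_sumsq / pixel_count - mean ** 2)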

1 Like

This is my solution:

import numpy as np
import torch

mean = 0.0
meansq = 0.0
count = 0

for index, data in enumerate(train_loader):
    # accumulate the sums rather than overwriting them
    mean += data.sum()
    meansq += (data ** 2).sum()
    count += np.prod(data.shape)

total_mean = mean / count
total_var = (meansq / count) - (total_mean ** 2)
total_std = torch.sqrt(total_var)
print("mean: " + str(total_mean))
print("std: " + str(total_std))
4 Likes

Gave this a go like this:

Computing the mean and std of dataset - #12 by pete7.62

class Stats(PIL.ImageStat.Stat):
    def __add__(self, other):
        return Stats(list(map(add, self.h, other.h)))

loader = my_train_dataloader(batch_size=64, balance=False, return_path=False, verbose=False, augment=False, resize=False)

statistics = None
toPIL = transforms.ToPILImage()

print(PIL.__version__)

for data, _ in loader:
    for b in range(data.shape[0]):
        if statistics is None:
            print(type(toPIL(data[b])))
            statistics = Stats(toPIL(data[b]))
        else:
            statistics += Stats(toPIL(data[b]))
print(f'mean:{statistics.mean}, std:{statistics.stddev}')

# PIL version (pillow-simd)
6.2.2.post1
# confirming that I'm passing the right thing
<class 'PIL.Image.Image'>

but got this error:

<ipython-input-22-53e0795c502a> in __add__(self, other)
      2 class Stats(PIL.ImageStat.Stat):
      3     def __add__(self, other):
----> 4         return Stats(list(map(add, self.h, other.h)))
      5 
      6 loader = my_train_dataloader(batch_size=64, balance=False, return_path=False, verbose=False, augment=False, resize=False)

NameError: name 'add' is not defined
1 Like

Probably your DataLoader is returning (tensor, label), so you could modify the loop like:

for data, _ in loader:

The NameError is raised because add was never imported; you could use np.add (or operator.add) instead:

import numpy as np
from PIL import ImageStat

class Stats(ImageStat.Stat):
    def __add__(self, other):
        # add self.h and other.h element-wise
        return Stats(list(np.add(self.h, other.h)))
1 Like

If I am training my model with a batch size of 4, should I compute the mean and std with a batch size of 4? Or is it more accurate to compute the mean and std with bigger batches (like 8) and then train my model with a batch size of 4?

Thanks.

Finally, do we know a good method to calculate mean and std?

Any batch_size should work. Training batch_size isn’t directly related to the batch_size you use for calculating mean and std.
You could choose 4 for both, or choose 4 and 8.
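For intuition, here is a small sketch (random data, hypothetical names) showing that a pixel-weighted running sum gives the same mean regardless of the batch size, even when the last batch is smaller:

import torch
from torch.utils.data import DataLoader, TensorDataset

def dataset_mean(loader):
    # pixel-weighted running sum, so the result does not depend on batch_size
    total, count = 0.0, 0
    for (data,) in loader:
        total += data.double().sum().item()
        count += data.numel()
    return total / count

data = torch.randn(100, 3, 8, 8)
ds = TensorDataset(data)
print(dataset_mean(DataLoader(ds, batch_size=4)))
print(dataset_mean(DataLoader(ds, batch_size=8)))  # same value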

Very, very late, but I think this one is (almost) mathematically correct.

Instead of a center crop, one can count the number of pixels, e.g. pixel_count += images.nelement(), if the image sizes differ.

dataset = datasets.ImageFolder('train', transform=transforms.Compose([transforms.ToTensor()]))

loader = data.DataLoader(dataset,
                         batch_size=10,
                         num_workers=0,
                         shuffle=False,
                         drop_last=False)

# first pass: per-channel mean (mean of the per-image channel means)
mean = 0.0
for images, _ in loader:
    batch_samples = images.size(0)
    images = images.view(batch_samples, images.size(1), -1)
    mean += images.mean(2).sum(0)
mean = mean / len(loader.dataset)

# second pass: per-channel variance around that mean
var = 0.0
pixel_count = 0
for images, _ in loader:
    batch_samples = images.size(0)
    images = images.view(batch_samples, images.size(1), -1)
    var += ((images - mean.unsqueeze(1))**2).sum([0, 2])
    pixel_count += images.nelement()
std = torch.sqrt(var / pixel_count)

The code looks good!
But there is an issue when counting the number of pixels: since we accumulate per channel, we should exclude the channel dimension:

pixel_count += images.nelement() / images.size(1)

The updated version:


loader = data.DataLoader(dataset,
                         batch_size=10,
                         num_workers=0,
                         shuffle=False,
                         drop_last=False)

# first pass: per-channel mean (mean of the per-image channel means)
mean = 0.0
for images, _ in loader:
    batch_samples = images.size(0)
    images = images.view(batch_samples, images.size(1), -1)
    mean += images.mean(2).sum(0)
mean = mean / len(loader.dataset)

# second pass: per-channel variance around that mean
var = 0.0
pixel_count = 0
for images, _ in loader:
    batch_samples = images.size(0)
    images = images.view(batch_samples, images.size(1), -1)
    var += ((images - mean.unsqueeze(1))**2).sum([0, 2])
    pixel_count += images.nelement() / images.size(1)
std = torch.sqrt(var / pixel_count)
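As a sanity check, a small sketch (my own addition, only feasible when the whole dataset fits in memory) flattens every image's spatial dimensions and compares against the two-pass result; the values should match exactly when all images have the same size:

# concatenate all pixels per channel: shape [C, total_pixels]
pixels = torch.cat([img.view(img.size(0), -1) for img, _ in dataset], dim=1)
print(pixels.mean(dim=1))                  # compare with `mean`
print(pixels.std(dim=1, unbiased=False))   # compare with `std` (population std)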