Computing the mean and std of dataset

kuzand · January 17, 2019, 7:53pm

Hello. So I am trying to compute the mean and the standard deviation per channel of my train dataset (three-channel images of different shapes).
For the mean I can do it in two ways, but I get slightly different results.

import torch
from torchvision import datasets, transforms

dataset = datasets.ImageFolder('train',
                 transform=transforms.ToTensor())

First computation:

mean = 0.0
for img, _ in dataset:
    #mean += img.sum([1,2])/torch.numel(img[0])
    mean += img.mean([1,2])
mean = mean/len(dataset)
print(mean)
# tensor([0.3749, 0.3992, 0.4505])

Second computation:

sumel = 0.0
countel = 0
for img, _ in dataset:
    sumel += img.sum([1, 2])
    countel += torch.numel(img[0])
mean = sumel/countel
print(mean)
# tensor([0.3802, 0.4003, 0.4513])

Any idea why there is this small difference in the two computations?

Similarly for the std

sumel = 0.0
countel = 0
for img, _ in dataset:
    img = (img - mean.unsqueeze(1).unsqueeze(1))**2
    sumel += img.sum([1, 2])
    countel += torch.numel(img[0])
std = torch.sqrt(sumel/countel)

Is it a correct way to compute it?

kuzand · January 17, 2019, 9:01pm

Ok, I think this small discrepancy is due to some numerical issue (floating point error?).
Anyway, the above method of computing mean and std is not efficient for big datasets. So it is better to use the dataloader, after first resizing the images.
I found in the thread "About Normalization using pre-trained vgg16 networks" how to compute the mean and std (thanks to ptrblck):

loader = torch.utils.data.DataLoader(dataset,
                                     batch_size=10,
                                     num_workers=0,
                                     shuffle=False)

mean = 0.
std = 0.
for images, _ in loader:
    batch_samples = images.size(0) # batch size (the last batch can have smaller size!)
    images = images.view(batch_samples, images.size(1), -1)
    mean += images.mean(2).sum(0)
    std += images.std(2).sum(0)

mean /= len(loader.dataset)
std /= len(loader.dataset)

However, I have doubts about the correctness of the std computation. In the above code the stds of all the images are summed and at the end divided by the total number of images. But I think the overall std should be computed over all the pixel values of all the images in the dataset, as in my previous post.
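
To make the distinction concrete, here is a minimal sketch in plain Python (no PyTorch). The two tiny single-channel "images" are made-up values used only for illustration. It compares the average of the per-image standard deviations with the standard deviation taken over all pixels pooled together.

```python
from statistics import pstdev

# Two hypothetical single-channel "images", flattened to pixel lists.
img_a = [0.0, 0.0, 0.0, 0.0]   # uniformly dark
img_b = [1.0, 1.0, 1.0, 1.0]   # uniformly bright

# Average of the per-image (population) standard deviations.
mean_of_stds = (pstdev(img_a) + pstdev(img_b)) / 2

# Standard deviation over all pixels of the dataset pooled together.
pooled_std = pstdev(img_a + img_b)

print(mean_of_stds)  # 0.0
print(pooled_std)    # 0.5
```

Per-image stds ignore brightness differences between images, so in this toy case their average is zero even though the pooled pixel values clearly vary.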

vmirly1 · January 17, 2019, 11:39pm

I think in the other post by @ptrblck, he is computing the mean and std over the pixels not over samples in the batch. So, then that code in About Normalization using pre-trained vgg16 networks is correct, since the goal is to compute the mean and std for each batch and then take the average of these two quantities over the entire dataset.

kuzand · January 18, 2019, 12:29am

What about this one:

dataset = datasets.ImageFolder('train', transform=transforms.Compose([transforms.Resize(256),
                             transforms.CenterCrop(224),
                             transforms.ToTensor()]))

loader = torch.utils.data.DataLoader(dataset,
                                     batch_size=10,
                                     num_workers=0,
                                     shuffle=False)

mean = 0.0
for images, _ in loader:
    batch_samples = images.size(0) 
    images = images.view(batch_samples, images.size(1), -1)
    mean += images.mean(2).sum(0)
mean = mean / len(loader.dataset)

var = 0.0
for images, _ in loader:
    batch_samples = images.size(0)
    images = images.view(batch_samples, images.size(1), -1)
    var += ((images - mean.unsqueeze(1))**2).sum([0,2])
std = torch.sqrt(var / (len(loader.dataset)*224*224))

This also gives reasonable values for the std, but they differ from the std computed by ptrblck's code.

vmirly1 · January 18, 2019, 12:36am

yes! Because this is computing a different std. In the other code, the purpose was to compute the mean and std for each batch over the pixels, and then take the average of them. This is specifically applicable for Batch-Normalization.

However, the purpose of this code that you have posted is to compute the mean of the entire data and std over the entire data, which is different from batch-normalization.

MariosOreo · January 18, 2019, 1:39pm

I have tried the method written by @ptrblck (thanks to ptrblck). It works, but the CPU cost is too high, more than 1200%. How can I change the code or settings to make it more efficient?

Thanks in advance.

ptrblck · January 19, 2019, 3:25am

Usually you would only compute it once on your dataset.
Why do you think the CPU utilization is too high?

jcowles · May 13, 2019, 7:16pm

This version should run much faster and compute the same result as std(), though you need to be careful about overflow:

  #
  # True standard deviation
  #

  loader = torch.utils.data.DataLoader(
      your_dataset,
      batch_size=10,
      num_workers=0,
      shuffle=False
  )

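  mean = 0.
  meansq = 0.
  count = 0
  for data in loader:
      # Assumes the loader yields plain tensors; weight each batch by its
      # element count so a smaller last batch is handled correctly.
      mean += data.mean() * data.numel()
      meansq += (data**2).mean() * data.numel()
      count += data.numel()

  mean /= count
  meansq /= count
  std = torch.sqrt(meansq - mean**2)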
  print("mean: " + str(mean))
  print("std: " + str(std))
  print()
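
The E[x^2] - E[x]^2 form can lose precision when the mean is large relative to the spread. One alternative, sketched here in plain Python as an assumption rather than anything posted in this thread, is to merge per-batch statistics with the pairwise update of Chan et al. The `batch_stats` and `merge` helpers and the toy `batches` are hypothetical names and values for illustration.

```python
import math

def batch_stats(values):
    """Return (count, mean, M2) for one batch, where M2 = sum((x - mean)^2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values)
    return n, mean, m2

def merge(a, b):
    """Combine two (count, mean, M2) triples into one for the union of the data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2

# Toy batches of unequal size (the last batch is smaller, as with a DataLoader).
batches = [[0.1, 0.4, 0.7], [0.2, 0.9, 0.5], [0.3, 0.6]]

running = batch_stats(batches[0])
for batch in batches[1:]:
    running = merge(running, batch_stats(batch))

count, mean, m2 = running
std = math.sqrt(m2 / count)  # population std over all values
print(mean, std)
```

Because each batch is centered on its own mean before squaring, the intermediate sums stay small, and only three numbers need to be kept between batches.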

JessicaDuFirst · October 31, 2019, 12:51pm

I tried your code, but it raised an error at the data.mean() call:
AttributeError: 'list' object has no attribute 'mean'

IngoMarquart · November 1, 2019, 9:23am

I mean, for the two codes you posted, each does different things.

The first is the mean over the means in each image.

The second is the mean over all images.

Why should they be the same?

pete7.62 · December 4, 2019, 1:46pm

If all images are of the same size and all batches are of the same size they’re mathematically equal. In OP’s case (three-channel images of different shapes) that’s of course not the case.
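
A minimal sketch of this point in plain Python, using two made-up single-channel "images" of different sizes (the values are hypothetical and chosen only to make the gap obvious):

```python
# Two hypothetical single-channel "images" of different sizes (flattened).
small = [1.0, 1.0]                       # 2 pixels
large = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # 6 pixels

# First computation: average the per-image means (every image counts equally).
mean_of_means = (sum(small) / len(small) + sum(large) / len(large)) / 2

# Second computation: total pixel sum over total pixel count
# (every pixel counts equally).
pooled_mean = (sum(small) + sum(large)) / (len(small) + len(large))

print(mean_of_means)  # 0.5
print(pooled_mean)    # 0.25
```

The first form gives each image the same weight regardless of size, while the second weights each image by its pixel count, so the two agree only when all images have the same number of pixels.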

thisismexp · December 4, 2019, 3:29pm

It might be a bit too late, but PIL provides nice functionality for your problem in the ImageStat.Stat class. Its calculations are based on the histogram of the image and therefore only need O(1) memory, but it only considers one image. To deal with more images, I extended the Stat class by introducing an __add__ method which combines the histograms of the two given objects (so it is a bit like concatenating two images and generating the Stat object from the result):

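from operator import add
from PIL import ImageStat

class Stats(ImageStat.Stat):
    def __add__(self, other):
        # Element-wise sum of the two histograms, wrapped in a new Stats object
        return Stats(list(map(add, self.h, other.h)))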

The histogram is stored in h, both histograms (of self and other) are summed up and then a new Stat class is initialized with the new histogram instead of an image.
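
As a rough illustration of why summing histograms is enough, the sketch below computes a mean and standard deviation directly from a 256-bin histogram in plain Python. It is not PIL's implementation; `histogram`, `hist_stats` and the pixel values are hypothetical, and it handles a single band only.

```python
import math

def histogram(pixels):
    """Count how often each 8-bit value (0..255) occurs."""
    h = [0] * 256
    for p in pixels:
        h[p] += 1
    return h

def hist_stats(h):
    """Population mean and std of the pixel values described by histogram h."""
    n = sum(h)
    s1 = sum(value * count for value, count in enumerate(h))
    s2 = sum(value * value * count for value, count in enumerate(h))
    mean = s1 / n
    var = s2 / n - mean ** 2
    return mean, math.sqrt(var)

# Two hypothetical single-band images of different sizes.
img_a = [10, 20, 30, 40]
img_b = [200, 220]

# Adding the histograms bin by bin is equivalent to concatenating the images.
combined = [x + y for x, y in zip(histogram(img_a), histogram(img_b))]

print(hist_stats(combined))
print(hist_stats(histogram(img_a + img_b)))  # same values
```

The histogram has a fixed size no matter how many pixels were counted, which is why memory use stays constant as more images are added.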

Using this new “Stats” class I could do something like:

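from torch.utils.data import DataLoader
import torchvision.transforms.functional as tf  # assumed: tf refers to torchvision's functional transforms

loader = DataLoader(dataset, batch_size=10, num_workers=5)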

statistics = None
for data in loader:
    for b in range(data.shape[0]):
        if statistics is None:
            statistics = Stats(tf.to_pil_image(data[b]))
        else:
            statistics += Stats(tf.to_pil_image(data[b]))

And from there on use normal Stat calls like:

print(f'mean:{statistics.mean}, std:{statistics.stddev}')
# mean:[199.59, 156.30, 170.59], std:[31.30, 31.28, 35.95]

Note that although this is quite a neat solution, it is far from the most efficient.

Hong_Cheng · January 18, 2020, 12:24am

import tensorflow as tf
from PIL import ImageStat

This works for me, thanks. It is better and more stable than the other methods I tried.

prateekgupta891 · March 30, 2020, 6:27pm

@ptrblck @vmirly1 I still have a doubt regarding normalization. I want to normalize my data; what approach should I take?
I don't want to resize my data, as the image size varies a lot, from (20,80,3) to (253,80,3).
And I want to try ResNets, which use batch normalization; do I need to do batch normalization?
And again, if I don't resize the images, I get an error saying “invalid argument” when using the code with the data loader.
Your clarification and help would be much appreciated.

ptrblck · March 31, 2020, 6:04am

Could you post the complete error message please?

prateekgupta891 · April 1, 2020, 9:54am

# Doesn't work unless the batch size is 1.
dataset = datasets.ImageFolder(('train'),transform = transforms.ToTensor())
loader = torch.utils.data.DataLoader(dataset,
                                     batch_size = 10,
                                     num_workers = 0,
                                     shuffle = False)
mean = 0.0
for images,_ in loader:
  print(images.shape)
  batch_samples = images.size(0)
  
  images = images.view(batch_samples,images.size(1),-1)
  print(images.shape)
  mean += images.mean(2).sum(0)
  print(images.mean(2).sum(0))
  break
mean = mean/len(loader.dataset)

Error Message:

vmirly1 · April 1, 2020, 1:24pm

I think there is no issue in the mean computation; the problem is in the data loader. You cannot put images of different sizes in a batch. So, you may want to try iterating through the dataset with batch_size=1.

caped-vigilante · April 15, 2020, 9:03am

This is my solution:

mean = 0.0
meansq = 0.0
count = 0

for index, data in enumerate(train_loader):
    mean = mean + data.sum()   # accumulate across batches instead of overwriting
    meansq = meansq + (data**2).sum()
    count += data.numel()

total_mean = mean/count
total_var = (meansq/count) - (total_mean**2)
total_std = torch.sqrt(total_var)
print("mean: " + str(total_mean))
print("std: " + str(total_std))

l00p · October 18, 2020, 4:42pm

Gave this a go like this:

Computing the mean and std of dataset - #12 by pete7.62

class Stats(PIL.ImageStat.Stat):
    def __add__(self, other):
        return Stats(list(map(add, self.h, other.h)))

loader = my_train_dataloader(batch_size=64, balance=False, return_path=False, verbose=False, augment=False, resize=False)

statistics = None
toPIL=transforms.ToPILImage()

print(PIL.__version__)

for data, _ in loader:
    for b in range(data.shape[0]):
        if statistics is None:
            print(type(toPIL(data[b])))
            statistics = Stats(toPIL(data[b]))
        else:
            statistics += Stats(toPIL(data[b]))
print(f'mean:{statistics.mean}, std:{statistics.stddev}')

# PIL version (pillow-simd)
6.2.2.post1
# confirming that I'm passing the right thing
<class 'PIL.Image.Image'>

but got this error:

<ipython-input-22-53e0795c502a> in __add__(self, other)
      2 class Stats(PIL.ImageStat.Stat):
      3     def __add__(self, other):
----> 4         return Stats(list(map(add, self.h, other.h)))
      5 
      6 loader = my_train_dataloader(batch_size=64, balance=False, return_path=False, verbose=False, augment=False, resize=False)

NameError: name 'add' is not defined

l00p · October 18, 2020, 4:46pm

Probably your dataloader is returning (tensor, label), so you could modify the loop like this:

for data, _ in loader: