Computing the mean and std of dataset

Hello. So I am trying to compute the mean and the standard deviation per channel of my train dataset (three-channel images of different shapes).
For the mean I can do it in two ways, but I get slightly different results.

import torch
from torchvision import datasets, transforms

dataset = datasets.ImageFolder('train',
                 transform=transforms.ToTensor())

First computation:

mean = 0.0
for img, _ in dataset:
    #mean += img.sum([1,2])/torch.numel(img[0])
    mean += img.mean([1,2])
mean = mean/len(dataset)
print(mean)
# tensor([0.3749, 0.3992, 0.4505])

Second computation:

sumel = 0.0
countel = 0
for img, _ in dataset:
    sumel += img.sum([1, 2])
    countel += torch.numel(img[0])
mean = sumel/countel
print(mean)
# tensor([0.3802, 0.4003, 0.4513])

Any idea why there is this small difference in the two computations?

Similarly for the std

sumel = 0.0
countel = 0
for img, _ in dataset:
    img = (img - mean.unsqueeze(1).unsqueeze(1))**2
    sumel += img.sum([1, 2])
    countel += torch.numel(img[0])
std = torch.sqrt(sumel/countel)

Is it a correct way to compute it?
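
One way to check the two-pass computation above is to run it on a few synthetic images of different sizes and compare the result with the std taken over all pixels pooled per channel. The following is a minimal sketch: the random tensors are a hypothetical stand-in for the dataset, not the original data. Because the formula divides by N, it should match `torch.std` only with `unbiased=False`.

```python
import torch

torch.manual_seed(0)
# Hypothetical stand-in for the dataset: three-channel images of different shapes.
images = [torch.rand(3, 20, 80), torch.rand(3, 253, 80), torch.rand(3, 64, 48)]

# Pass 1: per-channel mean over every pixel in the dataset.
sumel = torch.zeros(3, dtype=torch.float64)
countel = 0
for img in images:
    sumel += img.double().sum([1, 2])
    countel += img[0].numel()
mean = sumel / countel

# Pass 2: per-channel sum of squared deviations from that mean.
sq = torch.zeros(3, dtype=torch.float64)
for img in images:
    sq += ((img.double() - mean[:, None, None]) ** 2).sum([1, 2])
std = torch.sqrt(sq / countel)

# Reference: concatenate all pixels per channel and take the population std.
pooled = torch.cat([img.double().reshape(3, -1) for img in images], dim=1)
reference = pooled.std(dim=1, unbiased=False)
print(std, reference)
```

If the two printed tensors agree, the two-pass loop is computing the population std over all pixels.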

Ok, I think this small discrepancy is due to some numerical issue (floating point error?).
Anyway, the above method of computing the mean and std is not efficient for big datasets, so it is better to use a DataLoader after first resizing the images.
In the thread "About Normalization using pre-trained vgg16 networks" I found how to compute the mean and std (thanks to ptrblck):

from torch.utils import data

loader = data.DataLoader(dataset,
                         batch_size=10,
                         num_workers=0,
                         shuffle=False)

mean = 0.
std = 0.
for images, _ in loader:
    batch_samples = images.size(0) # batch size (the last batch can have smaller size!)
    images = images.view(batch_samples, images.size(1), -1)
    mean += images.mean(2).sum(0)
    std += images.std(2).sum(0)

mean /= len(loader.dataset)
std /= len(loader.dataset)

However, I have doubts about the correctness of the std computation. In the above code the stds of all the images are summed and at the end averaged over the total number of images. But I think the total std should be computed over all the pixel values of all the images in the dataset, as in my previous post.
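
The difference between the two definitions shows up on a toy case. A minimal sketch, assuming two hypothetical constant images: each has zero std on its own, so the average of the per-image stds is 0, while the std over all pixels is clearly not.

```python
import torch

# Two hypothetical single-channel "images": one all black, one all white.
dark = torch.zeros(1, 4, 4)
bright = torch.ones(1, 4, 4)
batch = torch.stack([dark, bright])  # shape (2, 1, 4, 4)
flat = batch.view(batch.size(0), batch.size(1), -1)

# Average of the per-image stds (what the loop above accumulates).
mean_of_stds = flat.std(2).sum(0) / batch.size(0)

# Std over every pixel of every image, per channel.
pooled_std = flat.permute(1, 0, 2).reshape(batch.size(1), -1).std(1, unbiased=False)

print(mean_of_stds)  # tensor([0.])
print(pooled_std)    # tensor([0.5000])
```

The averaged per-image std ignores how much the image means differ from each other, which is why it tends to come out smaller than the std over the whole dataset.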

I think in the other post by @ptrblck, he is computing the mean and std over the pixels, not over the samples in the batch. So the code in "About Normalization using pre-trained vgg16 networks" is correct, since the goal is to compute the mean and std for each batch and then take the average of these two quantities over the entire dataset.

What about this one:

dataset = datasets.ImageFolder('train', transform=transforms.Compose([transforms.Resize(256),
                             transforms.CenterCrop(224),
                             transforms.ToTensor()]))

loader = data.DataLoader(dataset,
                         batch_size=10,
                         num_workers=0,
                         shuffle=False)

mean = 0.0
for images, _ in loader:
    batch_samples = images.size(0) 
    images = images.view(batch_samples, images.size(1), -1)
    mean += images.mean(2).sum(0)
mean = mean / len(loader.dataset)

var = 0.0
for images, _ in loader:
    batch_samples = images.size(0)
    images = images.view(batch_samples, images.size(1), -1)
    var += ((images - mean.unsqueeze(1))**2).sum([0,2])
std = torch.sqrt(var / (len(loader.dataset)*224*224))

This also gives reasonable values for the std, but they differ from ptrblck's std.

Yes! Because this is computing a different std. In the other code, the purpose was to compute the mean and std for each batch over the pixels and then take their average. This is specifically applicable to batch normalization.

However, the purpose of this code that you have posted is to compute the mean of the entire data and std over the entire data, which is different from batch-normalization.

I have tried the method written by @ptrblck (thanks, ptrblck). It works, but the CPU cost is too high, more than 1200%. How can I change the code or settings to make it more efficient?

Thanks in advance. :smiley:

Usually you would only compute it once on your dataset.
Why do you think the CPU utilization is too high?

This version should run much faster and compute the same result as std(), though you need to be careful about overflow:

  #
  # True standard deviation
  #

  loader = torch.utils.data.DataLoader(
      your_dataset,
      batch_size=10,
      num_workers=0,
      shuffle=False
  )

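  mean = 0.
  meansq = 0.
  count = 0
  for data in loader:  # assumes each batch is a plain tensor
      # accumulate (weighted by batch size) instead of overwriting on every step
      mean += data.mean() * data.numel()
      meansq += (data**2).mean() * data.numel()
      count += data.numel()

  mean /= count
  meansq /= count
  std = torch.sqrt(meansq - mean**2)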
  print("mean: " + str(mean))
  print("std: " + str(std))
  print()

I tried your code, but it raised an error on the line that calls data.mean():
AttributeError: ‘list’ object has no attribute ‘mean’

I mean, for the two codes you posted, each does different things.

The first is the mean over the means in each image.

The second is the mean over all images.

Why should they be the same?

If all images are of the same size and all batches are of the same size they’re mathematically equal. In OP’s case (three-channel images of different shapes) that’s of course not the case.
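
A small example makes the point concrete. This is a minimal sketch with two hypothetical constant images of different sizes: the first computation weights each image equally, the second weights each pixel equally.

```python
import torch

# Hypothetical single-channel images of different sizes.
small = torch.full((1, 2, 2), 1.0)  # 4 pixels, all 1.0
large = torch.full((1, 4, 4), 0.0)  # 16 pixels, all 0.0
images = [small, large]

# First computation: average of the per-image means (each image counts equally).
mean_of_means = sum(img.mean([1, 2]) for img in images) / len(images)

# Second computation: total sum over total pixel count (each pixel counts equally).
pooled_mean = sum(img.sum([1, 2]) for img in images) / sum(img[0].numel() for img in images)

print(mean_of_means)  # tensor([0.5000])
print(pooled_mean)    # tensor([0.2000])
```

With equally sized images both weightings coincide, so the two results would agree up to floating point rounding.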

It might be a bit too late, but PIL provides nice functionality for your problem in the ImageStat.Stat class. Its calculations are based on the histogram of the image and therefore need only O(1) memory, but it considers only one image. To deal with more images, I extended the Stat class with an __add__ method that combines the histograms of the two objects (a bit like concatenating two images and generating the Stat object from the result):

class Stats(ImageStat.Stat):
    def __add__(self, other):
        return Stats(list(map(add, self.h, other.h)))

The histogram is stored in h. Both histograms (of self and other) are summed, and a new Stats object is initialized with the combined histogram instead of an image.

Using this new “Stats” class I could do something like:

loader = DataLoader(dataset, batch_size=10, num_workers=5)

statistics = None
for data in loader:
    for b in range(data.shape[0]):
        if statistics is None:
            statistics = Stats(tf.to_pil_image(data[b]))
        else:
            statistics += Stats(tf.to_pil_image(data[b]))

And from there on use normal Stat calls like:

print(f'mean:{statistics.mean}, std:{statistics.stddev}')
# mean:[199.59, 156.30, 170.59], std:[31.30, 31.28, 35.95]

Note that although this is quite a neat solution, it is far from the most efficient.

import torchvision.transforms.functional as tf  # to_pil_image lives here, not in TensorFlow
from operator import add
from PIL import ImageStat

This works for me, thanks. It is better and more stable than the other methods I tried.

@ptrblck @vmirly1 I still have a doubt regarding normalization. I want to normalize my data; what approach should I take?
I don't want to resize my data, as the image size varies a lot, from (20,80,3) to (253,80,3).
I also want to try ResNets, which use batch normalization. Do I need to do a batch normalization?
And again, if I don't resize, I get an error saying “invalid argument” when using the code with the data loader.
Your clarification and help would be much appreciated. :grinning:

Could you post the complete error message please?

# Doesn't work unless the batch size is 1.
dataset = datasets.ImageFolder(('train'),transform = transforms.ToTensor())
loader = torch.utils.data.DataLoader(dataset,
                                     batch_size = 10,
                                     num_workers = 0,
                                     shuffle = False)
mean = 0.0
for images,_ in loader:
  print(images.shape)
  batch_samples = images.size(0)
  
  images = images.view(batch_samples,images.size(1),-1)
  print(images.shape)
  mean += images.mean(2).sum(0)
  print(images.mean(2).sum(0))
  break
mean = mean/len(loader.dataset)

Error Message:

I think there is no issue in the mean computation; the problem is in the data loader. You cannot put images of different sizes in a batch. So you may want to try iterating through the dataset with batch_size=1.
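
To illustrate, here is a minimal sketch that uses a small in-memory dataset (the hypothetical class `VariableSizeImages` stands in for an ImageFolder with images of different shapes). It shows the default batching failing for batch_size > 1 and the per-channel mean computed with batch_size=1.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class VariableSizeImages(Dataset):
    """Hypothetical stand-in for an ImageFolder whose images differ in shape."""

    def __init__(self):
        torch.manual_seed(0)
        self.images = [torch.rand(3, 20, 80), torch.rand(3, 253, 80), torch.rand(3, 100, 80)]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], 0  # (image, dummy label)


dataset = VariableSizeImages()

# With batch_size > 1 the default collate function tries to stack the images and fails.
try:
    next(iter(DataLoader(dataset, batch_size=3)))
    stacked = True
except RuntimeError:
    stacked = False

# With batch_size=1 every batch holds a single image, so shapes never clash.
sumel = torch.zeros(3)
countel = 0
for images, _ in DataLoader(dataset, batch_size=1):
    sumel += images.sum([0, 2, 3])
    countel += images.size(2) * images.size(3)
mean = sumel / countel
print(stacked, mean)
```

Accumulating sums and pixel counts, rather than per-image means, keeps the result correct even though the images differ in size.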

This is my solution:

mean = 0.0
meansq = 0.0
count = 0

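for index, data in enumerate(train_loader):  # assumes each batch is a plain tensor
    mean = mean + data.sum()  # accumulate; a plain assignment would keep only the last batch
    meansq = meansq + (data**2).sum()
    count += data.numel()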

total_mean = mean/count
total_var = (meansq/count) - (total_mean**2)
total_std = torch.sqrt(total_var)
print("mean: " + str(total_mean))
print("std: " + str(total_std))
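
The same running-sums idea can be kept per channel. The sketch below is one possible variant, not the poster's code: the random `fake_images` tensor and the `TensorDataset` wrapper are assumptions made so the example runs on its own. Accumulating in float64 reduces the rounding error that the E[x²] − E[x]² formula is prone to.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Hypothetical stand-in for the real dataset: 50 RGB images of 32x32 pixels.
fake_images = torch.rand(50, 3, 32, 32)
train_loader = DataLoader(TensorDataset(fake_images), batch_size=10)

channel_sum = torch.zeros(3, dtype=torch.float64)
channel_sumsq = torch.zeros(3, dtype=torch.float64)
count = 0  # pixels seen per channel

for (batch,) in train_loader:  # TensorDataset yields one-element tuples
    batch = batch.double()     # float64 accumulation reduces rounding error
    channel_sum += batch.sum(dim=[0, 2, 3])
    channel_sumsq += (batch ** 2).sum(dim=[0, 2, 3])
    count += batch.size(0) * batch.size(2) * batch.size(3)

mean = channel_sum / count
var = channel_sumsq / count - mean ** 2
std = torch.sqrt(var.clamp(min=0))  # guard against tiny negative values from rounding
print(mean, std)
```

This needs only one pass over the data and yields the population std (division by N) for each channel.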

Gave this a go like this:


class Stats(PIL.ImageStat.Stat):
    def __add__(self, other):
        return Stats(list(map(add, self.h, other.h)))

loader = my_train_dataloader(batch_size=64, balance=False, return_path=False, verbose=False, augment=False, resize=False)

statistics = None
toPIL=transforms.ToPILImage()

print(PIL.__version__)

for data, _ in loader:
    for b in range(data.shape[0]):
        if statistics is None:
            print(type(toPIL(data[b])))
            statistics = Stats(toPIL(data[b]))
        else:
            statistics += Stats(toPIL(data[b]))
print(f'mean:{statistics.mean}, std:{statistics.stddev}')

# PIL version (pillow-simd)
6.2.2.post1
# confirming that I'm passing the right thing
<class 'PIL.Image.Image'>

but got this error:

<ipython-input-22-53e0795c502a> in __add__(self, other)
      2 class Stats(PIL.ImageStat.Stat):
      3     def __add__(self, other):
----> 4         return Stats(list(map(add, self.h, other.h)))
      5 
      6 loader = my_train_dataloader(batch_size=64, balance=False, return_path=False, verbose=False, augment=False, resize=False)

NameError: name 'add' is not defined
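
The NameError most likely means that `add` was meant to be `operator.add`, which the snippet never imports. A minimal sketch under that assumption follows; the two solid-color test images are hypothetical and are there only to make the result easy to check.

```python
from operator import add  # the missing name: adds two histogram bins in map()

from PIL import Image, ImageStat


class Stats(ImageStat.Stat):
    def __add__(self, other):
        # Merge the two histograms bin by bin and build a new Stats from the result.
        return Stats(list(map(add, self.h, other.h)))


# Hypothetical test images: one solid black, one solid gray.
black = Image.new("RGB", (8, 8), (0, 0, 0))
gray = Image.new("RGB", (8, 8), (100, 100, 100))

combined = Stats(black) + Stats(gray)
print(combined.mean)    # per-band mean over both images
print(combined.stddev)  # per-band std over both images
```

Half of the pixels are 0 and half are 100 in every band, so both the mean and the std should come out at 50 per band.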

Probably your dataloader is returning (tensor, label), so you could modify the loop like this:

for data, _ in loader: