Computing the mean and std of dataset

kuzand · January 17, 2019, 9:01pm

Ok, I think this small discrepancy is due to some numerical issue (floating point error?).
Anyway, the above method of computing mean and std is not efficient for big datasets. So it is better to use the dataloader, after first resizing the images.
I found here About Normalization using pre-trained vgg16 networks how compute the mean and std (thanks to ptrblck),

loader = data.DataLoader(dataset,
                         batch_size=10,
                         num_workers=0,
                         shuffle=False)

mean = 0.
std = 0.
for images, _ in loader:
    batch_samples = images.size(0) # batch size (the last batch can have smaller size!)
    images = images.view(batch_samples, images.size(1), -1)
    mean += images.mean(2).sum(0)
    std += images.std(2).sum(0)

mean /= len(loader.dataset)
std /= len(loader.dataset)

However I have doubts for the correctness of the computation of the std. In the above code the std’s of all the images are summed and at the end they are averaged by the total number of images. But I think that the total std should be computed over all the pixel values of all the images in the dataset, as in my previous post.