Why do you step through the data in loader twice, once for mean and once for std? Wouldn’t it be quicker to calculate both at the same time?

Sure, you can do it in one pass by accumulating the total number of samples, the total sum, and the total sum of squares. Once you have these, you can derive both the mean and the std from them.
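A minimal sketch of that single-pass idea, assuming batches shaped `(N, C, H, W)` and a loader that yields one-element tuples (here a `TensorDataset` over random data). The helper name `channel_stats` is made up for illustration, and the result is the population (biased) std:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def channel_stats(loader):
    """Per-channel mean and population std in a single pass over the loader."""
    count = 0
    total = None
    total_sq = None
    for (images,) in loader:
        # float64 accumulators limit the cancellation error in E[x^2] - E[x]^2
        images = images.double()
        c = images.shape[1]
        flat = images.transpose(0, 1).reshape(c, -1)  # (C, N*H*W)
        if total is None:
            total = torch.zeros(c, dtype=torch.float64)
            total_sq = torch.zeros(c, dtype=torch.float64)
        total += flat.sum(dim=1)
        total_sq += (flat ** 2).sum(dim=1)
        count += flat.shape[1]
    mean = total / count
    var = total_sq / count - mean ** 2  # population variance
    return mean, var.clamp(min=0).sqrt()


# Hypothetical data, only to exercise the helper
torch.manual_seed(0)
data = torch.randn(64, 3, 8, 8)
loader = DataLoader(TensorDataset(data), batch_size=16)
mean, std = channel_stats(loader)
print(mean, std)
```

The sum-of-squares formula can lose precision when the mean is large relative to the spread, which is why the accumulators are kept in float64 here.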

Thanks for your solution. The std computation still seems to be incorrect. I re-implemented it and compared it against torch.mean and torch.std, and I get exactly the same results.

```
import time

import numpy as np
import torch
from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(1000, 3, 224, 224)

    def __getitem__(self, index):
        x = self.data[index]
        return x

    def __len__(self):
        return len(self.data)


def main():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    dataset = MyDataset()

    # Reference: per-channel stats over the whole tensor at once.
    # torch.std applies Bessel's correction by default; with this many
    # elements the difference from the population std is negligible.
    start = time.perf_counter()
    data = dataset.data.to(device)
    print("Mean:", torch.mean(data, dim=(0, 2, 3)))
    print("Std:", torch.std(data, dim=(0, 2, 3)))
    print("Elapsed time: %.3f seconds" % (time.perf_counter() - start))
    print()

    # First pass: average the per-image channel means.
    start = time.perf_counter()
    mean = 0.
    for data in dataset:
        data = data.to(device)
        mean += torch.mean(data, dim=(1, 2))
    mean /= len(dataset)
    print("Mean:", mean)

    # Second pass: accumulate squared deviations from the global mean.
    temp = 0.
    nb_samples = 0.
    for data in dataset:
        data = data.to(device)
        temp += ((data.view(3, -1) - mean.unsqueeze(1)) ** 2).sum(dim=1)
        nb_samples += np.prod(data.size()[1:])
    std = torch.sqrt(temp / nb_samples)
    print("Std:", std)
    print("Elapsed time: %.3f seconds" % (time.perf_counter() - start))


if __name__ == "__main__":
    main()
```

People finding this post, please be careful:

```
avg(std(minibatch_1) + std(minibatch_2) + .. ) != std(dataset)
```

Rather, compute `avg(var(minibatch_1) + var(minibatch_2) + ..)` and take its `sqrt(..)`, as per the SO link shared by @amit_js.

With the first approach (average of the std):

E[(sqrt(S_1) + sqrt(S_2) + … + sqrt(S_n)) / n] = E[sqrt(S_1)] if the S_i are i.i.d., as in our case (E stands for the expected value and the S_i are the sample variances of the mini-batches).

E[sqrt(S_1)] <= sqrt(E[S_1]) = std(X) by Jensen’s inequality.

With the second approach (sqrt of the average of the var):

E[sqrt((S_1 + S_2 + … + S_n) / n)] = E[sqrt(S_tot)] <= sqrt(E[S_tot]) = sqrt(var(X)) = std(X)

So both approaches underestimate the true std (correct?).

Am I missing something? Is there a way to show that the second approach is better than the first?
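Not a proof, but a quick numerical check can make the ordering concrete. The sketch below rests on my own assumptions: synthetic 1-D data, equal batch sizes, population (biased) variances, and batches whose means deliberately differ. It compares the average of the stds, the sqrt of the average variance, and the true std. It also shows that adding the variance of the batch means (law of total variance) recovers the true value exactly in this setting:

```python
import torch

torch.manual_seed(0)
# Hypothetical data: 8 equal-sized mini-batches with different means
batches = [torch.randn(32, dtype=torch.float64) + shift for shift in range(8)]
full = torch.cat(batches)

true_std = full.std(unbiased=False)

batch_std = torch.stack([b.std(unbiased=False) for b in batches])
batch_var = batch_std ** 2
batch_mean = torch.stack([b.mean() for b in batches])

# Approach 1: average of the per-batch stds
avg_of_std = batch_std.mean()
# Approach 2: sqrt of the average per-batch variance
sqrt_avg_var = batch_var.mean().sqrt()
# Law of total variance: within-batch variance + variance of the batch means
pooled_std = (batch_var.mean() + batch_mean.var(unbiased=False)).sqrt()

print("avg of std:     ", avg_of_std.item())
print("sqrt of avg var:", sqrt_avg_var.item())
print("pooled:         ", pooled_std.item())
print("true std:       ", true_std.item())
```

With these definitions the first value can never exceed the second, and the second can never exceed the true std. The gap between the second and the true std is exactly the spread of the batch means, so it shrinks when the batches are shuffled draws from the same distribution.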

```
mean = 0.
std = 0.
nb_samples = 0.
for data in dataloader:
    print(type(data))
    batch_samples = data.size(0)
    data.shape(0)
    data = data.view(batch_samples, data.size(1), -1)
    mean += data.mean(2).sum(0)
    std += data.std(2).sum(0)
    nb_samples += batch_samples
mean /= nb_samples
std /= nb_samples
```

The error is:

```
<class 'dict'>
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-51-e8ba3c8718bb> in <module>
5 for data in dataloader:
6 print(type(data))
----> 7 batch_samples = data.size(0)
8
9 data.shape(0)
AttributeError: 'dict' object has no attribute 'size'
```

This is the result of `print(data)`:

```
{'image': tensor([[[[0.2961, 0.2941, 0.2941, ..., 0.2460, 0.2456, 0.2431],
[0.2953, 0.2977, 0.2980, ..., 0.2442, 0.2431, 0.2431],
[0.2941, 0.2941, 0.2980, ..., 0.2471, 0.2471, 0.2448],
...,
[0.3216, 0.3216, 0.3216, ..., 0.2482, 0.2471, 0.2471],
[0.3216, 0.3241, 0.3253, ..., 0.2471, 0.2471, 0.2450],
[0.3216, 0.3216, 0.3216, ..., 0.2471, 0.2452, 0.2431]],
[[0.2961, 0.2941, 0.2941, ..., 0.2460, 0.2456, 0.2431],
[0.2953, 0.2977, 0.2980, ..., 0.2442, 0.2431, 0.2431],
[0.2941, 0.2941, 0.2980, ..., 0.2471, 0.2471, 0.2448],
...,
[0.3216, 0.3216, 0.3216, ..., 0.2482, 0.2471, 0.2471],
[0.3216, 0.3241, 0.3253, ..., 0.2471, 0.2471, 0.2450],
[0.3216, 0.3216, 0.3216, ..., 0.2471, 0.2452, 0.2431]],
[[0.2961, 0.2941, 0.2941, ..., 0.2460, 0.2456, 0.2431],
[0.2953, 0.2977, 0.2980, ..., 0.2442, 0.2431, 0.2431],
[0.2941, 0.2941, 0.2980, ..., 0.2471, 0.2471, 0.2448],
...,
[0.3216, 0.3216, 0.3216, ..., 0.2482, 0.2471, 0.2471],
[0.3216, 0.3241, 0.3253, ..., 0.2471, 0.2471, 0.2450],
[0.3216, 0.3216, 0.3216, ..., 0.2471, 0.2452, 0.2431]]],
[[[0.3059, 0.3093, 0.3140, ..., 0.3373, 0.3363, 0.3345],
[0.3059, 0.3093, 0.3165, ..., 0.3412, 0.3389, 0.3373],
[0.3098, 0.3131, 0.3176, ..., 0.3450, 0.3412, 0.3412],
...,
[0.2931, 0.2966, 0.2931, ..., 0.2549, 0.2539, 0.2510],
[0.2902, 0.2902, 0.2902, ..., 0.2510, 0.2510, 0.2502],
[0.2864, 0.2900, 0.2863, ..., 0.2510, 0.2510, 0.2510]],
[[0.3059, 0.3093, 0.3140, ..., 0.3373, 0.3363, 0.3345],
[0.3059, 0.3093, 0.3165, ..., 0.3412, 0.3389, 0.3373],
[0.3098, 0.3131, 0.3176, ..., 0.3450, 0.3412, 0.3412],
...,
[0.2931, 0.2966, 0.2931, ..., 0.2549, 0.2539, 0.2510],
[0.2902, 0.2902, 0.2902, ..., 0.2510, 0.2510, 0.2502],
[0.2864, 0.2900, 0.2863, ..., 0.2510, 0.2510, 0.2510]],
[[0.3059, 0.3093, 0.3140, ..., 0.3373, 0.3363, 0.3345],
[0.3059, 0.3093, 0.3165, ..., 0.3412, 0.3389, 0.3373],
[0.3098, 0.3131, 0.3176, ..., 0.3450, 0.3412, 0.3412],
...,
[0.2931, 0.2966, 0.2931, ..., 0.2549, 0.2539, 0.2510],
[0.2902, 0.2902, 0.2902, ..., 0.2510, 0.2510, 0.2502],
[0.2864, 0.2900, 0.2863, ..., 0.2510, 0.2510, 0.2510]]],
[[[0.2979, 0.2980, 0.3015, ..., 0.2825, 0.2784, 0.2784],
[0.2980, 0.2980, 0.2980, ..., 0.2830, 0.2764, 0.2795],
[0.2980, 0.2980, 0.3012, ..., 0.2827, 0.2814, 0.2797],
...,
[0.3282, 0.3293, 0.3294, ..., 0.2238, 0.2235, 0.2235],
[0.3255, 0.3255, 0.3255, ..., 0.2240, 0.2235, 0.2229],
[0.3225, 0.3255, 0.3255, ..., 0.2216, 0.2235, 0.2223]],
[[0.2979, 0.2980, 0.3015, ..., 0.2825, 0.2784, 0.2784],
[0.2980, 0.2980, 0.2980, ..., 0.2830, 0.2764, 0.2795],
[0.2980, 0.2980, 0.3012, ..., 0.2827, 0.2814, 0.2797],
...,
[0.3282, 0.3293, 0.3294, ..., 0.2238, 0.2235, 0.2235],
[0.3255, 0.3255, 0.3255, ..., 0.2240, 0.2235, 0.2229],
[0.3225, 0.3255, 0.3255, ..., 0.2216, 0.2235, 0.2223]],
[[0.2979, 0.2980, 0.3015, ..., 0.2825, 0.2784, 0.2784],
[0.2980, 0.2980, 0.2980, ..., 0.2830, 0.2764, 0.2795],
[0.2980, 0.2980, 0.3012, ..., 0.2827, 0.2814, 0.2797],
...,
[0.3282, 0.3293, 0.3294, ..., 0.2238, 0.2235, 0.2235],
[0.3255, 0.3255, 0.3255, ..., 0.2240, 0.2235, 0.2229],
[0.3225, 0.3255, 0.3255, ..., 0.2216, 0.2235, 0.2223]]]],
dtype=torch.float64), 'landmarks': tensor([[[160.2964, 98.7339],
[223.0788, 72.5067],
[ 82.4163, 70.3733],
[152.3213, 137.7867]],
[[198.3194, 74.4341],
[273.7188, 118.7733],
[117.7113, 80.8000],
[182.0750, 107.2533]],
[[137.4789, 92.8523],
[174.9463, 40.3467],
[ 57.3013, 59.1200],
[129.3375, 131.6533]]], dtype=torch.float64)}
```

```
dataloader = DataLoader(transformed_dataset, batch_size=3,
                        shuffle=True, num_workers=4)
```

and

```
transformed_dataset = MothLandmarksDataset(
    csv_file='moth_gt.csv',
    root_dir='.',
    transform=transforms.Compose(
        [
            Rescale(256),
            RandomCrop(224),
            ToTensor()#,
            ##transforms.Normalize(mean = [ 0.485, 0.456, 0.406 ],
            ##                     std = [ 0.229, 0.224, 0.225 ])
        ]
    )
)
```

and

```
class MothLandmarksDataset(Dataset):
    """Face Landmarks dataset."""

    def __init__(self, csv_file, root_dir, transform=None):
        """
        Args:
            csv_file (string): Path to the csv file with annotations.
            root_dir (string): Directory with all the images.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.landmarks_frame = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.landmarks_frame)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        img_name = os.path.join(self.root_dir, self.landmarks_frame.iloc[idx, 0])
        image = io.imread(img_name)
        landmarks = self.landmarks_frame.iloc[idx, 1:]
        landmarks = np.array([landmarks])
        landmarks = landmarks.astype('float').reshape(-1, 2)
        sample = {'image': image, 'landmarks': landmarks}
        if self.transform:
            sample = self.transform(sample)
        return sample
```

Your `data` is a `dict` rather than a tensor, so you would need to access the `image` entry inside it.
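As a rough sketch, the loop from the question could index the batch by key before calling tensor methods. `DictDataset` below is a made-up stand-in that returns samples with the same keys as the dataset above. Note that the loop still averages per-image stds, exactly as in the question, so the std is only an approximation:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class DictDataset(Dataset):
    """Hypothetical stand-in returning dict samples with 'image' and 'landmarks'."""

    def __init__(self, n=12):
        self.images = torch.rand(n, 3, 16, 16, dtype=torch.float64)
        self.landmarks = torch.rand(n, 4, 2, dtype=torch.float64)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return {"image": self.images[idx], "landmarks": self.landmarks[idx]}


dataset = DictDataset()
dataloader = DataLoader(dataset, batch_size=3)

mean = 0.
std = 0.
nb_samples = 0
for batch in dataloader:
    images = batch["image"]  # the default collate keeps the dict structure
    batch_samples = images.size(0)
    images = images.view(batch_samples, images.size(1), -1)
    mean += images.mean(2).sum(0)
    std += images.std(2).sum(0)  # average of per-image stds, as in the question
    nb_samples += batch_samples
mean /= nb_samples
std /= nb_samples
print(mean, std)
```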

Hi @ptrblck,

I want to compute the mean and standard deviation of an autoencoder’s latent space while training it. Can you suggest a method for that?

Thanks,

I’m not sure if I understand the use case correctly, but you could use `torch.mean` and `torch.std` on the latent activation tensor during the forward pass.

If you want to calculate these stats for the latent tensors of the complete dataset, you could store these activations by returning them directly in the `forward` or via a forward hook and calculate the stats after the whole epoch is done.
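A minimal sketch of the forward-hook variant, assuming a toy autoencoder whose encoder output is the latent tensor. The `AutoEncoder` class, the `save_latent` hook, and the layer sizes are all made up for illustration:

```python
import torch
import torch.nn as nn


class AutoEncoder(nn.Module):
    """Hypothetical toy model; the encoder output is treated as the latent."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(20, 4)
        self.decoder = nn.Linear(4, 20)

    def forward(self, x):
        return self.decoder(self.encoder(x))


torch.manual_seed(0)
model = AutoEncoder()
latents = []


def save_latent(module, inputs, output):
    # Detach and move to CPU so stored activations do not keep the graph alive
    latents.append(output.detach().cpu())


handle = model.encoder.register_forward_hook(save_latent)

data = torch.randn(64, 20)
for batch in data.split(16):
    model(batch)  # a real training step (loss, backward, optimizer) would go here
handle.remove()

# Stats over all stored latents once the epoch is done
all_latents = torch.cat(latents)
latent_mean = all_latents.mean(dim=0)
latent_std = all_latents.std(dim=0)
print(latent_mean, latent_std)
```

If the weights are updated between batches, the stored latents come from slightly different models, so the stats describe the epoch rather than the final encoder.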

Thanks for your response.

Wouldn’t it be computationally expensive to store latent tensors of the entire dataset?

Can it be done for every batch and then take an average of that?

Could you share a code snippet for this?

It depends how large the dataset is and how large each latent tensor is.

If you cannot store all tensors during training, you would have to calculate the stats on the fly.

Here is an example of using forward hooks.

Please answer this also: Is it possible per batch and then take an average of that?

As mentioned in the discussion above, averaging per-batch values gives only a very rough estimate, and for varied images it will not be a good estimator. It is better to calculate the mean and standard deviation using Welford’s method, which is numerically stable as well as fairly fast.

Read more about it here:

https://jonisalonen.com/2013/deriving-welfords-method-for-computing-variance/#:~:text=The%20definition%20can%20be%20converted,squared%20differences%20from%20the%20mean.
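For batched data, one way this could look is the batch-wise generalization of Welford's update (often attributed to Chan et al.), which merges each batch's mean and sum of squared deviations into running values. This is a sketch under my own assumptions: the `RunningStats` class name is made up, and batches are shaped `(N, D)`:

```python
import torch


class RunningStats:
    """Running per-feature mean/std, updated one batch at a time."""

    def __init__(self):
        self.count = 0
        self.mean = None
        self.m2 = None  # running sum of squared deviations from the mean

    def update(self, batch):
        batch = batch.detach().double()
        n_b = batch.shape[0]
        mean_b = batch.mean(dim=0)
        m2_b = ((batch - mean_b) ** 2).sum(dim=0)
        if self.count == 0:
            self.count, self.mean, self.m2 = n_b, mean_b, m2_b
            return
        total = self.count + n_b
        delta = mean_b - self.mean
        # Merge the batch statistics into the running statistics
        self.mean = self.mean + delta * n_b / total
        self.m2 = self.m2 + m2_b + delta ** 2 * self.count * n_b / total
        self.count = total

    def std(self, unbiased=False):
        return (self.m2 / (self.count - int(unbiased))).sqrt()


# Hypothetical data with an uneven last batch
torch.manual_seed(0)
data = torch.randn(100, 4) * 3 + 10
stats = RunningStats()
for batch in data.split(17):
    stats.update(batch)
print(stats.mean, stats.std())
```

Only the running count, mean, and `m2` are kept in memory, so nothing has to be stored per sample, and the result matches a full-dataset computation up to floating-point error.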

I think @ptrblck’s answer is not correct, or at least not very accurate, as many have pointed out. I use two passes over the `dataloader` to get the exact value:

```
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.ToTensor()])
dataset = datasets.CIFAR10(root='cifar10', train=True, download=False, transform=transform)
dataloader = DataLoader(dataset, batch_size=1, num_workers=1, shuffle=False)

mean = torch.zeros(3)
std = torch.zeros(3)

# First pass: per-channel mean over all images.
for i, data in enumerate(dataloader):
    if i % 10000 == 0:
        print(i)
    data = data[0].squeeze(0)  # (1, C, H, W) -> (C, H, W)
    if i == 0:
        size = data.size(1) * data.size(2)
    mean += data.sum((1, 2)) / size
mean /= len(dataloader)
print(mean)

# Second pass: per-channel population variance around the global mean.
mean = mean.unsqueeze(1).unsqueeze(2)
for i, data in enumerate(dataloader):
    if i % 10000 == 0:
        print(i)
    data = data[0].squeeze(0)
    std += ((data - mean) ** 2).sum((1, 2)) / size
std /= len(dataloader)
std = std.sqrt()
print(std)
```

with output:

```
tensor([0.4914, 0.4822, 0.4465])
tensor([0.2470, 0.2435, 0.2616])
```

Approaching this topic now in October 2021, I see that this thread is the main go-to for calculating normalization values using PyTorch. Calculating these values seems like a standard computation that should be implemented in standard libraries. Does anybody know if the computation is implemented in one of the standard libraries like core PyTorch, PyTorch Lightning, or fastai? Perhaps this should be raised as an issue on GitHub for these libraries.

This method should work on any image dataset, maybe with slight tweaks. It has been tested and returns the same results as using torch.mean and torch.std on the entire dataset, assuming we have image data in the format C * H * W.

Hi Piotr,

What if our loader depends on the transform, which in turn depends on the mean, std, and var of each of the train, val, and test sets?

I have the following from the PyTorch tutorial for Inception V3:

```
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(299),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize(299),
        transforms.CenterCrop(299),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'test': transforms.Compose([
        transforms.Resize(299),
        transforms.CenterCrop(299),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
}
print("Initializing Datasets and Dataloaders...")
# Create training and validation datasets
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x), data_transforms[x]) for x in ['train', 'val', 'test']}
# Create training and validation dataloaders
print('batch size: ', batch_size)
dataloaders_dict = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=batch_size, shuffle=True, num_workers=4) for x in ['train', 'val', 'test']}
```

I am not sure what loader exactly is wrt the code I shared above. Any help is really appreciated.

I want to use your code to get the mean, but I am not sure how exactly I could use it with this loader from the tutorial, which depends on `([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])` from ImageNet. Also, do you set exactly the same numbers for all three sets (train, val, test)? The tutorial does so, but I am not sure that is necessarily correct; especially if train/val/test are split 60/20/20, there might be some differences between their means.

So, are you computing it across the entire dataset or for each of the train, val, and test subsets?

P.S.: To me it makes sense to use the same mean and std for all three of train, val, and test when the model is pretrained on ImageNet and fine-tuned on natural images, since we assume natural images have a lot in common with ImageNet images. But in my case, I am not fine-tuning on natural images, so I need to calculate dataset-specific means and stds for the data transform.

If you want to compute the `mean` and `stddev` of the input images, you should not apply `Normalize` to them, but either compute these stats from the original inputs or after calling `ToTensor` (which would normalize the data to `[0, 1]`).

I would compute it from the training set, as I would consider calculating the stats from the val or test splits a data leak.
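A rough sketch of that workflow, using random tensors as hypothetical stand-ins for images that have already gone through `ToTensor` (values in `[0, 1]`). The stats come from the training split only and are then reused unchanged for val and test. The manual `normalize` helper is illustrative; in the tutorial pipeline the same values could instead be passed to `transforms.Normalize`:

```python
import torch

torch.manual_seed(0)
# Hypothetical stand-ins for the three splits after ToTensor, shaped (N, C, H, W)
splits = {
    "train": torch.rand(60, 3, 8, 8),
    "val": torch.rand(20, 3, 8, 8),
    "test": torch.rand(20, 3, 8, 8),
}

# Per-channel stats from the training split only (no Normalize applied yet)
mean = splits["train"].mean(dim=(0, 2, 3))
std = splits["train"].std(dim=(0, 2, 3))


def normalize(images):
    # Same per-channel operation as Normalize: (x - mean) / std
    return (images - mean[None, :, None, None]) / std[None, :, None, None]


# The training stats are reused for every split
normalized = {name: normalize(x) for name, x in splits.items()}
print(mean.tolist(), std.tolist())
```

For a real dataset that does not fit in memory, the two stats would be accumulated over a loader built with `ToTensor` only, and the final transforms would be created afterwards with those values.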

Thanks for your response. Do you think this method makes sense? Finding mean and std for each of the train, val, and test dataloader to use for Normalize in data transform - #2 by Mona_Jalal

The explanation is clear and totally right: the average of the partial standard deviations cannot be treated as the global std.