Getting negative mean and std for a subset of my dataset

Mona_Jalal · March 29, 2022, 2:14am

I have created a slightly smaller subset of my data, designed so that it has the same number of positive and negative samples. Now I get negative values in the mean tensors, and std values close to 1:

train mean and std: tensor([0.0050, 0.0225, 0.0250]) tensor([0.9833, 0.9977, 0.9932])
val mean and std: tensor([-0.0584, -0.0225, -0.0385]) tensor([1.0157, 1.0408, 1.0215])
test mean and std: tensor([-0.1491, -0.0664, -0.0715]) tensor([1.0436, 1.0221, 1.0180])

Is any part of the code below wrong?

import torch

# get the per-channel mean and std of the train, test and val sets for the data transform
def get_mean_std(loader):
    # VAR[X] = E[X**2] - E[X]**2
    channels_sum, channels_squared_sum, num_batches = 0, 0, 0
    for data, _ in loader:
        # data has shape (N, C, H, W); reduce over batch, height and width
        channels_sum += torch.mean(data, dim=[0,2,3])
        channels_squared_sum += torch.mean(data**2, dim=[0,2,3])
        num_batches += 1

    # averaging per-batch means is exact only if all batches have the same size
    mean = channels_sum/num_batches
    std = (channels_squared_sum/num_batches - mean**2)**0.5
    return mean, std

train_mean, train_std = get_mean_std(dataloaders_dict['train'])
print(train_mean, train_std)
test_mean, test_std = get_mean_std(dataloaders_dict['test'])
print(test_mean, test_std)
val_mean, val_std = get_mean_std(dataloaders_dict['val'])
print(val_mean, val_std)
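
One caveat with averaging per-batch statistics is that a smaller final batch gets the same weight as a full one. The sketch below is a hypothetical variant (the name get_mean_std_exact and the synthetic data are assumptions, not part of the original code) that accumulates raw pixel sums instead and checks the result against statistics computed directly on the full tensor:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def get_mean_std_exact(loader):
    """Per-channel mean/std accumulated from pixel sums (exact for uneven batches)."""
    channel_sum = torch.zeros(3, dtype=torch.float64)
    channel_sq_sum = torch.zeros(3, dtype=torch.float64)
    num_pixels = 0
    for data, _ in loader:
        data = data.double()
        channel_sum += data.sum(dim=[0, 2, 3])
        channel_sq_sum += (data ** 2).sum(dim=[0, 2, 3])
        num_pixels += data.size(0) * data.size(2) * data.size(3)
    mean = channel_sum / num_pixels
    # VAR[X] = E[X**2] - E[X]**2; clamp guards against tiny negative rounding errors
    std = (channel_sq_sum / num_pixels - mean ** 2).clamp(min=0).sqrt()
    return mean, std

torch.manual_seed(0)
images = torch.rand(10, 3, 8, 8)            # synthetic stand-in for image data in [0, 1]
labels = torch.zeros(10, dtype=torch.long)
loader = DataLoader(TensorDataset(images, labels), batch_size=4)  # batches of 4, 4, 2

mean, std = get_mean_std_exact(loader)

# Reference values computed directly on the full tensor
ref_mean = images.double().mean(dim=[0, 2, 3])
ref_std = ((images.double() - ref_mean.view(1, 3, 1, 1)) ** 2).mean(dim=[0, 2, 3]).sqrt()
print(torch.allclose(mean, ref_mean), torch.allclose(std, ref_std))
```

The accumulation in float64 is a design choice to keep the E[X**2] - E[X]**2 subtraction numerically stable; it is not required by the original approach.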

Here are the values I had for the larger version of my dataset:

data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(input_size),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        #transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        transforms.Normalize([0.7031, 0.5487, 0.6750], [0.2115, 0.2581, 0.1952])
    ]),
    'val': transforms.Compose([
        transforms.Resize(input_size),
        transforms.CenterCrop(input_size),
        transforms.ToTensor(),
        #transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        transforms.Normalize([0.7016, 0.5549, 0.6784], [0.2099, 0.2583, 0.1998])
    ]),
    
    'test': transforms.Compose([
        transforms.Resize(input_size),
        transforms.CenterCrop(input_size),
        transforms.ToTensor(),
        #transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        transforms.Normalize([0.7048, 0.5509, 0.6763], [0.2111, 0.2576, 0.1979])
    ])
}

I also understand that I should use only the train mean and std for val and test as well, rather than calculating separate values for those splits, since doing so would leak information from val and test, based on what @ptrblck mentioned in a previous post. But I don’t understand how this leak happens. Is there any scientific paper discussing this phenomenon?

Mona_Jalal · March 29, 2022, 2:33am

I fixed this problem and marked this post as the solution. However, my question about @ptrblck’s comment still stands. I would like to understand, mathematically or with a proof, why we want to use the mean and std of the train set to normalize the val and test sets too. Any feedback is really appreciated.

train mean and std: tensor([0.7057, 0.5569, 0.6816]) tensor([0.2047, 0.2554, 0.1914])
val mean and std: tensor([0.6925, 0.5451, 0.6686]) tensor([0.2145, 0.2682, 0.2022])
test mean and std: tensor([0.6702, 0.5377, 0.6640]) tensor([0.2191, 0.2640, 0.2034])

I had forgotten to comment out the Normalize values from my larger dataset, so the statistics were being computed on data that was already normalized. A silly mistake, sorry.

data_transforms = {
    'train': transforms.Compose([
        #transforms.RandomResizedCrop(input_size),
        transforms.Resize((input_size, input_size)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        #transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize((input_size, input_size)),
        #transforms.CenterCrop(input_size),
        transforms.ToTensor(),
        #transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    
    'test': transforms.Compose([
        transforms.Resize((input_size, input_size)),
        #transforms.CenterCrop(input_size),
        transforms.ToTensor(),
        #transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
}
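
The near-zero means and near-one stds in the first run are what one would expect when statistics are computed on data that has already passed through Normalize. The sketch below illustrates this with synthetic tensors (the data and variable names are assumptions for illustration): statistics from a larger set are used to normalize a subset, and the subset's statistics are then recomputed.

```python
import torch

torch.manual_seed(0)
# Synthetic stand-in for ToTensor() output: values in [0, 1], shape (N, C, H, W)
full_images = torch.rand(200, 3, 16, 16)
subset_images = full_images[:50]

def channel_stats(images):
    """Per-channel mean and std over batch, height and width."""
    mean = images.mean(dim=[0, 2, 3])
    std = (images.pow(2).mean(dim=[0, 2, 3]) - mean ** 2).sqrt()
    return mean, std

# Statistics of the larger set, as hard-coded into Normalize in the first run
full_mean, full_std = channel_stats(full_images)

# What transforms.Normalize(mean, std) does per channel: (x - mean) / std
normalized_subset = (subset_images - full_mean.view(1, 3, 1, 1)) / full_std.view(1, 3, 1, 1)

# Recomputing statistics on the already-normalized subset
norm_mean, norm_std = channel_stats(normalized_subset)
print(norm_mean)  # small values around 0; the sign can be negative
print(norm_std)   # values close to 1
```

Because the subset differs slightly from the set the statistics came from, the recomputed means land near zero on either side, which is why negative values are not a sign of a bug in get_mean_std itself.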

Mona_Jalal · March 29, 2022, 2:37am

Could you please confirm whether I should use these values for Normalize, i.e. the train mean and std for all three splits?

data_transforms = {
    'train': transforms.Compose([
        #transforms.RandomResizedCrop(input_size),
        transforms.Resize((input_size, input_size)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        #transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        transforms.Normalize([0.7057, 0.5569, 0.6816], [0.2047, 0.2554, 0.1914])
    ]),
    'val': transforms.Compose([
        transforms.Resize((input_size, input_size)),
        #transforms.CenterCrop(input_size),
        transforms.ToTensor(),
        #transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        transforms.Normalize([0.7057, 0.5569, 0.6816], [0.2047, 0.2554, 0.1914])
    ]),
    
    'test': transforms.Compose([
        transforms.Resize((input_size, input_size)),
        #transforms.CenterCrop(input_size),
        transforms.ToTensor(),
        #transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        transforms.Normalize([0.7057, 0.5569, 0.6816], [0.2047, 0.2554, 0.1914])
    ])
}

ptrblck · March 29, 2022, 4:49am

The idea is that the val and test sets are “new” or “unseen” data.
If you are using these splits to calculate any stats or use them in any other way, they are not new anymore as you are leaking the data information into your training.
This is also why both a validation and a test set are needed.
Even though you are not training with the validation dataset, you can still use it for e.g. early stopping, which uses some information about the validation data for the training.
The same applies to the normalization. Think about the real use case of deploying the model, where you would usually not have access to the new data that will be passed to your model.
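
A minimal sketch of this workflow, assuming synthetic tensors in place of real image splits (the helper names channel_stats and normalize are hypothetical): the statistics are computed once on the training split and then reused unchanged for validation data, just as they would be for unseen data at deployment.

```python
import torch

torch.manual_seed(0)
# Synthetic stand-ins for already-loaded image tensors of shape (N, C, H, W) in [0, 1]
train_images = torch.rand(100, 3, 8, 8)
val_images = torch.rand(20, 3, 8, 8)

def channel_stats(images):
    """Per-channel mean and std over batch, height and width."""
    mean = images.mean(dim=[0, 2, 3])
    std = (images.pow(2).mean(dim=[0, 2, 3]) - mean ** 2).sqrt()
    return mean, std

def normalize(images, mean, std):
    """Same arithmetic as transforms.Normalize: (x - mean) / std per channel."""
    return (images - mean.view(1, -1, 1, 1)) / std.view(1, -1, 1, 1)

# Statistics come from the training split only
train_mean, train_std = channel_stats(train_images)

# The same training statistics are applied to every split
train_norm = normalize(train_images, train_mean, train_std)
val_norm = normalize(val_images, train_mean, train_std)

print(channel_stats(train_norm))  # mean close to 0, std close to 1 by construction
print(channel_stats(val_norm))    # close to, but not exactly, 0 and 1
```

The validation statistics after normalization are not exactly 0 and 1, and that is expected: the model sees validation data through the same fixed preprocessing it will apply to any new input.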

Mona_Jalal · March 29, 2022, 9:40pm

Thank you so much for your explanation. This is very helpful.