Choice of mean and standard deviation when normalizing the dataset

Problem

Normalizing the dataset is a common preprocessing step in machine learning before any downstream task. However, in the PyTorch tutorial I was following, the author uses mysterious mean and standard deviation values for the different channels of RGB images, i.e. 0.485, 0.456 and 0.406 for the means and 0.229, 0.224 and 0.225 for the standard deviations (see code below).

I am not sure where these somewhat magic numbers come from. Are they statistics of the training dataset that are not disclosed in the tutorial? Or are they heuristics that have proven effective in the computer vision literature?

# Data augmentation and normalization for training
# Just normalization for validation
from torchvision import transforms

data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])  # per-channel mean and std
    ]),
    'val': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}

They are ImageNet statistics. Ideally, you should compute the statistics for each dataset you use.
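Here is a minimal sketch of how one might compute those per-channel statistics with torchvision, assuming an ImageFolder-style dataset; the folder path, crop size, and batch size below are placeholders, not something from the tutorial:

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Placeholder dataset: any ImageFolder-style directory of training images.
dataset = datasets.ImageFolder(
    'path/to/train',
    transform=transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),  # scales pixel values to [0, 1]
    ]))
loader = DataLoader(dataset, batch_size=64, num_workers=4)

# Accumulate per-channel sums of x and x^2 over every pixel,
# then derive the mean and std from them.
n_pixels = 0
channel_sum = torch.zeros(3)
channel_sq_sum = torch.zeros(3)
for images, _ in loader:  # images has shape (B, 3, H, W)
    n_pixels += images.numel() / 3
    channel_sum += images.sum(dim=[0, 2, 3])
    channel_sq_sum += (images ** 2).sum(dim=[0, 2, 3])

mean = channel_sum / n_pixels
std = (channel_sq_sum / n_pixels - mean ** 2).sqrt()
print(mean.tolist(), std.tolist())  # pass these to transforms.Normalize

This uses the identity Var[x] = E[x^2] - E[x]^2 so the whole dataset does not have to fit in memory at once.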

Thank you for your timely response!

I found the origin of those magic numbers in the torchvision.models documentation, which, as you said, gives the statistics of ImageNet:

All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. The images have to be loaded into a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].

However, it also says that anyone who wants to use those pretrained models must normalize their images in this way. I think this makes sense, since those models were trained on data normalized with that specific mean and std.
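For completeness, here is a small sketch of applying exactly that normalization before feeding an image to a pretrained torchvision model; the choice of ResNet-18 and the image path are just placeholder assumptions:

import torch
from PIL import Image
from torchvision import models, transforms

# The documented ImageNet preprocessing: scale to [0, 1] with ToTensor,
# then Normalize with the ImageNet per-channel mean and std.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

model = models.resnet18(pretrained=True)  # any pretrained torchvision model
model.eval()

img = Image.open('example.jpg').convert('RGB')  # placeholder image path
batch = preprocess(img).unsqueeze(0)            # shape (1, 3, 224, 224)
with torch.no_grad():
    logits = model(batch)
print(logits.argmax(dim=1).item())              # predicted ImageNet class index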