These are the values used to normalize the Kinetics dataset in the official torchvision video classification script:
```python
print("torchvision version: ", torchvision.__version__)
device = torch.device(args.device)
torch.backends.cudnn.benchmark = True

# Data loading code
print("Loading data")
traindir = os.path.join(args.data_path, args.train_dir)
valdir = os.path.join(args.data_path, args.val_dir)
normalize = T.Normalize(mean=[0.43216, 0.394666, 0.37645],
                        std=[0.22803, 0.22145, 0.216989])

print("Loading training data")
st = time.time()
cache_path = _get_cache_path(traindir)
transform_train = torchvision.transforms.Compose([
    ConvertBHWCtoBCHW(),
    T.ConvertImageDtype(torch.float32),
    T.Resize((128, 171)),
    T.RandomHorizontalFlip(),
```
In the torchvision docs, it says that "All pre-trained models expect input images normalized in the same way," with this normalization transform:
```python
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
```
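For context, `T.Normalize` simply applies `output = (input - mean) / std` to each RGB channel independently. A minimal plain-Python sketch of that operation, using the ImageNet values from the quote above (the helper name `normalize_pixel` is mine, not torchvision's):

```python
# ImageNet per-channel stats, as quoted in the torchvision docs.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Normalize a single RGB pixel (floats in [0, 1]) channel-wise,
    mimicking what T.Normalize does element-wise on a whole tensor."""
    return [(c - m) / s for c, m, s in zip(rgb, mean, std)]

# A pixel sitting exactly at the dataset mean maps to zero in every channel:
print(normalize_pixel([0.485, 0.456, 0.406]))  # -> [0.0, 0.0, 0.0]
```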
My use case is to use a pretrained ResNet from torchvision to extract features from Kinetics video frames.
So which normalization should I choose?
And where do these values come from?
I guess the stats used in the linked script were calculated from the Kinetics dataset, while the pretrained (classification) models were trained on ImageNet using the stats mentioned in the docs. Since you are feeding the frames to an ImageNet-pretrained ResNet, you should normalize them with the ImageNet stats the model was trained with.
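As for where such values come from: stats like these can be reproduced in principle by computing the per-channel mean and standard deviation over all training pixels. A rough sketch with plain Python lists standing in for frames (the helper `channel_stats` and the toy data are mine, not from the torchvision script, which operates on tensors):

```python
import math

def channel_stats(frames):
    """Compute per-channel (mean, std) over a list of frames.

    Each frame is a list of RGB pixels, each pixel a list of
    three floats in [0, 1]. Returns (means, stds) per channel.
    """
    means, stds = [], []
    for ch in range(3):
        # Gather every value of this channel across all frames.
        values = [px[ch] for frame in frames for px in frame]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        means.append(mean)
        stds.append(math.sqrt(var))
    return means, stds

# Toy example: two frames of two pixels each.
frames = [
    [[0.2, 0.4, 0.6], [0.4, 0.4, 0.6]],
    [[0.2, 0.4, 0.6], [0.4, 0.4, 0.6]],
]
means, stds = channel_stats(frames)
print(means)  # close to [0.3, 0.4, 0.6]
```

Run over the whole Kinetics training set, this kind of computation would yield values like the `[0.43216, 0.394666, 0.37645]` in the script; run over ImageNet, values like `[0.485, 0.456, 0.406]`.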