These are the values used to normalize the Kinetics dataset in the official torchvision video classification script:
```python
print("torchvision version: ", torchvision.__version__)
device = torch.device(args.device)
torch.backends.cudnn.benchmark = True

# Data loading code
print("Loading data")
traindir = os.path.join(args.data_path, args.train_dir)
valdir = os.path.join(args.data_path, args.val_dir)
normalize = T.Normalize(mean=[0.43216, 0.394666, 0.37645],
                        std=[0.22803, 0.22145, 0.216989])

print("Loading training data")
st = time.time()
cache_path = _get_cache_path(traindir)
transform_train = torchvision.transforms.Compose([
    ConvertBHWCtoBCHW(),
    T.ConvertImageDtype(torch.float32),
    T.Resize((128, 171)),
    T.RandomHorizontalFlip(),
```
In the torchvision docs, it says that "All pre-trained models expect input images normalized in the same way," with this normalization transform:
```python
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
```
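For context, `T.Normalize` simply applies `output = (input - mean) / std` to each RGB channel independently. A minimal plain-Python sketch of that operation, using the ImageNet values from the quote above (the helper name `normalize_pixel` is mine, not torchvision's):

```python
# ImageNet per-channel stats, as quoted in the torchvision docs.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Normalize a single RGB pixel (floats in [0, 1]) channel-wise,
    mimicking what T.Normalize does element-wise on a whole tensor."""
    return [(c - m) / s for c, m, s in zip(rgb, mean, std)]

# A pixel sitting exactly at the dataset mean maps to zero in every channel:
print(normalize_pixel([0.485, 0.456, 0.406]))  # -> [0.0, 0.0, 0.0]
```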
My use case is to use a pretrained ResNet from torchvision to extract features from Kinetics video frames.
So which normalization should I choose?
And where do these values come from?
I guess the stats used in the linked script were calculated from the Kinetics dataset, while the pretrained (classification) models were trained on ImageNet using the stats mentioned in the docs. Since you are feeding the frames to an ImageNet-pretrained ResNet, you should normalize them with the ImageNet stats the model was trained with.
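As for where such values come from: stats like these can be reproduced in principle by computing the per-channel mean and standard deviation over all training pixels. A rough sketch with plain Python lists standing in for frames (the helper `channel_stats` and the toy data are mine, not from the torchvision script, which operates on tensors):

```python
import math

def channel_stats(frames):
    """Compute per-channel (mean, std) over a list of frames.

    Each frame is a list of RGB pixels, each pixel a list of
    three floats in [0, 1]. Returns (means, stds) per channel.
    """
    means, stds = [], []
    for ch in range(3):
        # Gather every value of this channel across all frames.
        values = [px[ch] for frame in frames for px in frame]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        means.append(mean)
        stds.append(math.sqrt(var))
    return means, stds

# Toy example: two frames of two pixels each.
frames = [
    [[0.2, 0.4, 0.6], [0.4, 0.4, 0.6]],
    [[0.2, 0.4, 0.6], [0.4, 0.4, 0.6]],
]
means, stds = channel_stats(frames)
print(means)  # close to [0.3, 0.4, 0.6]
```

Run over the whole Kinetics training set, this kind of computation would yield values like the `[0.43216, 0.394666, 0.37645]` in the script; run over ImageNet, values like `[0.485, 0.456, 0.406]`.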