What is the correct way to use torchvision.io.video.read_video with transforms.Normalize?

I load a video and then process it frame by frame:

from torchvision.io.video import read_video
v, _, _ = read_video(video_path, pts_unit='sec')

Because I run each frame of the video through a model trained on images, I need to normalize the frames.

from torchvision import transforms

transform = transforms.Compose([
                transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225]),
            ])
frame = transform(frame)

However, since read_video returns uint8 tensors, Normalize fails:

ValueError: std evaluated to zero after conversion to torch.uint8, leading to division by zero.

So I cast the frame to float first:

transform = transforms.Compose([
                transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225]),
            ])
frame = frame.float()
frame = transform(frame)

Is casting the type like this the correct way to handle the video case?

For the image case, I load with PIL and use transforms.ToTensor(), so I don’t have to worry about integer types.

Wouldn’t the ToTensor() transformation also work for your uint8 video frames, or are you seeing another error?
Just casting to float() sounds wrong, as the values would still be in the range [0, 255], while you are using mean and std stats for inputs in the range [0, 1], which is the value range you get after applying ToTensor() to uint8 inputs.