Video Classification using UCF-101 dataset

yakhyo · July 4, 2022, 5:09am

I am trying to use the video classification models from torchvision. The official reference code uses the Kinetics dataset; however, when I switch to the UCF-101 dataset I get the runtime error shown below. Link to the training code:

github.com/pytorch/vision/blob/main/references/video_classification/train.py

import datetime
import os
import time
import warnings

import presets
import torch
import torch.utils.data
import torchvision
import torchvision.datasets.video_utils
import utils
from torch import nn
from torch.utils.data.dataloader import default_collate
from torchvision.datasets.samplers import DistributedSampler, UniformClipSampler, RandomClipSampler


def train_one_epoch(model, criterion, optimizer, lr_scheduler, data_loader, device, epoch, print_freq, scaler=None):
    model.train()
    metric_logger = utils.MetricLogger(delimiter="  ")
    metric_logger.add_meter("lr", utils.SmoothedValue(window_size=1, fmt="{value}"))


Thanks in advance.

torch              1.12.0+cu113
torchaudio         0.12.0+cu113
torchvision        0.13.0+cu113

Error:

Traceback (most recent call last):
  File "/home/pyler/PycharmProjects/res-ufc101/train.py", line 388, in <module>
    main(args)
  File "/home/pyler/PycharmProjects/res-ufc101/train.py", line 287, in main
    train_one_epoch(model, criterion, optimizer, lr_scheduler, data_loader, device, epoch, args.print_freq, scaler)
  File "/home/pyler/PycharmProjects/res-ufc101/train.py", line 24, in train_one_epoch
    for video, target in metric_logger.log_every(data_loader, print_freq, header):
  File "/home/pyler/PycharmProjects/res-ufc101/utils.py", line 127, in log_every
    for obj in iterable:
  File "/home/pyler/Python/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 652, in __next__
    data = self._next_data()
  File "/home/pyler/Python/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1347, in _next_data
    return self._process_data(data)
  File "/home/pyler/Python/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1373, in _process_data
    data.reraise()
  File "/home/pyler/Python/envs/torch/lib/python3.8/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/pyler/Python/envs/torch/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/pyler/Python/envs/torch/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/pyler/Python/envs/torch/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/pyler/Python/envs/torch/lib/python3.8/site-packages/torchvision/datasets/ucf101.py", line 128, in __getitem__
    video = self.transform(video)
  File "/home/pyler/PycharmProjects/res-ufc101/presets.py", line 26, in __call__
    return self.transforms(x)
  File "/home/pyler/Python/envs/torch/lib/python3.8/site-packages/torchvision/transforms/transforms.py", line 94, in __call__
    img = t(img)
  File "/home/pyler/Python/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pyler/Python/envs/torch/lib/python3.8/site-packages/torchvision/transforms/transforms.py", line 269, in forward
    return F.normalize(tensor, self.mean, self.std, self.inplace)
  File "/home/pyler/Python/envs/torch/lib/python3.8/site-packages/torchvision/transforms/functional.py", line 360, in normalize
    return F_t.normalize(tensor, mean=mean, std=std, inplace=inplace)
  File "/home/pyler/Python/envs/torch/lib/python3.8/site-packages/torchvision/transforms/functional_tensor.py", line 959, in normalize
    tensor.sub_(mean).div_(std)
RuntimeError: The size of tensor a (240) must match the size of tensor b (3) at non-singleton dimension 1

ptrblck · July 4, 2022, 8:08pm

Based on this error:

RuntimeError: The size of tensor a (240) must match the size of tensor b (3) at non-singleton dimension 1

I would guess you might be passing the input tensors in a channels-last format, while channels-first [batch_size, channels, height, width] is expected. Could you check if this is the case?
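To illustrate the guess above, here is a minimal sketch of how a channels-last clip can trigger this kind of size mismatch. It assumes per-channel normalization reshapes its statistics to [C, 1, 1] before broadcasting; `normalize_like` is a hypothetical stand-in written for this example, not the torchvision function, and the mean/std values are placeholders.

```python
import torch


def normalize_like(clip, mean, std):
    # Hypothetical helper: reshape the per-channel statistics to [C, 1, 1]
    # so they broadcast over a channels-first [..., C, H, W] tensor.
    mean = torch.as_tensor(mean).view(-1, 1, 1)
    std = torch.as_tensor(std).view(-1, 1, 1)
    return (clip - mean) / std


# Placeholder statistics, chosen only for illustration.
mean, std = [0.5, 0.5, 0.5], [0.25, 0.25, 0.25]

# A clip in [T, H, W, C] layout: 16 frames of 240x320 RGB.
clip_thwc = torch.rand(16, 240, 320, 3)

try:
    normalize_like(clip_thwc, mean, std)
    error = None
except RuntimeError as exc:
    # Dimension 1 holds H=240 here, which cannot broadcast against C=3.
    error = str(exc)

# Moving channels to dimension 1 gives [T, C, H, W], which broadcasts cleanly.
clip_tchw = clip_thwc.permute(0, 3, 1, 2)
normalized = normalize_like(clip_tchw, mean, std)

print(error)
print(normalized.shape)
```

The failing call compares the height dimension against the three channel statistics, which is why the reported sizes are 240 and 3.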

yakhyo · July 5, 2022, 2:23am

Thanks @ptrblck for your answer, but it did not resolve my issue.
I only changed the dataset and some of its parameters, and the same problem still exists. Original code for the Kinetics dataset:

        dataset = torchvision.datasets.Kinetics(
            args.data_path,
            frames_per_clip=args.clip_len,
            num_classes=args.kinetics_version,
            split="train",
            step_between_clips=1,
            transform=transform_train,
            frame_rate=args.frame_rate,
            extensions=(
                "avi",
                "mp4",
            ),
            output_format="TCHW",
        )

I changed it to the following:

        dataset = torchvision.datasets.UCF101(
            args.data_path,
            annotation_path='../../Datasets/UCF-101/annotations',
            frames_per_clip=args.clip_len,
            train=True,
            step_between_clips=1,
            transform=transform_train,
            frame_rate=args.frame_rate,
            output_format="TCHW",
        )

ptrblck · July 5, 2022, 6:32am

Print the shape of x in line 26 in /home/pyler/PycharmProjects/res-ufc101/presets.py and check what it’s returning. I would still guess that the memory format might be wrong and thus the transformation fails.
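One way to do this check without editing presets.py line by line is a small pass-through transform that logs what it receives. This is only a sketch: `PrintShape` is a hypothetical helper name, and the zero-filled clip below stands in for a real UCF-101 sample.

```python
import torch


class PrintShape:
    """Hypothetical debugging transform: logs the input and returns it unchanged."""

    def __init__(self, tag):
        self.tag = tag

    def __call__(self, x):
        print(f"{self.tag}: shape={tuple(x.shape)} dtype={x.dtype}")
        return x


# Stand-in for a clip as the dataset might hand it to the transform.
clip = torch.zeros(16, 240, 320, 3, dtype=torch.uint8)

probe = PrintShape("input to transforms")
same = probe(clip)
```

Placing such a probe as the first entry of the transform list shows the layout the dataset actually delivers before any other transform runs.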

yakhyo · July 5, 2022, 9:45am

Yes, the problem comes from that transformation: x = {Tensor: (16, 240, 320, 3)}. However, I specified output_format as TCHW when creating the dataset, and it still did not work.

ptrblck · July 5, 2022, 4:17pm

I don’t think the output_format would fix the issue, as the transformation is expected to work on [T, H, W, C] frames as seen in the docs:

transform (callable , optional) – A function/transform that takes in a TxHxWxC video and returns a transformed version.

If your transformation doesn’t support it, you could permute the data inside the transform via:

...
transforms.Lambda(lambda x: x.permute(0, 3, 1, 2)),
...

yakhyo · July 6, 2022, 6:05am

Thank you so much, it worked for me.