Is there an off-the-shelf video data loader to use?


Hi all,

I was wondering whether there is an off-the-shelf video data loader to use. I mean, for a dataset structured like this:

A root folder contains multiple class directories, each class directory contains multiple video clip directories, and each clip directory contains a sequence of consecutive frames.

This dataset structure is quite common among academic video datasets. If there is no such data loader, could anyone point out which files I should look at and modify? Thank you very much.


You can copy over this file and change the image loading to video loading:

(Chih-Yao Ma) #3

As @smth pointed out, you should start with

Below are the major steps that I think you might need to go through.

  1. make_dataset()
    You probably need to make some changes to the make_dataset() function, which basically lists all the video frame files. If you need to group a number of video frames and tie them to a video, you probably want to do that here (when listing the video frames).
  2. __getitem__
    The second part is how you collect the items. This can be straightforward or quite complicated. The easiest way is: given a video id (obtained by using the index and the list you collected in step one), grab all the video frames associated with that particular video id. Note that done this way, your data loading time will be considerably longer, and your system memory might explode, because you are loading tens or hundreds of images per video into system memory before you pass them to the GPU for training.
  3. YourOwn_collate_fn()
    By default, the items collected in __getitem__ are combined into a batch by default_collate. This might no longer work when each item is a list or stack of video frames. The easiest way I can think of is to create your own collate function that reformats the data you collected so that you can call default_collate inside it.
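
The first two steps above could look roughly like the sketch below. It is only an illustration under some assumptions: frames are stored as image files under root/class/clip/, file names sort in temporal order (for example zero-padded numbers), Pillow is used to read images, and the names `FrameFolderDataset` and `clip_len` are made up for this example. To keep memory bounded it samples a fixed number of frames per clip instead of loading every frame.

```python
import os

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset

IMG_EXTENSIONS = (".jpg", ".jpeg", ".png")


def make_dataset(root):
    """Return ([(frame_paths, class_index), ...], class_names) for root/class/clip/."""
    classes = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    class_to_idx = {name: i for i, name in enumerate(classes)}
    clips = []
    for name in classes:
        class_dir = os.path.join(root, name)
        for clip in sorted(os.listdir(class_dir)):
            clip_dir = os.path.join(class_dir, clip)
            if not os.path.isdir(clip_dir):
                continue
            # Assumption: lexicographic order of file names equals temporal order.
            frames = sorted(
                os.path.join(clip_dir, f)
                for f in os.listdir(clip_dir)
                if f.lower().endswith(IMG_EXTENSIONS)
            )
            if frames:
                clips.append((frames, class_to_idx[name]))
    return clips, classes


class FrameFolderDataset(Dataset):
    """Hypothetical dataset: one item is a fixed-length stack of frames plus a label."""

    def __init__(self, root, clip_len=16, transform=None):
        self.clips, self.classes = make_dataset(root)
        self.clip_len = clip_len
        self.transform = transform  # applied per frame, should return a tensor

    def __len__(self):
        return len(self.clips)

    def __getitem__(self, index):
        frame_paths, label = self.clips[index]
        # Pick clip_len evenly spaced frames; clips shorter than clip_len repeat frames.
        picks = np.linspace(0, len(frame_paths) - 1, self.clip_len).round().astype(int)
        frames = []
        for i in picks:
            with Image.open(frame_paths[i]) as img:
                img = img.convert("RGB")
            if self.transform is not None:
                frame = self.transform(img)
            else:
                # uint8 tensor laid out as (channels, height, width)
                frame = torch.from_numpy(np.asarray(img).copy()).permute(2, 0, 1)
            frames.append(frame)
        return torch.stack(frames), label  # (clip_len, C, H, W)
```

Because every item has the same number of frames, default_collate can stack the items as long as all frames share one resolution (or the transform resizes them). If you return variable-length clips instead, that is where step 3 comes in.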

These are just some of my thoughts. I am also very interested in having an off-the-shelf video dataloader, along with some example code that people can refer to.

(Pipehappy1) #4

Any update on this one? It would be nice to have a built-in video data loader instead of converting them into images first…

(Kiran Vaidhya) #5

Are there any updates on this? Quick data reading is essential for video processing.

(Ross Wightman) #6

It would certainly be a nice-to-have, but it’s actually quite challenging to do this right and do it efficiently, so that you’re not spinning 16 cores at 100% to feed one model and consuming 100s of GB of memory. I have a lot of experience storing and streaming video with other synchronized streams in a past life, and it’s a rather big pain in the ass to say the least.

You would ideally want to leverage the video decoders on the Nvidia card, and even more ideally keep the decoded buffers on the card, with a zero-copy path to Tensors. It looks like Nvidia has done the basics of something like that with their DeepStream SDK, but I haven’t looked closely enough to see the details. They may have taken shortcuts, given that they claim to be feeding standard image models that don’t have a time axis.

Aside from leveraging hardware decoding, the timestamping of video and of corresponding streams that you may want to keep in sync, such as audio, text, and other metadata, can be quite challenging. At the codec level, advanced video codecs store their frames out of order, and the amount of video you have to decode to get a sequence of frames in the correct order varies depending on the codec and the chosen encode parameters. It can be quite significant, which could cause problems with buffering and memory usage. Most surveillance systems get around this by enforcing maximum keyframe intervals based on a user’s desired tradeoff between latency and storage usage, but typical media from streaming sources or discs (Blu-rays, Netflix, etc.) have very long keyframe intervals to reduce bandwidth, since random access is minimal and people are okay waiting a few seconds for the stream to resume after changing position. If you are doing this constantly to get fixed time windows, it would be a huge overhead.
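
To put rough numbers on that overhead, here is a back-of-the-envelope sketch. It makes simplifying assumptions that real codecs do not guarantee: keyframes sit at every multiple of a fixed interval, decoding has to start at the keyframe at or before the first wanted frame, and frame reordering is ignored. The function names and the intervals of 30 and 250 frames are illustrative only.

```python
def frames_to_decode(start, length, keyframe_interval):
    """Frames a decoder must process to return frames [start, start + length).

    Assumes keyframes at every multiple of keyframe_interval and that decoding
    begins at the last keyframe at or before `start` (frame reordering ignored).
    """
    keyframe = (start // keyframe_interval) * keyframe_interval
    return start + length - keyframe


def average_overhead(length, keyframe_interval):
    """Mean ratio of decoded frames to wanted frames over all start offsets."""
    total = sum(
        frames_to_decode(s, length, keyframe_interval)
        for s in range(keyframe_interval)
    )
    return total / (keyframe_interval * length)


# A 16-frame window starting at frame 130:
print(frames_to_decode(130, 16, keyframe_interval=30))   # 26 frames decoded
print(frames_to_decode(130, 16, keyframe_interval=250))  # 146 frames decoded
print(average_overhead(16, keyframe_interval=250))       # 8.78125
```

Under these assumptions, sampling random 16-frame windows from a stream with a 250-frame keyframe interval decodes almost nine times as many frames as are actually used, versus roughly twice as many with a 30-frame interval.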

Then you’d have to figure out how to represent your targets. What are they? How are they represented relative to a stream of images with a time axis? I assume there would be a lot of variety and different wants here.

And I thought about all this because I looked at doing it, but it will have to wait until a possible project is viable that requires this as it is a significant undertaking :slight_smile:

(Md Asif Jalal) #7

I am trying to make a video dataloader following the approach above, but I am getting this error:

Traceback (most recent call last):
File “”, line 59, in
for batch_idx, (data,target) in enumerate(train_loader):
File “/home/acp15maj/.conda/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/”, line 201, in next
return self._process_next_batch(batch)
File “/home/acp15maj/.conda/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/”, line 221, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
File “/home/acp15maj/.conda/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/”, line 40, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File “/home/acp15maj/.conda/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/”, line 109, in default_collate
return [default_collate(samples) for samples in transposed]
File “/home/acp15maj/.conda/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/”, line 109, in
return [default_collate(samples) for samples in transposed]
File “/home/acp15maj/.conda/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/”, line 95, in default_collate
return torch.stack([torch.from_numpy(b) for b in batch], 0)
File “/home/acp15maj/.conda/envs/pytorch/lib/python3.5/site-packages/torch/”, line 64, in stack
return, dim)
RuntimeError: inconsistent tensor sizes at /opt/conda/conda-bld/pytorch_1503968623488/work/torch/lib/TH/generic/THTensorMath.c:2709

My code is

class CustomDataset(Dataset):
    def __init__(self,csv_path,transform=None):
        # TODO
        # 1. Initialize file path or list of file names.
        # full file path is given in the .csv file
        temp_df = pd.read_csv(csv_path)

    def __getitem__(self, index):
        # numpy array of video file (50,3,200,112) shape
        if self.transform is not None:
        return video_tensor,target_labels

    def __len__(self):
        return len(self.X_train.index)

if __name__ == '__main__':
    # Then, you can just use prebuilt torch's data loader.
    custom_dataset = CustomDataset('trainlist01.csv',transform=None)
    train_loader =,
    for batch_idx, (data,target) in enumerate(train_loader):
        data, target = data.cuda(async=True), target.cuda(async=True)

Please pardon me if it is a silly question. :slightly_smiling_face:
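
The `inconsistent tensor sizes` error in the traceback comes from `torch.stack` inside `default_collate`, which requires every sample in a batch to have exactly the same shape. One plausible cause, assuming not every video really has 50 frames, is that `__getitem__` returns arrays with different frame counts. Below is a minimal sketch of a custom collate function that zero-pads each video to the longest one in the batch and then hands over to `default_collate`. The name `pad_collate` is made up for this example, and it only handles differing frame counts, not differing image sizes.

```python
import torch
from torch.utils.data.dataloader import default_collate


def pad_collate(batch):
    """Collate (video, label) pairs whose videos differ in frame count.

    Each video has shape (frames, C, H, W). Shorter videos are zero-padded at
    the end, and the original frame counts are returned as a third element.
    """
    max_len = max(video.shape[0] for video, _ in batch)
    padded = []
    for video, label in batch:
        video = torch.as_tensor(video)  # accepts numpy arrays as well as tensors
        n = video.shape[0]
        if n < max_len:
            pad = video.new_zeros((max_len - n,) + tuple(video.shape[1:]))
            video = torch.cat([video, pad], dim=0)
        padded.append((video, label, n))
    # All videos now share one shape, so default_collate can stack them.
    return default_collate(padded)
```

It would be passed to the loader as `DataLoader(custom_dataset, batch_size=4, collate_fn=pad_collate)` (the batch size here is arbitrary), and each batch then unpacks as `videos, labels, lengths`. Printing the shape of a few samples from `custom_dataset` first is a quick way to check whether differing shapes are actually the problem.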