Can I set num_workers in torchvision.datasets.Kinetics400?

Hi,
I am trying to pre-compute clips for the Kinetics dataset.
I tried it with:

dataset = torchvision.datasets.Kinetics400(
            traindir,
            frames_per_clip=250,
            step_between_clips=200,
            transform=transform_train,
            frame_rate=30,
            extensions=('avi', 'mp4', )
        )

But it took ~160 hours.
I wonder if I can set num_workers, as with DataLoaders, to speed it up.
Thanks.

You should be able to pass this dataset to a DataLoader and set the number of workers there, which would use multiple processes to create the batches. Or are you seeing any errors with it?

Hi, thanks for the reply. No, there are no errors. I understand what you mean. The problem for me is that this step of creating the torchvision.datasets.Kinetics400 instance (before feeding it to the DataLoader) takes too long and exceeds the time limit on my computing cluster. But I just realized I can add num_workers, like

dataset = torchvision.datasets.Kinetics400(
            traindir,
            frames_per_clip=250,
            step_between_clips=200,
            transform=transform_train,
            frame_rate=30,
            extensions=('avi', 'mp4', ),
            num_workers=4
        )

And it seems to run smoothly, so I am testing whether it helps.
But I still wonder if it makes sense to add the num_workers argument here.

Yes, it seems you are right, and you don’t need to wrap this Dataset in a DataLoader.

While the num_workers argument is shown in the docs, it’s unfortunately not explained in the parameters.

Inside the Kinetics400 dataset, a VideoClips object will be created, which accepts the num_workers argument as seen here.
Internally a DataLoader is created using the num_workers argument as seen here.

So it seems this “Dataset” inverts the usual logic of passing a Dataset instance into a DataLoader and instead uses a DataLoader internally.
I’m not sure if this approach is used for all video datasets.

I understand, thank you very much.