Distributed DataLoader


I want to do some heavy data augmentation during training, which slows training down significantly: even 20-24 workers cannot prepare batches of size 32 fast enough to keep the GPU from stalling. I am thinking about an architecture where the data is prepared continuously on several worker nodes and then fed to the machine with the GPUs for training. Are there any existing frameworks that allow something like this?
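To make the idea concrete, here is a minimal sketch of the producer/consumer pattern I have in mind. It is only an in-process stand-in: threads and a bounded `queue.Queue` play the role of the remote worker nodes and the network transport, which a real setup would have to replace with something that crosses machine boundaries. All names here (`augment`, `producer`, `consume`, `run`, `max_prefetch`) are hypothetical and chosen for illustration; none of them come from an existing framework.

```python
import queue
import threading

SENTINEL = None  # pushed by each producer once it has no more batches


def augment(sample):
    # Hypothetical stand-in for a heavy augmentation step.
    return sample * 2


def producer(worker_id, num_batches, batch_size, out_queue):
    """Prepare batches continuously and push them onto the shared queue."""
    for b in range(num_batches):
        start = (worker_id * num_batches + b) * batch_size
        batch = [augment(x) for x in range(start, start + batch_size)]
        out_queue.put(batch)  # blocks while the queue is full (backpressure)
    out_queue.put(SENTINEL)


def consume(out_queue, num_producers):
    """Trainer side: pull ready batches until every producer has finished."""
    finished = 0
    batches = []
    while finished < num_producers:
        item = out_queue.get()
        if item is SENTINEL:
            finished += 1
        else:
            batches.append(item)  # a real trainer would run a training step here
    return batches


def run(num_producers=2, num_batches=3, batch_size=4, max_prefetch=8):
    # The bounded queue caps how many prepared batches may wait in memory.
    q = queue.Queue(maxsize=max_prefetch)
    threads = [
        threading.Thread(target=producer, args=(i, num_batches, batch_size, q))
        for i in range(num_producers)
    ]
    for t in threads:
        t.start()
    batches = consume(q, num_producers)
    for t in threads:
        t.join()
    return batches
```

Calling `run()` returns every prepared batch in whatever order the producers finished them. The bounded queue is the important design choice: it lets producers run ahead of the trainer without unbounded memory growth.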

I have been wondering the same thing. Did you find a solution?

I’m also looking for something like this.

From the docs:

Most use cases involving batched inputs and multiple GPUs should default to using DataParallel to utilize more than one GPU. Even with the GIL, a single Python process can saturate multiple GPUs.

As of version 0.1.9, large numbers of GPUs (8+) might not be fully utilized. However, this is a known issue that is under active development. As always, test your use case.

That’s essentially what’s happening for me. I have a computer with many GPUs, and the CPU becomes the bottleneck.