Running CUDA in a dataloader is slow


I'm facing a problem I can't solve and am looking for any advice here…

In my use case, I have a special model M that processes the input images inside the dataloader. The model is quite large, so it needs to run on the GPU to be fast enough.
I see two options:

  1. The straightforward one: use only the main thread of the dataloader. (It works, but it is very slow.)
  2. Run the dataloader with multiple workers, calling torch.multiprocessing.set_start_method("spawn") so that the child processes can acquire CUDA. (15 times slower than 1…)

Execution time is critical in my case, so I keep looking for a faster implementation.
I originally believed that 2. would be faster.
But in reality, it runs 15 times slower than using the main thread only.
It seems that the “spawn” start method significantly slows down child process creation, and the dataloader does not reuse the child processes it has already created (I can see the GPU memory usage going up and down).

Any suggestion would be a great help!
Thank you very much!

I also tried M.share_memory() to share the model M among the child processes, but it does not seem to affect the execution speed at all.

Yeah, I would avoid (2). Accessing the GPU from dataloader workers is the path to ruin.

You don’t have to do all your preprocessing in the dataloader. For example, you can do your file load and CPU pre-processing in the data loader, but do the GPU operations afterwards:

for sample_cpu in dataloader:
   sample_gpu = preprocess_gpu(sample_cpu)
   train(sample_gpu) # or whatever

In general, the way to make GPU operations fast is:

  1. batch operations
  2. avoid CPU-GPU synchronizations
  3. make sure the underlying ops are efficient
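
To make the three points above concrete, here is a minimal sketch of the split I mean. Everything in it is a stand-in: CpuOnlyDataset is a hypothetical dataset that only does CPU work, the small Conv2d stands in for your large model M, and the shapes and batch size are made up. The dataloader never touches CUDA; the main process moves each whole batch to the GPU and runs M on it once.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset


class CpuOnlyDataset(Dataset):
    """Hypothetical dataset: file loading and CPU work only, no CUDA calls."""

    def __init__(self, n=8):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # Stand-in for "load an image and do CPU preprocessing".
        return torch.full((3, 16, 16), float(idx))


# Fall back to CPU so the sketch also runs on a machine without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the large model M (assumption: M accepts a batch of images).
M = nn.Conv2d(3, 3, kernel_size=3, padding=1).to(device).eval()


@torch.no_grad()
def preprocess_gpu(batch_cpu):
    # One transfer and one forward pass per batch, not per sample.
    batch = batch_cpu.to(device, non_blocking=True)
    return M(batch)


# num_workers=0 keeps the sketch simple; since the dataset never uses CUDA,
# it can be raised without needing the "spawn" start method for GPU access.
loader = DataLoader(
    CpuOnlyDataset(),
    batch_size=4,
    num_workers=0,
    pin_memory=torch.cuda.is_available(),  # pinned memory enables async copies
)

# No .item() or .cpu() calls inside the loop, which would force a GPU sync.
outputs = [preprocess_gpu(batch_cpu) for batch_cpu in loader]
```

In a real training loop you would pass each output straight to train(...) instead of collecting them in a list.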

Thanks for your reply!

But my use case (I’m trying model-based data augmentation) strictly requires that each sample be processed differently.
Since I can’t process the samples batch-wise, I believe multiprocessing is my last hope ;(