I'm facing a problem I can't seem to solve and am looking for any advice here…
In my use case, I have a special model M that preprocesses the input images inside the dataloader. The model is quite large, so it needs to run on the GPU to be fast enough.
In this case, I see two possible solutions:
- The straightforward one: use only the main process of the dataloader (no workers). This works, but it is very slow.
- Run the dataloader with multiple workers, calling torch.multiprocessing.set_start_method("spawn") so that the child processes can initialize CUDA. (This turned out to be about 15 times slower than option 1; see the sketch below.)
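To make option 2 concrete, here is a minimal sketch of the setup I am describing. The tiny M and ImageDataset below are placeholders standing in for my real (much larger) model and dataset:

```python
import torch
import torch.multiprocessing as mp
from torch.utils.data import Dataset, DataLoader


class M(torch.nn.Module):
    """Placeholder for the real preprocessing model, which is much larger."""

    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)


class ImageDataset(Dataset):
    """Hypothetical dataset: each item is pushed through M on the GPU."""

    def __init__(self, model):
        self.model = model  # the GPU-resident preprocessing model

    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        img = torch.rand(3, 224, 224, device="cuda")  # stand-in for a real image
        with torch.no_grad():
            # add/remove a batch dimension around the forward pass
            return self.model(img.unsqueeze(0)).squeeze(0).cpu()


if __name__ == "__main__":
    # "spawn" is required so that the worker processes can initialize CUDA;
    # the default "fork" start method cannot safely re-use a CUDA context.
    mp.set_start_method("spawn")
    model = M().cuda().eval()
    loader = DataLoader(ImageDataset(model), batch_size=8, num_workers=4)
    for batch in loader:
        pass  # the actual training consumes the batches here
```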
Since execution time is critical in my case, I keep looking for a faster implementation.
I originally believed that option 2 should be faster, but in reality it runs about 15 times slower than using the main process only.
It seems like the "spawn" start method significantly slows down child-process creation, and the dataloader does not reuse the worker processes it has already created (I can see the GPU memory usage repeatedly go up and down).
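Since the up-and-down memory pattern suggests the workers are torn down and re-created for every epoch, I wonder whether keeping them alive would at least amortize the spawn cost. As far as I understand, DataLoader has a persistent_workers flag for exactly this, though I have not verified that it helps in my case. A sketch reusing ImageDataset and model from the snippet above:

```python
# Variant of the loader above: persistent_workers=True keeps the spawned
# worker processes alive across epochs instead of re-creating them, so the
# expensive spawn + CUDA initialization should only be paid once.
loader = DataLoader(
    ImageDataset(model),
    batch_size=8,
    num_workers=4,
    persistent_workers=True,  # only valid when num_workers > 0
)

for epoch in range(10):
    for batch in loader:
        pass  # the same workers serve every epoch
```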
Any suggestion would be a great help!
I also tried calling M.share_memory() to share the model M among the child processes, but it does not seem to affect the execution speed at all.
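For reference, this is roughly how I invoked it (again reusing the placeholder M and ImageDataset from the first snippet). My understanding is that share_memory_() moves CPU parameter storage into shared memory and is documented to be a no-op for CUDA tensors, which might explain why it changes nothing here, but I am not certain:

```python
# Sketch of the share_memory attempt. Per the PyTorch docs,
# Tensor.share_memory_() is a no-op for CUDA tensors, so calling it on a
# model that already lives on the GPU may simply have no effect.
model = M().cuda().eval()
model.share_memory()  # no observable change in loading speed in my runs
loader = DataLoader(ImageDataset(model), batch_size=8, num_workers=4)
```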