Why not multiprocess pin_memory in data loader?

In the PyTorch DataLoader, there is a separate thread dedicated to pinning memory.
However, pinning memory can be CPU-intensive because it has to copy each tensor into page-locked memory. When the tensors are very large and the model computation is fast, this single pin_memory thread can become the bottleneck.
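For context, this is the kind of setup I mean (the dataset and sizes below are just made up to illustrate large samples):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset with fairly large samples, purely for illustration.
data = torch.randn(100, 3, 256, 256)
labels = torch.randint(0, 10, (100,))
dataset = TensorDataset(data, labels)

loader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=4,    # collation happens in the worker processes
    pin_memory=True,  # pinning happens in a single thread in the main process
)

for batch, target in loader:
    # Pinned memory allows asynchronous host-to-device copies.
    batch = batch.cuda(non_blocking=True)
```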

Why don't we use multiprocessing for pin_memory as well, just like for the workers?
Is there a specific reason or challenge?


Hi!

This is because, to send a Tensor from a worker process to the main process, we need to put it in shared memory so that both processes can access it. Unfortunately, it is not possible for memory to be both shared and pinned. So we have to send the tensor to the final process first, and only then can we pin it there.
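You can see the two properties are mutually exclusive with a small experiment (rough sketch; running it needs a CUDA-capable setup, since pinning goes through the CUDA host allocator):

```python
import torch

t = torch.randn(4, 4)
t.share_memory_()            # move the storage into shared memory (what workers do)
print(t.is_shared())         # True
print(t.is_pinned())         # False -- shared memory is not page-locked

p = t.pin_memory()           # pinning allocates new page-locked memory and copies
print(p.is_pinned())         # True
print(p.is_shared())         # False
print(p.data_ptr() == t.data_ptr())  # False -- it is a separate copy
```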

If this is a bottleneck for you, it should be relatively simple to run multiple pinning threads in the main process to speed things up when you have a lot of worker processes.
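Something along these lines, for example (a rough sketch of the idea, not how DataLoader does it internally; `pin_batch` and `pinned_iter` are made-up names):

```python
import torch
from concurrent.futures import ThreadPoolExecutor

def pin_batch(batch):
    # Handle a single tensor or a tuple/list of tensors.
    if isinstance(batch, torch.Tensor):
        return batch.pin_memory()
    return type(batch)(pin_batch(b) for b in batch)

def pinned_iter(loader, num_pin_threads=4, prefetch=4):
    """Yield batches from `loader`, pinned by a small pool of threads."""
    with ThreadPoolExecutor(max_workers=num_pin_threads) as pool:
        futures = []
        for batch in loader:
            futures.append(pool.submit(pin_batch, batch))
            # Keep a few batches in flight so pinning overlaps with training.
            if len(futures) >= prefetch:
                yield futures.pop(0).result()
        for f in futures:
            yield f.result()
```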