Using non_blocking=True with multiprocessing for efficient GPU pre-fetch

Hello. I am currently trying to implement asynchronous GPU pre-fetching with PyTorch’s DataLoader.

I am doing this because data I/O from disk is a major bottleneck in my task. I therefore want to have a dedicated process reading data from disk at all times, with no gaps.

The idea that I have now is to transfer the data to the GPU inside the PyTorch Dataset’s __getitem__ method. Inside __getitem__, I have code that looks like the following.

tensor = tensor.to('cuda:1', non_blocking=True)  # I have 2 GPUs
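
For context, here is a rough sketch of what the Dataset looks like. The class name, the file list, and the torch.load call are placeholders for illustration, not my actual loading code:

```python
import torch
from torch.utils.data import Dataset

class PrefetchingDataset(Dataset):
    """Sketch of my Dataset: reads a sample from disk, then pushes it to cuda:1."""

    def __init__(self, sample_paths, device='cuda:1'):
        self.sample_paths = sample_paths  # list of files on disk
        self.device = device

    def __len__(self):
        return len(self.sample_paths)

    def __getitem__(self, index):
        # Reading from disk is the expensive part I want to keep running without gaps.
        image, label = torch.load(self.sample_paths[index])
        # Transfer to the GPU from inside the DataLoader worker process.
        image = image.to(self.device, non_blocking=True)
        label = label.to(self.device, non_blocking=True)
        return image, label
```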

Meanwhile, I am using PyTorch’s multiprocessing as follows.

multiprocessing.set_start_method(method='spawn')

This line comes before the DataLoaders are initialized.

Also, I have set DataLoader(num_workers=1).
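
Putting the pieces together, the setup looks roughly like this (simplified; unpack_collate is the custom collate_fn I sketch further below):

```python
import torch.multiprocessing as multiprocessing
from torch.utils.data import DataLoader

if __name__ == '__main__':
    # Must run before any DataLoader (and hence any worker process) is created.
    multiprocessing.set_start_method(method='spawn')

    sample_paths = [...]  # files on disk
    dataset = PrefetchingDataset(sample_paths, device='cuda:1')
    loader = DataLoader(dataset, num_workers=1, collate_fn=unpack_collate)

    for images, labels in loader:
        ...  # training step on cuda:1
```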

It is my hope that this will allow the DataLoader to create a dedicated process that transfers data from the host (CPU) to the device (GPU) while it continues reading data from disk. Meanwhile, the memory transfer should be non-blocking, so that the worker created by the DataLoader can keep reading data from disk at all times without having to wait while data moves from CPU to GPU over the PCIe bus, which is a high-latency transfer in its own right.

However, I am not at all sure whether this is what PyTorch actually does. There are several things I do not know.

First, does non_blocking=True work inside a DataLoader worker when multiprocessing is being used? I do not know how cudaMemcpyAsync behaves under multiprocessing, which is especially confusing since data is also being transferred between processes. I also do not know under what conditions PyTorch’s non_blocking=True copy actually ends up blocking.
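
To get some evidence on the first question, I have been using a small standalone timing check in the main process (outside the DataLoader). As far as I understand, the source tensor has to be in pinned memory for the copy to be truly asynchronous, so the snippet pins it explicitly; whether the same behavior holds inside a spawned worker is exactly what I am unsure about:

```python
import time
import torch

# Pinned (page-locked) host memory is, as far as I understand, required for an
# asynchronous host-to-device copy; this is an assumption of the check below.
x = torch.randn(64, 3, 512, 512).pin_memory()

start = time.perf_counter()
y = x.to('cuda:1', non_blocking=True)   # should only enqueue the copy and return
returned_after = time.perf_counter() - start

torch.cuda.synchronize('cuda:1')        # block until the copy has actually finished
finished_after = time.perf_counter() - start

print(f'.to() returned after {returned_after * 1e3:.2f} ms, '
      f'copy finished after {finished_after * 1e3:.2f} ms')
```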

Second, does the PyTorch DataLoader wait until all worker processes have finished their work at each iteration? If so, does it also wait until the data transfer has finished? I do not see why this should be the case, but I do not know whether transferring tensors between processes causes such behavior.

I should mention that I am using a very simple custom collate_fn in the DataLoader that just returns the input unpacked into its separate components, without moving anything to shared memory, etc. I found this was necessary for GPU pre-fetching on multiple GPUs.
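
Concretely, the collate_fn is roughly the following (unpack_collate is my placeholder name; it is the function referenced in the setup sketch above):

```python
def unpack_collate(batch):
    # 'batch' is a list of (image, label) tuples that __getitem__ has already
    # moved to the GPU. Regroup them into separate components instead of letting
    # the default collate stack them and route them through shared memory.
    images, labels = zip(*batch)
    return list(images), list(labels)
```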

Finally, if the method I have proposed does not work, is there any way to implement a dedicated disk-reading process with PyTorch’s DataLoader? Ideally it should never be interrupted in its data-reading task.

Many thanks in advance to anyone who can help out. I know this requires a lot of in-depth PyTorch and CUDA knowledge, but I think many people would be interested, since data I/O is often a problem for large datasets, especially for those of us who cannot afford specialized hardware at the necessary scale.