I was reading through the PyTorch documentation for the DataLoader class and noticed that it recommends against DataLoaders returning CUDA tensors in multi-process data loading.
It is generally not recommended to return CUDA tensors in multi-process loading because of many subtleties in using CUDA and sharing CUDA tensors in multiprocessing (see CUDA in multiprocessing). Instead, we recommend using automatic memory pinning (i.e., setting pin_memory=True), which enables fast data transfer to CUDA-enabled GPUs.
However, I also notice that it says “generally”. I’m wondering if my particular case is an exception.
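For context, my understanding of the documented alternative is roughly the following (a sketch, not my actual pipeline — the dataset and shapes are made up; pin_memory=True pins the host memory so the later .to(device, non_blocking=True) copy can run asynchronously):

```python
import torch
from torch.utils.data import DataLoader

# Toy dataset: (feature tensor, label) pairs on the CPU.
dataset = [(torch.randn(4), i) for i in range(8)]

# Workers (if any) return pinned CPU tensors; no CUDA tensors cross processes.
loader = DataLoader(dataset, batch_size=4, pin_memory=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for xs, ys in loader:
    # Pinned host memory allows an asynchronous host-to-device copy.
    xs = xs.to(device, non_blocking=True)
    ys = ys.to(device, non_blocking=True)
```

This is the pattern the quoted paragraph is steering users toward: tensors stay on the CPU inside the loader and are moved to the GPU in the training loop.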
I have written a distributed training pipeline using the DistributedDataParallel framework. DataLoader objects are instantiated in each child process spawned by torch.multiprocessing.spawn(). One CUDA device is allocated per child process, and the device index is passed to each process so it knows which device to use. When the DataLoaders are created in these child processes, each one is also told which device it is operating on (I have written a custom DataLoader class), so that in the collate_fn of each data loader the tensors are created directly on the respective device.
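A minimal sketch of what I mean (simplified, with made-up names and a toy dataset; the real code lives inside the function passed to torch.multiprocessing.spawn(), and num_workers=0 so collation happens in the spawned process itself):

```python
import torch
from torch.utils.data import DataLoader

def make_collate_fn(device):
    """Collate raw (features, label) samples into batch tensors on `device`."""
    def collate(batch):
        xs = torch.stack([torch.as_tensor(x, dtype=torch.float32) for x, _ in batch])
        ys = torch.as_tensor([y for _, y in batch])
        # The batch lands directly on this rank's device; with num_workers=0
        # no CUDA tensor is ever handed across a process boundary.
        return xs.to(device), ys.to(device)
    return collate

def build_loader(rank, dataset, batch_size=2):
    # One device per spawned process; fall back to CPU when no GPU is present.
    device = torch.device(f"cuda:{rank}" if torch.cuda.is_available() else "cpu")
    return DataLoader(dataset, batch_size=batch_size,
                      collate_fn=make_collate_fn(device), num_workers=0)
```

Each spawned process would call build_loader(rank, dataset) with its own rank, so every batch comes out of the loader already on that process's device.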
In my case, I can’t see how creating the tensors directly on the CUDA device and returning them could be a problem. Each DataLoader always knows exactly which CUDA device its tensors need to be on, and since only one process uses each device, there shouldn’t be any CUDA memory-sharing issues, right?
I just wanted to double-check this with more knowledgeable people here on the forum to make sure I haven’t completely misunderstood the point.