DataLoader: is returning CUDA tensors always bad in distributed training?

yutanagano · March 1, 2022, 12:24am

Hi all,

I was reading through the pytorch documentation for the DataLoader class and I noticed that it recommends against DataLoaders returning CUDA tensors in distributed training.

It is generally not recommended to return CUDA tensors in multi-process loading because of many subtleties in using CUDA and sharing CUDA tensors in multiprocessing (see CUDA in multiprocessing). Instead, we recommend using automatic memory pinning (i.e., setting pin_memory=True), which enables fast data transfer to CUDA-enabled GPUs.

However, I also notice that it says “generally”. I’m wondering if my particular case is an exception.

I have written up a distributed training pipeline using the DistributedDataParallel framework. I have DataLoader objects that are instantiated in each child process spawned by torch.multiprocessing.spawn(). One CUDA device is allocated per child process, and the index of the device is passed to each of the processes so they know which device they should use. When the DataLoaders are created in each of these child processes, the DataLoaders also get told which devices they are operating on (I have written a custom DataLoader class) so that in the collate_fn of each of these data loaders, the tensors are created directly on their respective device.

In my case, I can’t see how creating the tensors directly on the CUDA device and returning that could be a bad thing. The dataloaders always know exactly which CUDA device the tensors need to be on, and since only one process uses each device there shouldn’t be any CUDA memory sharing problems right?

I just wanted to double check this with more knowledgable people here on the forum to make sure I haven’t completely misunderstood the point.

Cheers,
Yuta

ptrblck · March 1, 2022, 6:43am

I don’t think the quoted recommendation targets a distributed training setup, but the usage of multiprocessing e.g. via num_workers>0 in a DataLoader.

The main issues in such a use case is to avoid re-creating the CUDA context in each process, which will fail.

yutanagano · March 1, 2022, 10:55am

Ah, understood. Thank you for clarifying!