Failing to use .cuda() in __getitem__()

While using PyHeat to analyze the time cost of my code, I found that the Tensor.cuda() operation costs a lot of time, even more than running the net to generate a result. Here is a cut of my PyHeat report:

I tried calling Tensor.cuda() in my dataset's __getitem__ function, but it failed. I also tried setting torch.multiprocessing.set_start_method('spawn', force=True) and setting pin_memory=True and num_workers=4 in my DataLoader, but none of these worked. (If num_workers is 1, everything works fine, but it's too slow.)

So I searched the PyTorch docs and only found material on how to put my model on multiple GPUs, rather than how to put my data on one GPU with multiprocessing. In the official PyTorch examples, I still can't find any code that uses .cuda() in a dataset's __getitem__. I don't know whether I missed something in the docs, so I'm creating a topic here for help.

Briefly, I want to put my data on one GPU with multiprocessing and train my model on that same GPU. Are there any solutions?
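For reference, the pattern the answers below recommend can be sketched roughly like this: keep __getitem__ on the CPU, pin the batches, and move them to the GPU in the training loop with non_blocking=True so the copy can overlap with compute. The dataset and its shapes here are made up for illustration; only the DataLoader/transfer pattern is the point.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical dataset: __getitem__ returns CPU tensors only."""
    def __init__(self, n=64):
        self.data = torch.randn(n, 3)
        self.labels = torch.randint(0, 2, (n,))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # No .cuda() here: keep the GPU transfer out of the worker processes.
        return self.data[idx], self.labels[idx]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
loader = DataLoader(
    ToyDataset(),
    batch_size=16,
    num_workers=2,                              # multiprocess loading on CPU
    pin_memory=torch.cuda.is_available(),       # pinned memory enables async copies
)

for x, y in loader:
    # non_blocking=True only helps when the source tensor is in pinned memory.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    # ... forward/backward pass here ...
```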


The timing associated with .cuda() operations can be misleading.
The CUDA API is asynchronous, and calling .cuda() is one of the synchronization points. This means that if you do something GPU-intensive just before it, the time will appear to be spent in the .cuda() call.
You should run your code with CUDA_LAUNCH_BLOCKING=1 when measuring the CPU runtime of GPU operations.
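An alternative to CUDA_LAUNCH_BLOCKING inside a script is to call torch.cuda.synchronize() around the region being timed, so that previously queued kernels are not misattributed to a later synchronizing call such as .cuda(). A small sketch (the timed helper and the matmul workload are my own illustration, not from the thread):

```python
import time
import torch

def timed(fn):
    """Time fn, synchronizing before and after so that earlier queued
    GPU work is not silently counted inside (or outside) the window."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

x = torch.randn(256, 256)
_, elapsed = timed(lambda: x @ x)
print(f"matmul took {elapsed:.6f}s")
```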

Thanks, I will test again with your advice, but my problem is still not solved: since the DataLoader can load data using multiprocessing, why not let the DataLoader put the data on the GPU with multiprocessing too?


Because sharing CUDA tensors across processes is a nightmare, and because of the async nature of CUDA it isn't needed anyway.

Alright :joy:. By the way, where should I set CUDA_LAUNCH_BLOCKING=1?

This is an environment variable that should be set before CUDA is initialized. I usually do CUDA_LAUNCH_BLOCKING=1 python
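Concretely, that means prefixing the command that launches your script, so the variable is already in the environment when the Python process starts and CUDA initializes in blocking mode. A minimal sketch (the python3 -c command here just echoes the variable back to show it is set):

```shell
# One-off: set the variable only for this process.
CUDA_LAUNCH_BLOCKING=1 python3 -c 'import os; print(os.environ["CUDA_LAUNCH_BLOCKING"])'

# Or export it for the whole shell session:
export CUDA_LAUNCH_BLOCKING=1
```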