I am trying to optimize performance using the Spyder profiler. This line of my code,
z = torch.tensor(x).float().cuda(), seems to take too long, approximately 20 ms per image.
I’ve tried some other ways of initializing it:
z = torch.cuda.FloatTensor(x), and
z = torch.tensor(x).to(device); yet they both take slightly more time.
Average image size (x as a NumPy array): 210,000 bytes (~70,000 bytes per channel).
Note: no batching is used.
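For reference, the line can be timed in isolation. A sketch, assuming a made-up 3-channel uint8 image sized to roughly match the ~210,000 bytes above; the explicit synchronization matters because CUDA copies are asynchronous and can make profiler attribution misleading:

```python
import time

import numpy as np
import torch

# Hypothetical image chosen to match the post (~210,000 bytes, 3 channels).
x = np.random.randint(0, 256, size=(264, 264, 3), dtype=np.uint8)

device = "cuda" if torch.cuda.is_available() else "cpu"

def time_ms(fn, repeats=50):
    # Warm up once, then time; synchronize so asynchronous CUDA copies
    # are actually included in the measurement.
    fn()
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats * 1e3

# The original line vs. from_numpy (which avoids one host-side copy):
t_orig = time_ms(lambda: torch.tensor(x).float().to(device))
t_from = time_ms(lambda: torch.from_numpy(x).to(device).float())
print(f"torch.tensor: {t_orig:.2f} ms, torch.from_numpy: {t_from:.2f} ms")
```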
The profiler refers to this as the source of time depletion:
<method 'cuda' of 'torch._C._TensorBase' objects>
and probably this:
..\Anaconda3\envs\PyTorchEight\lib\site-packages\torch\cuda\__init__.py : 144
PyTorch 1.8.1; Python 3.8; CUDA version 11.2; Windows 10 Pro. Everything works well and the GPU does what it is supposed to do. Installed via (system CUDA 11.2 vs. toolkit 11.1):
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge
I believe that method both allocates space on the GPU for the tensor in question and copies it over to the device. It is going to take longer than most other operations because copying from system memory to the GPU is slower than doing operations entirely within memory.
That’s correct. It might be better to load the images into torch tensors first and then copy them to the device.
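A minimal sketch of that idea, assuming the image is already a NumPy array x: torch.from_numpy wraps the array's buffer without copying, and pinning the CPU tensor lets the host-to-device transfer run asynchronously over DMA.

```python
import numpy as np
import torch

x = np.random.rand(3, 264, 264).astype(np.float32)  # hypothetical image

# from_numpy shares the existing NumPy buffer instead of copying it.
cpu_t = torch.from_numpy(x)

if torch.cuda.is_available():
    # pin_memory() stages the tensor in page-locked memory, which makes
    # the subsequent transfer faster and lets it be non-blocking.
    gpu_t = cpu_t.pin_memory().cuda(non_blocking=True)
```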
I hope this is worth the hassle; I’m not sure how much of a speedup it would offer!
Well, if you can use the DataLoader class, that offers pinned memory, which allows for faster transfers, as well as loading in parallel worker processes.
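A sketch of that setup, with a hypothetical dataset standing in for the real image source; pin_memory=True makes the loader stage each batch in page-locked memory so the device copy can be non-blocking, and num_workers > 0 would additionally decode images in background worker processes.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ImageDataset(Dataset):
    """Hypothetical dataset returning one float image tensor per index."""
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        return torch.rand(3, 264, 264)

# pin_memory only helps (and only works) when a GPU is present.
loader = DataLoader(ImageDataset(), batch_size=4,
                    pin_memory=torch.cuda.is_available())

device = "cuda" if torch.cuda.is_available() else "cpu"
for batch in loader:
    # non_blocking lets the copy overlap with computation on pinned batches.
    batch = batch.to(device, non_blocking=True)
```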
Yes. I just gave it a go and noticed that
torchvision’s image-loading functions load images to the CPU, and then one has to copy them to the device (GPU). It would be nice if we could load directly to the GPU; I’m not sure whether that is feasible.
You mean the GPU loading directly from a drive? Nvidia has a separate technology for that (GPUDirect Storage), but it specifically requires hardware support for PCIe P2P and workstation or data centre accelerator cards, along with limited software support (Ubuntu only, for instance). I don’t know whether loading directly into the GPU without going through the CPU is doable in CUDA alone, though streaming from storage to GPUs is also possible thanks to texture streaming.
One thing to consider is whether the 20 ms it takes to transfer to the GPU will really be a bottleneck, compared to the processing time, once you are working on more than one file.
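One way to check, as a sketch: time the transfer alone with CUDA events and compare it against the per-image processing time (only meaningful when a GPU is actually present).

```python
import torch

x = torch.rand(3, 264, 264)  # hypothetical image tensor

if torch.cuda.is_available():
    x = x.pin_memory()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    y = x.cuda(non_blocking=True)
    end.record()
    torch.cuda.synchronize()  # wait so elapsed_time() is valid

    print(f"transfer alone: {start.elapsed_time(end):.3f} ms")
```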