I’m trying to optimize my network and have been using torch.utils.bottleneck. I’m now at the point where the profiler reports that my largest bottleneck is to.
I haven’t been able to find any documentation for this. What does this mean, exactly? And is this a good or bad sign in terms of optimization/GPU utilization?
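For reference, this is the standard way of invoking the profiler (my_script.py is just a placeholder for my actual training script):

```
python -m torch.utils.bottleneck my_script.py
```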
The to entry most likely refers to Tensor.to, i.e. moving tensors between devices, e.g. a_cpu_tensor.to('cuda'). Does your code move a lot of tensors back and forth between devices?
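As a minimal, hypothetical sketch of the pattern I mean (model and tensor names are made up), a per-iteration CPU-to-GPU copy like this would make to dominate the profile:

```python
import torch

device = torch.device('cuda')
model = torch.nn.Linear(1024, 1024).to(device)

for _ in range(100):
    x = torch.randn(64, 1024)   # batch is created on the CPU...
    y = model(x.to(device))     # ...and copied to the GPU every step
```

Creating the data on the GPU directly (or using pinned memory so the copy can overlap with compute) would remove most of that cost.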
In addition to @karmus89's answer:
your code might run asynchronous CUDA operations, and the .to operation can act as a synchronization point, so that the actual kernel runtimes get accumulated into the .to call.
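A minimal sketch of the effect (sizes are illustrative): the matmul kernel returns immediately, and the first synchronizing call afterwards — here a device-to-host .to — absorbs its runtime:

```python
import time
import torch

x = torch.randn(4096, 4096, device='cuda')

start = time.perf_counter()
y = x @ x                              # kernel is launched asynchronously
launch = time.perf_counter() - start   # tiny: only the launch overhead

start = time.perf_counter()
y_cpu = y.to('cpu')                    # synchronizes: waits for the matmul
sync = time.perf_counter() - start     # includes the matmul's runtime

print(f'launch: {launch:.6f}s, .to (incl. kernel): {sync:.6f}s')
```

Calling torch.cuda.synchronize() before each timing would attribute the kernel time to the right place.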
@karmus89 My code doesn't contain any explicit device-transfer instructions; all relevant tensors live on a single CUDA device.
If there is a slowdown in my code, it's likely due to either heavy indexing or calls to clone(). Any chance cloning could show up as to in the profiler? (I've sketched the indexing pattern I mean below.)
Thanks.
Note: I'm not explicitly calling to anywhere in my code.
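In case it helps, here is a hypothetical sketch of what I suspect is happening — as far as I understand, indexing a CUDA tensor with a CPU index tensor triggers an implicit host-to-device copy, which could surface as to even without an explicit call (names and shapes are made up):

```python
import torch

x = torch.randn(10_000, 512, device='cuda')
idx = torch.randint(0, 10_000, (256,))  # index tensor lives on the CPU

# The CPU index is implicitly copied to the GPU before the gather, so
# this line can register as a Tensor.to call in the profile even though
# .to is never written out explicitly.
rows = x[idx]

# Keeping the indices on the same device avoids the hidden transfer;
# clone(), by contrast, is a device-local copy and (as far as I know)
# should show up under its own name rather than as to.
idx_cuda = idx.to('cuda')
rows = x[idx_cuda]
```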