torch.utils.bottleneck says that `to` takes the most GPU time. What does this mean?

Hi all,

I’m trying to optimize my network and have been using `torch.utils.bottleneck`. I’m now at the point where the profiler reports `to` as my largest bottleneck.

I haven’t been able to find any documentation for this. What does this mean, exactly? And is this a good or bad sign in terms of optimization/GPU utilization?

Hi @noahtren!

The `to` entry most likely refers to moving tensors between devices, e.g. from CPU to GPU with `a_cpu_tensor.to('cuda')`. Does your code move tensors back and forth between devices a lot?

In addition to @karmus89’s answer:
CUDA operations run asynchronously, so the `.to` call might act as a synchronization point. In that case the time of the kernels launched before it is accumulated in the `.to` call, even though the copy itself may be cheap.
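
A rough sketch of that effect, in case it helps. It times a compute step and a following `.to("cpu")` with a host-side clock, once without and once with an explicit `torch.cuda.synchronize()` before the copy. The function name `timed_steps`, the tensor size, and the loop count are made up for illustration; on a machine without CUDA it falls back to the CPU, where everything is synchronous and the difference will not show.

```python
import time
import torch

def timed_steps(device, sync_before_copy):
    """Return (compute_seconds, copy_seconds) as seen by a host-side wall clock."""
    x = torch.randn(512, 512, device=device)

    start = time.perf_counter()
    y = x
    for _ in range(20):
        y = y @ x          # on CUDA this only enqueues a kernel and returns
        y = y / y.norm()   # keep the values bounded; also stays on the device
    if sync_before_copy and device.type == "cuda":
        # Wait for the queued kernels here, so their time is charged to "compute".
        torch.cuda.synchronize()
    mid = time.perf_counter()

    y_cpu = y.to("cpu")    # device-to-host copy; has to wait for pending kernels
    end = time.perf_counter()
    return mid - start, end - mid

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
timed_steps(device, sync_before_copy=False)  # warm-up run, result discarded

results = {}
for sync in (False, True):
    results[sync] = timed_steps(device, sync_before_copy=sync)
    compute_s, copy_s = results[sync]
    print(f"sync_before_copy={sync}: compute={compute_s:.6f}s  to('cpu')={copy_s:.6f}s")
```

On a GPU I would expect the run without the explicit synchronize to show a larger share of the time under `to('cpu')`, but the exact numbers depend on the hardware and the workload.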

@karmus89 my code doesn’t have any device-specific instructions. All relevant tensors are stored on a single CUDA device.

If there is a slow-down in my code, it’s likely due to either using a lot of indexing or calling `clone()`. Any chance cloning could show up as `to` in the profiler?

Thanks.

Note: I’m not explicitly calling `to` anywhere in my code.
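
One way to check this would be to profile each suspect operation in isolation and look at which operator names the autograd profiler records for it. The snippet below is only a sketch: the helper `recorded_ops` and the candidate operations are hypothetical stand-ins for whatever the real model does, and the exact operator names (for example `to` versus `aten::to`) vary between PyTorch versions.

```python
import torch
from torch.autograd import profiler

def recorded_ops(fn):
    """Run fn under the autograd profiler and return the operator names it recorded."""
    with profiler.profile() as prof:
        fn()
    return sorted({evt.key for evt in prof.key_averages()})

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1000, device=device, dtype=torch.float64)
idx = torch.arange(0, 1000, 2, device=device)

# Hypothetical candidates; replace them with the operations used in the model.
candidates = {
    "clone": lambda: x.clone(),
    "indexing": lambda: x[idx],
    "dtype conversion": lambda: x.float(),
}

ops_by_name = {name: recorded_ops(fn) for name, fn in candidates.items()}
for name, ops in ops_by_name.items():
    print(f"{name}: {ops}")
```

If one of the candidates lists an operator whose name ends in `to`, that operation is a likely source of the `to` entry in the bottleneck report.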