I’m trying to optimize my network and have been using torch.utils.bottleneck. I’m now at the point where the profiler reports that my largest bottleneck is to.
I haven’t been able to find any documentation for this. What does this mean, exactly? And is this a good or bad sign in terms of optimization/GPU utilization?
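For reference, this is the standard way of invoking the profiler (my_script.py is just a placeholder for my actual training script):

```
python -m torch.utils.bottleneck my_script.py
```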
The to entry most likely refers to Tensor.to, i.e. moving tensors between devices, e.g. a_cpu_tensor.to('cuda'). Does your code move a lot of tensors back and forth between devices?
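As a minimal, hypothetical sketch of the pattern I mean (model and tensor names are made up), a per-iteration CPU-to-GPU copy like this would make to dominate the profile:

```python
import torch

device = torch.device('cuda')
model = torch.nn.Linear(1024, 1024).to(device)

for _ in range(100):
    x = torch.randn(64, 1024)   # batch is created on the CPU...
    y = model(x.to(device))     # ...and copied to the GPU every step
```

Creating the data on the GPU directly (or using pinned memory so the copy can overlap with compute) would remove most of that cost.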
In addition to @karmus89's answer:
your code might run asynchronous CUDA operations, and the .to operation can act as a synchronization point, so that the actual kernel runtimes get accumulated into the .to call.
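A minimal sketch of the effect (sizes are illustrative): the matmul kernel returns immediately, and the first synchronizing call afterwards — here a device-to-host .to — absorbs its runtime:

```python
import time
import torch

x = torch.randn(4096, 4096, device='cuda')

start = time.perf_counter()
y = x @ x                              # kernel is launched asynchronously
launch = time.perf_counter() - start   # tiny: only the launch overhead

start = time.perf_counter()
y_cpu = y.to('cpu')                    # synchronizes: waits for the matmul
sync = time.perf_counter() - start     # includes the matmul's runtime

print(f'launch: {launch:.6f}s, .to (incl. kernel): {sync:.6f}s')
```

Calling torch.cuda.synchronize() before each timing would attribute the kernel time to the right place.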
@karmus89 My code doesn't contain any explicit device-transfer instructions; all relevant tensors live on a single CUDA device.
If there is a slowdown in my code, it's likely due to either heavy indexing or calls to clone(). Any chance cloning could show up as to in the profiler? (I've sketched the indexing pattern I mean below.)
Thanks.
Note: I'm not explicitly calling to anywhere in my code.
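In case it helps, here is a hypothetical sketch of what I suspect is happening — as far as I understand, indexing a CUDA tensor with a CPU index tensor triggers an implicit host-to-device copy, which could surface as to even without an explicit call (names and shapes are made up):

```python
import torch

x = torch.randn(10_000, 512, device='cuda')
idx = torch.randint(0, 10_000, (256,))  # index tensor lives on the CPU

# The CPU index is implicitly copied to the GPU before the gather, so
# this line can register as a Tensor.to call in the profile even though
# .to is never written out explicitly.
rows = x[idx]

# Keeping the indices on the same device avoids the hidden transfer;
# clone(), by contrast, is a device-local copy and (as far as I know)
# should show up under its own name rather than as to.
idx_cuda = idx.to('cuda')
rows = x[idx_cuda]
```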