I’m trying to optimize my network and have been using `torch.utils.bottleneck`. I’m now at the point where the profiler says my largest bottleneck is `to`. I haven’t been able to find any documentation for this. What does this mean, exactly? And is it a good or bad sign in terms of optimization/GPU utilization?
`to` seems to me to point towards moving tensors from CPU to GPU, i.e. `a_cpu_tensor.to('cuda')`. Are there a lot of device-related tensor transfers back and forth in your code?
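To illustrate the kind of antipattern meant here (a hypothetical loop, not code from the original question), repeated host-to-device copies inside a hot loop are a common reason for `to` dominating a profile:

```python
import torch

# Use CUDA if available, otherwise fall back to CPU so the sketch still runs.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

data = [torch.randn(256, 256) for _ in range(10)]  # CPU tensors

# Slow pattern: every iteration pays a CPU -> GPU transfer via .to().
total = torch.zeros(256, 256, device=device)
for t in data:
    total += t.to(device)  # transfer happens on each iteration

# Faster pattern: move the data to the device once, up front.
data_on_device = [t.to(device) for t in data]  # one-time transfers
total2 = torch.zeros(256, 256, device=device)
for t in data_on_device:
    total2 += t
```

Both loops compute the same result; the difference only shows up in the profiler, where the first version spends far more time inside `to`.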
In addition to @karmus89’s answer: your code might run asynchronous CUDA operations, and the `.to` operation might create a synchronization point, so that the actual kernel times get accumulated in the `to` call.
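A minimal sketch of why asynchronous execution can skew per-op timings (this assumes a CUDA device; on CPU the effect disappears because ops run synchronously):

```python
import time
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.randn(2048, 2048, device=device)

# CUDA kernels are launched asynchronously: on a GPU this loop returns
# almost immediately, before the matmuls have actually finished.
start = time.perf_counter()
for _ in range(10):
    y = x @ x
launch_time = time.perf_counter() - start

# A device-to-host copy forces a synchronization, so on a GPU the
# accumulated kernel time gets billed to this single .to('cpu') call.
start = time.perf_counter()
z = y.to('cpu')
sync_time = time.perf_counter() - start

# For honest manual timings, synchronize explicitly before reading the clock.
if torch.cuda.is_available():
    torch.cuda.synchronize()
```

So a large `to` entry in the profile can simply mean "this is where the program waited for the GPU", not that the copy itself is expensive.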
@karmus89 my code doesn’t have any device-specific instructions; all relevant tensors are stored on a single CUDA device. If there is a slow-down in my code, it’s likely due to either heavy use of indexing or calls to `clone()`. Any chance cloning could show up as `to` in the profiler?

Note: I’m not explicitly calling `to` anywhere in my code.
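One way to check where those `to` calls actually come from is the op-level profiler. Note that `to` is dispatched not only for device moves but also for dtype conversions, while `clone()` is recorded under its own name. A small CPU-only sketch (the tensor sizes and ops are illustrative):

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(512, 512)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    a = x.clone()                    # recorded as a clone op, not `to`
    b = x.to(torch.float64)          # dtype conversion also dispatches `to`
    c = x[torch.tensor([0, 3, 7])]   # advanced indexing is its own op

names = {evt.key for evt in prof.key_averages()}
print(sorted(names))  # inspect which ops actually ran
```

If you pass `with_stack=True` to `profile` and call `prof.key_averages(group_by_stack_n=5)`, the table also shows the Python call sites, which should pinpoint where the implicit `to` is triggered.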