I am profiling my code in the training loop during a single forward pass like the following:
with torch.autograd.profiler.profile(use_cuda=False) as prof:
    y = model(x)
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=25))
The top entry in the printed table is the to function, with 32% self CPU time and 4500 calls. This is weird to me, since nowhere in my model code do I call the .to(device) function (only earlier, before the profiling is started).
How can I find out which PyTorch call may internally call .to(device) without me doing so explicitly?
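One way to trace this is the profiler's with_stack option, which records the Python call stack for each op, so the averaged table shows where the to calls originate. A minimal sketch, assuming stock PyTorch; the Linear model and input shapes here are just placeholders for the real model:

```python
import torch

model = torch.nn.Linear(8, 8)   # placeholder for the real model
x = torch.randn(4, 8)

# with_stack=True records source locations for every recorded op
with torch.autograd.profiler.profile(with_stack=True) as prof:
    y = model(x)

# group_by_stack_n groups averages by the top N stack frames,
# so each row of the table shows which call site produced the op
table = prof.key_averages(group_by_stack_n=5).table(
    sort_by="self_cpu_time_total", row_limit=25)
print(table)
```

Rows coming from an implicit conversion will show the line in your (or PyTorch's) code that triggered it.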
Cheers
Edit: I just found the reason. When my model is already on a CUDA device and the forward pass is called in a profiling context with use_cuda=False, all tensors are implicitly sent back to the CPU. This does not happen when my model is not on a CUDA device in the first place.
Edit 2: Nevermind, the to usage is still high when using use_cuda=True and everything is already on the device beforehand.
No, the model is not wrapped in DataParallel (or anything else).
Which details do you need? I can’t publish the model code (yet) unfortunately, sorry.
It’s a deep model using multiple custom layers performing lots of sum, matmul, exp, einsum and log operations. Nothing fancy per se. We mainly use PyTorch to obtain the gradients via autograd.
The thing is that internally, we try to never call .to under the hood, to avoid performance pitfalls. DataParallel is an exception, as it needs to move data between GPUs.
So one forward pass in your model still calls .to() 4500 times?
Note as well that if you do many .float() or .long() calls, etc., they get converted to .to operators, I think.
Oh, the doc actually shows their equivalent .to() formulation.
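To make that concrete, the convenience methods are just shorthand for .to with a dtype, so they show up under to in the profiler. A small self-contained sketch:

```python
import torch

x = torch.arange(4)           # int64 tensor
a = x.float()                 # convenience method
b = x.to(torch.float32)       # equivalent .to() formulation

# both produce the same float32 tensor
assert a.dtype == b.dtype == torch.float32
assert torch.equal(a, b)
```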
If you set use_cuda=True, then every operation will block on the GPU, so you can see how long they take.
These calls make a full copy of the given Tensor every time they’re called. So if you do too many of them, they can slow you down, especially if your overall task is memory bound.