I am profiling my code during a single forward pass in the training loop, like this:

    with torch.autograd.profiler.profile(use_cuda=False) as prof:
        y = model(x)
    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=25))
The top entry in the printed table is the to function, with 32% of self CPU time and 4500 calls. This is odd to me, since nowhere in my model code do I call .to(device) (only earlier, before profiling starts).
Here is the output: https://pastebin.com/raw/Y6Yv3FGe
How can I find out which PyTorch call internally calls .to(device) without me doing so explicitly?
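One way to locate the call sites is to enable stack recording in the profiler, which attributes each op to the Python frames that invoked it. This is a sketch; the model and input below are placeholder stand-ins for the real ones:

```python
import torch
import torch.nn as nn

# Placeholder model and input standing in for the real ones
model = nn.Linear(128, 64)
x = torch.randn(32, 128)

# with_stack=True records the Python call stack for each op,
# so the table shows where each `to` call originates
with torch.autograd.profiler.profile(use_cuda=False, with_stack=True) as prof:
    y = model(x)

# group_by_stack_n groups the averages by the top N stack frames,
# attributing each `to` entry to its call site
table = prof.key_averages(group_by_stack_n=5).table(
    sort_by="self_cpu_time_total", row_limit=25)
print(table)
```

Scanning the "Source Location" column of that table for the `to` rows should reveal which line in the model (or in a library it calls) triggers the copies.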
Edit: I just found the reason. When my model is already on a CUDA device and the forward pass is called in a profiling context with use_cuda=False, all tensors are implicitly sent back to the CPU. This does not happen when the model is not on a CUDA device in the first place.
Edit 2: Never mind, the to usage is still high with use_cuda=True, even when everything is already on the device a priori.
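A quick sanity check that everything really is on a single device before profiling, so no implicit .to() copies should be needed in the forward pass (again with placeholder model and input, here on CPU):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 64)  # placeholder model
x = torch.randn(32, 128)    # placeholder input

# Collect the devices of all parameters, buffers, and the input;
# if the set has more than one element, something will be copied
devices = {p.device for p in model.parameters()}
devices |= {b.device for b in model.buffers()}
devices.add(x.device)
print(devices)  # ideally a single device
```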