Autograd profiler shows high usage of "to"

I am profiling to my code in the training loop during a single forward pass like the following:

with torch.autograd.profiler.profile(use_cuda=False) as prof:
    y = model(x)
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=25))

The top entry in the printed table is the function to with 32% self CPU total usage and 4500 calls. This is weird to me since nowhere in my model code I call the .to(device) function (only earlier on before the profiling is started).

Here is the output: https://pastebin.com/raw/Y6Yv3FGe

How can I find out which PyTorch call may internally calls .to(device) without me doing this explicitly?

Cheers

Edit: I just found the reason. When my model is already on a cuda device and the forward pass is being called in a profiling context where use_cuda=False, all Tensors are implicitly sent back to the CPU. The above does not happen when my model is not on a cuda device first.
Edit 2: Nevermind, the to usage is still high when using use_cuda=True and everything is already on the device a priori.

Hi,

use_cuda wil only impact the profiler: it will only allow the profiler to properly time cuda calls or not.

Is your model wrapped in a DataParallel or similar construct? Could you give more details about your model?

No the model is not wrapped in DataParallel (or anything else).

Which details do you need? I can’t publish the model code (yet) unfortunately, sorry.
It’s a deep model using multiple custom layers performing lots of sum, matmul, exp, einsum and log operations. Nothing fancy per se. We mainly use pytorch to obtain the gradients via autograd.

The thing is that internally. We try and never call .to under the hook to avoid performance pitfalls. DataParallel is an exception has it needs to move stuff between gpus.

So one forward in your model still calls 4500 times .to() ?
Note as well that if you do many .float() or .long(), etc, they get converted to .to operators I think.

We call .float() and .long() quite a few times. That explains it. I didn’t that that it calls .to internally (see https://pytorch.org/docs/stable/tensors.html#torch.Tensor.long).

Do you happen to know why these calls bother the cpu so much? The tensors they are called upon all reside on a cuda device.

Ho the doc actually shows their equivalent .to() formulation.

If you set use_cuda=True then every operation will block on the GPU. So you can see how long they take.
These calls make a full copy of the given Tensor every time they’re called. So if you do too many of them, they can slow you down. Especially if your overall task is memory bounded.

That’s good to know. Thanks for your support!

Well, I have to correct my above statement. We neither call .float() nor .long() in the forward pass that is being profiled.

Is there any way to trace, where a .to() call is being done internally?

I don’t think there is.
You can try to profiler subsets of your model to pinpoint the issue and send the problematic part if it is small enough.