.item() before .backward() makes execution much slower

I have the following piece of code (part of a training function):

loss = criterion(output, target)
loss.backward()
loss_scalar = loss.item()

I run it on a GPU without DataParallel, on the latest PyTorch (1.0.1.post2) with CUDA 10.

If I run it in this order, execution takes 40 seconds. However, if I move loss.item() before loss.backward(), the execution time blows up to 200 seconds, and most of that time is then spent in .backward(), as I can see from my profiler.

What could be the reason for that? Is there a preferred order for converting the loss to a scalar?

I think the loss.item() op creates a synchronization point, since the value has to be copied to the CPU, so your script has to wait for all queued CUDA kernels to finish their execution before it can continue.
Apparently, if you call it before loss.backward(), the backward kernels cannot be queued while the forward kernels are still running in the background, and you see worse performance.
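To illustrate, here is a minimal, self-contained timing sketch of the two orderings. The model, sizes, and iteration count are made up for the example; it falls back to CPU if no GPU is available, and on CPU you won't see a difference since kernels there run synchronously anyway. Note that torch.cuda.synchronize() is called before reading the clock, so each measurement covers all queued kernels:

```python
import time
import torch

# Toy model and data, just for demonstration; falls back to CPU without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 10).to(device)
criterion = torch.nn.CrossEntropyLoss()
data = torch.randn(64, 512, device=device)
target = torch.randint(0, 10, (64,), device=device)

def step(item_first: bool) -> float:
    model.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    if item_first:
        loss_scalar = loss.item()  # sync point: CPU blocks until forward kernels finish
        loss.backward()
    else:
        loss.backward()            # backward kernels are queued without waiting
        loss_scalar = loss.item()  # single sync at the end of the step
    return loss_scalar

# Time both orderings.
for item_first in (False, True):
    if device == "cuda":
        torch.cuda.synchronize()   # drain pending work so timing starts clean
    t0 = time.perf_counter()
    for _ in range(10):
        step(item_first)
    if device == "cuda":
        torch.cuda.synchronize()   # wait for all queued kernels before stopping the clock
    print(f"item_first={item_first}: {time.perf_counter() - t0:.4f}s")
```

If you don't need the scalar every iteration (e.g. you only log every N steps), you can also skip the .item() call entirely on the other iterations and avoid the synchronization altogether.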