Calling loss.item() is very slow

When calling loss.item() in the “standard” training loop (e.g. here), the operation takes a lot of time (~2 seconds). Is there a way to reduce this bottleneck? I understand that the problem might be that the loss lives on the GPU…


The operation itself should not be slow, but it synchronizes the code if you are using the GPU.
That is, since CUDA operations are executed asynchronously, their run time is attributed to the next synchronizing operation, which here is tensor.item(): it has to wait for the GPU computation to finish before it can grab the value and copy it to the CPU.
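
To see where the time actually goes, one option is to synchronize explicitly before reading the clock. The following is a minimal sketch, not code from the original post: `model`, `x`, `sync`, and `timed` are illustrative names, the model and data are made up, and the script falls back to the CPU when no GPU is available.

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def sync():
    # Wait for all queued GPU work; does nothing on CPU-only machines.
    if device.type == "cuda":
        torch.cuda.synchronize()

def timed(fn):
    """Run fn() and return (result, seconds), counting only fn's own work."""
    sync()                      # drain work that was queued before fn
    start = time.perf_counter()
    result = fn()
    sync()                      # wait for the kernels launched by fn
    return result, time.perf_counter() - start

# Illustrative model and data (assumptions, not from the thread).
model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(256, 1024, device=device)

loss, forward_s = timed(lambda: model(x).pow(2).mean())
value, item_s = timed(lambda: loss.item())
print(f"forward: {forward_s:.6f}s  item: {item_s:.6f}s")
```

With the explicit synchronization in place, the forward pass is charged for its own GPU time, and the remaining cost of loss.item() should be small.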

Thanks, is there any way to speed this up (by e.g. printing loss asynchronously when the operation is finished instead of waiting for the operation to finish)?

Did you try with num_workers=0? By doing that I think you will eliminate the need to synchronize the code, so it might be faster. For me that was the problem, but I had to do a lot of guessing before finding the solution. It might not be your solution, but maybe it will work! (num_workers is a parameter of your DataLoader.) (I have only been working with PyTorch for 6 months, so I might be wrong!)

No, there isn’t a way to speed this up, as you need to wait for the GPU to finish the loss calculation before you can print this value. In case the GPU isn’t finished yet, the print statement will wait for it.

If you do not print the value, but instead use a SummaryWriter for Tensorboard, is there a way to add the value to the writer in an asynchronous way (e.g. in a separate thread, when the GPU has finished fetching the loss, it will be added to the writer corresponding to the unique step specified in the add_scalar argument)?
I have not seen this anywhere, but it would be faster, especially when training times are long.
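
One way to approximate this without threads is to buffer the detached loss tensors and read them back in batches. The sketch below is hypothetical: `DeferredLossLogger`, `log_fn`, and `flush_every` are names invented for illustration, and the fake losses stand in for real ones. It does not remove the synchronization; it only makes it happen once per flush instead of once per step.

```python
import torch

class DeferredLossLogger:
    """Buffers detached loss tensors and reads them back in one batch."""

    def __init__(self, log_fn, flush_every=50):
        self.log_fn = log_fn            # called as log_fn(step, value)
        self.flush_every = flush_every
        self.pending = []               # list of (step, 0-dim tensor)

    def add(self, step, loss):
        # detach() drops the autograd graph; the tensor stays on its device.
        self.pending.append((step, loss.detach()))
        if len(self.pending) >= self.flush_every:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        steps, losses = zip(*self.pending)
        # One stack + one tolist() gives a single device-to-host transfer
        # (and a single synchronization) for the whole buffer.
        values = torch.stack(losses).tolist()
        for step, value in zip(steps, values):
            self.log_fn(step, value)
        self.pending.clear()

logged = []
logger = DeferredLossLogger(lambda step, value: logged.append((step, value)),
                            flush_every=4)
for step in range(10):
    # Stand-in for a real loss tensor that is attached to a graph.
    fake_loss = torch.tensor(float(step), requires_grad=True) * 2.0
    logger.add(step, fake_loss)
logger.flush()  # write out whatever is left at the end
```

With TensorBoard, `log_fn` could be something like `lambda step, value: writer.add_scalar("loss", value, step)`, so each value is still recorded under its own step.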

If your goal is to print the cumulative loss for each epoch, in order to analyze the model's performance in TensorBoard/WANDB, you might do something like this:

cumulative_loss = 0
with torch.set_grad_enabled(train):
    for batch_idx, (inputs, targets) in enumerate(data_loader):
      inputs, targets = inputs.to(device, non_blocking=True), targets.to(device, non_blocking=True)  # reconstructed line; the name "device" is assumed
      outputs = net(inputs) 
      loss = cost_function(outputs, targets) 
      loss2 = cost_function(outputs, targets)
      if train:
        loss.backward() # Backward pass
        optimizer.step() # Update parameters
        optimizer.zero_grad() # Resets the gradients

      #cumulative_loss += loss.item()  very slow
      cumulative_loss += loss2   #should not need synch

    cumulative_loss = cumulative_loss.item() #synch once per epoch instead of once per batch

After some experiments, the performance seems much better to me.

I’m not sure why you are calculating the loss twice, but note that accumulating the loss tensor without detaching it will increase memory usage, since the entire computation graph stays attached to it. This could yield out-of-memory errors (which users often call a “memory leak”), so you might want to call .detach() on loss2.


The reason is that I noticed that recomputing the loss is still faster than calling .item() on it. But actually you’re right. A cleaner version would replace the recomputation of the loss with a simple .detach(), thanks.
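
A minimal sketch of that cleaner version, assuming a toy model and random data (`net`, `optimizer`, `cost_function`, `data_loader`, and `device` are stand-ins here, not the original poster's objects):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Illustrative model, optimizer, and data (assumptions, not from the thread).
torch.manual_seed(0)
net = torch.nn.Linear(8, 1).to(device)
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
cost_function = torch.nn.MSELoss()
data_loader = [(torch.randn(16, 8), torch.randn(16, 1)) for _ in range(5)]

cumulative_loss = torch.zeros((), device=device)  # accumulator lives on the device
for inputs, targets in data_loader:
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    loss = cost_function(net(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # detach() keeps the value but not the graph; no host synchronization here.
    cumulative_loss += loss.detach()

epoch_loss = cumulative_loss.item()  # one synchronization per epoch
print(f"cumulative loss: {epoch_loss:.4f}")
```

The loss is computed only once per batch, and the accumulator never holds a reference to the computation graph.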

Doesn’t calling loss.backward() detach the graph? I am getting the exact same time whether or not I call loss.detach(). (Also in my case, the time is not too different from just doing loss.item() every time.)

No, loss.backward() will free the intermediate activations, but won’t detach the tensor from the computation graph. You might not see the large memory accumulation, but the proper approach is to detach the loss before storing it for printing purposes.
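
This can be checked directly. In the small sketch below (the tensor `w` and the toy loss are made up for illustration), the loss still has a grad_fn after backward(), while the detached copy does not:

```python
import torch

w = torch.randn(3, requires_grad=True)
loss = (w * 2.0).sum()
loss.backward()

# After backward() the loss is still part of the computation graph.
still_attached = loss.requires_grad and loss.grad_fn is not None

# detach() returns a tensor that shares the value but has no graph history.
detached = loss.detach()
is_detached = (not detached.requires_grad) and detached.grad_fn is None

print(still_attached, is_detached)
```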

I don’t think detaching the loss would result in a large speed difference, but would avoid accumulating memory.

This could mean that your code is already bottlenecked, e.g. by other synchronizations.
Profile the code using the PyTorch profiler or e.g. Nsight Systems to see where the bottleneck in the code is.
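
As a starting point, a short profiling run with torch.profiler might look like the sketch below. The model, the data, and the five-iteration loop are placeholders rather than the original training code, and CUDA activity is recorded only when a GPU is present.

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
activities = [ProfilerActivity.CPU]
if device.type == "cuda":
    activities.append(ProfilerActivity.CUDA)

# Illustrative model and data (assumptions, not from the thread).
model = torch.nn.Linear(512, 512).to(device)
x = torch.randn(64, 512, device=device)

with profile(activities=activities) as prof:
    for _ in range(5):
        loss = model(x).pow(2).mean()
        loss.backward()
        loss.item()  # the synchronizing call under investigation

# Summarize the recorded operators, most expensive (by CPU time) first.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(report)
```

In the resulting table, operations that block on the GPU tend to show a large CPU time relative to the work they do themselves, which helps locate other synchronization points.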