Reducing the memory footprint for multiple loss functions

Kimonili · December 8, 2021, 12:20pm

Hi all,

Say I calculate 4 losses for each batch as shown below:

losses_list = []
some_losses_fns = [loss_fn1,loss_fn2,loss_fn3,loss_fn4]
for loss_fn in some_losses_fns:
  # assume loss is computed 
  loss = loss_fn(preds, true_labels)
  losses_list.append(loss.item())
average_loss = torch.cat(losses_list).mean()
average_loss.requires_grad_()

Are the gradients of the losses going to be backpropagated properly if I call .item() on them and then call requires_grad_() on the average loss?

I am trying to reduce the memory footprint of my model and I found this suggestion online but I wanted to double check whether it is legit. Any other suggestions? Maybe del the loss variable after it is appended to the losses_list?

Thanks in advance!

Azerus · December 8, 2021, 12:53pm

Hello,

In short, it won’t work with item(). You need to keep tensors throughout, otherwise you’ll lose the computational graph (DAG) that autograd uses when you call backward (see the autograd mechanics section in the docs). The item() method returns a standard Python number, which is not compatible with backward.
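To make the difference concrete, here is a minimal sketch rather than a definitive recipe. The model, the random data and the four loss functions are hypothetical stand-ins for whatever loss_fn1 to loss_fn4 are in the original post. It uses torch.stack instead of torch.cat, because each loss is a 0-dim tensor and torch.cat rejects those (and it would not accept Python floats either).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the model, the batch and the four loss functions.
model = nn.Linear(4, 1)
inputs = torch.randn(8, 4)
true_labels = torch.randn(8, 1)
some_losses_fns = [nn.MSELoss(), nn.L1Loss(), nn.SmoothL1Loss(), nn.HuberLoss()]

# Variant 1: keep the losses as tensors so the graph back to the model survives.
preds = model(inputs)
losses_list = [loss_fn(preds, true_labels) for loss_fn in some_losses_fns]
average_loss = torch.stack(losses_list).mean()  # stack, since each loss is 0-dim
average_loss.backward()
grad_with_tensors = model.weight.grad.clone()

# Variant 2: .item() returns floats, so the rebuilt tensor is a fresh leaf
# with no connection to the model parameters.
model.zero_grad()
preds = model(inputs)
float_losses = [loss_fn(preds, true_labels).item() for loss_fn in some_losses_fns]
detached_average = torch.tensor(float_losses).mean().requires_grad_()
detached_average.backward()  # runs, but only fills detached_average.grad
grad_with_item = model.weight.grad  # None or all zeros, depending on the version
```

If you only need the numbers for logging, you can still call .item() on each loss after backward() has run (or on a detached copy); the point is just not to build the training loss from those floats.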

You can safely use del on the loss after it is appended to the losses_list, but my guess is you won’t notice any significant change in memory footprint.
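As a small illustration of why del changes little once the losses are kept as tensors: del only removes the local name, while the list still references the same tensor and its graph. The parameter w below is a hypothetical placeholder, not something from the original code.

```python
import torch

w = torch.ones(3, requires_grad=True)  # hypothetical parameter
losses_list = []

loss = (w * 2).sum()
losses_list.append(loss)
del loss  # removes only the name; the list keeps the tensor alive

# The stored tensor still carries its graph, so backward() still works.
still_has_graph = losses_list[0].grad_fn is not None
losses_list[0].backward()
```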

I don’t see any way to reduce the memory footprint on this part of your code.

Regards,
Thomas

[Edit] By the way, look at the documentation for the item method; it clearly states “This operation is not differentiable.”

Kimonili · December 8, 2021, 2:51pm

Hi @Azerus,

Thanks for your answer. Yep, my guess was that the item() method would destroy the DAG. I got the idea from the article “Memory Management, Optimisation and Debugging with PyTorch”.

I worked around the problem by using GPU nodes with more memory, so that my pods are not evicted, but of course at a higher price.