Dear all, I am implementing the GradNorm paper. I have found two implementations, but both of them require a lot of memory.

Let’s suppose I have a loss L and a network with weights W:

L = loss1 + loss2 + … + lossN

W are the weights with requires_grad = True

I want to do:

- compute (dloss1)/dW and save its gradient norm
- compute (dloss2)/dW and save its gradient norm
- compute (dL)/dW and update the parameters
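The steps above can be sketched as follows (a minimal toy example: the two-layer model and the loss functions are hypothetical stand-ins, and the intermediate `autograd.grad` calls still use `retain_graph=True`, which is exactly the memory cost I want to avoid):

```python
import torch

torch.manual_seed(0)
W = torch.nn.Linear(4, 4)        # shared weights, requires_grad=True by default
x = torch.randn(8, 4)
y = W(x)

loss1 = y.pow(2).mean()          # hypothetical stand-ins for the task losses
loss2 = (y - 1).abs().mean()
L = loss1 + loss2

norms = []
for L_i in (loss1, loss2):
    # dL_i/dW; the graph must be retained because it is needed again below
    g = torch.autograd.grad(L_i, W.parameters(), retain_graph=True)
    norms.append(torch.norm(torch.cat([t.flatten() for t in g])))

# the final backward frees the graph (retain_graph defaults to False)
L.backward()
```
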

An example implementation can be found at:

The interesting lines are:

```
# compute and retain gradients
total_weighted_loss.backward(retain_graph=True)

# GRADNORM - learn the weights for each task's gradients
# zero the w_i(t) gradients, since we want to update the weights using the gradnorm loss
self.weights.grad = 0.0 * self.weights.grad

W = list(self.model.mtn.shared_block.parameters())
norms = []
for w_i, L_i in zip(self.weights, task_losses):
    # gradient of L_i(t) w.r.t. W
    gLgW = torch.autograd.grad(L_i, W, retain_graph=True)
    # G^{(i)}_W(t)
    norms.append(torch.norm(w_i * gLgW[0]))
norms = torch.stack(norms)
```

BUT I want to avoid retain_graph=True, as it leads to out-of-memory errors for my network.

A similar and very good question has already been asked, but has no answer:

Is there a way to efficiently compute the derivatives w.r.t. the individual losses without retain_graph?

- e.g., calling backward() on copies of W (W1, W2) w.r.t. L1, then L2, …
- e.g., a multidimensional loss L = [l1, l2, lTOT] and then L.backward(Tensor), with the Tensor somehow selecting the right loss: [1,0,0], [0,1,0], [0,0,1]
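One way I can see to realize the first idea without retain_graph is to trade compute for memory: run a fresh forward pass per task, so each `autograd.grad` call can free its own graph. A hedged sketch (model and loss functions are hypothetical stand-ins):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 4)    # hypothetical shared network
x = torch.randn(8, 4)

# hypothetical per-task loss functions
loss_fns = [lambda o: o.pow(2).mean(), lambda o: (o - 1).abs().mean()]

norms = []
for f in loss_fns:
    out = model(x)               # fresh graph on every pass
    # no retain_graph: this graph is freed right here
    g = torch.autograd.grad(f(out), model.parameters())
    norms.append(torch.norm(torch.cat([t.flatten() for t in g])))

# one more forward pass for the total loss used to update the parameters
out = model(x)
total = sum(f(out) for f in loss_fns)
total.backward()
```

Peak memory is then that of a single-task graph, at the price of N+1 forward passes instead of one.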

Additionally, would it be possible to clarify how to inspect and visualize which tensors are responsible for memory leaks?
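For what it's worth, here is what I have tried so far for inspection (standard torch/gc utilities, nothing GradNorm-specific; the large tensor is just an example to find):

```python
import gc
import torch

# 1) CUDA allocator statistics, only meaningful when a GPU is present:
if torch.cuda.is_available():
    print(torch.cuda.memory_allocated())   # bytes currently held by tensors
    print(torch.cuda.memory_summary())     # detailed per-pool breakdown

# 2) Walk Python's garbage collector and list the largest live tensors:
big = torch.randn(1000, 1000)              # example tensor we expect to find
live = []
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj):
            live.append(obj)
    except Exception:
        pass                               # some gc-tracked objects resist inspection
for t in sorted(live, key=lambda t: -t.numel())[:5]:
    print(type(t).__name__, tuple(t.shape), t.dtype, t.device)
```
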

@albanD @colensbury could you help? Your opinions are very valuable.

Many thanks for your help,

Stefano