Collecting gradients for multiple losses?

I am trying to do something like this (simplified version of my code):

for x in range(1, 1000):
  output = model(data)
  # Change data
  loss = loss + F.nll_loss(output, target)

# Calculate gradients of model in backward pass
loss.backward()

# Collect gradients
final_result = final_result + myvar.grad.data

The problem is that a significant number of temporary variables are causing me to run out of GPU memory. Hence, is this next piece of code logically equivalent?

for x in range(1, 1000):
  output = model(data)
  loss = F.nll_loss(output, target)
  # Change data
  # Calculate gradients of model in backward pass
  loss.backward(retain_graph=True)

  # Collect gradients
  final_result = final_result + myvar.grad.data
  del loss
  del other_variables

If I understand how .backward and .grad.data work correctly, then the two versions should be equivalent. However, this is not the case for me, and I'm currently looking for the bug.

The addition of final_result = final_result + myvar.grad.data won't work if you don't zero out the gradients in each iteration.
Currently you are accumulating:

final_result = (grad0) + (grad0+grad1) + (grad0+grad1+grad2) + ...

since loss.backward() will already accumulate the gradients.
Another approach would be to let loss.backward() accumulate the gradients automatically and just assign final_result after the loop.
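To make the double-counting concrete, here is a minimal, self-contained sketch (a single hypothetical parameter w with hand-picked per-step gradients of 1, 2, 3, standing in for your model and myvar):

```python
import torch
import torch.nn as nn

# Sketch of the accumulation issue: backward() *sums* into .grad,
# so adding .grad after every iteration double-counts earlier gradients.
w = nn.Parameter(torch.ones(1))
final_result = torch.zeros(1)
for i in range(3):
    loss = (i + 1.0) * w        # d(loss)/dw = i + 1
    loss.backward()             # w.grad is now grad0 + ... + grad_i
    final_result = final_result + w.grad

# final_result = 1 + (1+2) + (1+2+3) = 10,
# while the true sum of per-step gradients is 1 + 2 + 3 = 6 (= w.grad)
print(final_result)  # tensor([10.])
print(w.grad)        # tensor([6.])
```

Note that w.grad itself ends up holding exactly the sum you want, which is why simply reading it once after the loop works.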


import torch
import torch.nn as nn
import torch.nn.functional as F

# 1
torch.manual_seed(2809)
model = nn.Linear(1, 2, bias=False)
loss = 0.
for _ in range(1000):
    x = torch.randn(1, 1)
    target = torch.randint(0, 2, (1,))
    output = model(x)
    loss = loss + F.nll_loss(output, target)
    
loss.backward()
final_grad1 = model.weight.grad

# 2
torch.manual_seed(2809)
model = nn.Linear(1, 2, bias=False)
for _ in range(1000):
    x = torch.randn(1, 1)
    target = torch.randint(0, 2, (1,))
    output = model(x)
    loss = F.nll_loss(output, target)
    loss.backward()
    
final_grad2 = model.weight.grad

print(torch.allclose(final_grad1, final_grad2))
> True

Is it still true that loss.backward was accumulating the gradients even though I was doing del loss in each loop iteration?

The deletion of the loss shouldn’t make a difference, as the gradients were already accumulated.
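A quick sketch to verify this (again with a hypothetical one-element parameter w): deleting the loss tensor frees the loss and its graph, but the values already summed into .grad stay put.

```python
import torch
import torch.nn as nn

# Deleting the loss does not touch the already-accumulated .grad.
w = nn.Parameter(torch.ones(1))
loss = 2.0 * w
loss.backward()
del loss           # frees the loss tensor, w.grad is untouched
print(w.grad)      # tensor([2.])

loss = 3.0 * w
loss.backward()    # accumulates on top of the existing grad
print(w.grad)      # tensor([5.])
```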


Sorry, one thing I forgot to add: I am doing model.zero_grad() after each iteration. Does this zero out myvar.grad.data?

model.zero_grad will zero out the gradients of all internal parameters.
If you've registered self.myvar = nn.Parameter(...), it should also be zeroed out.
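A minimal sketch to confirm this, assuming a hypothetical module with a registered parameter named myvar (matching the names in this thread):

```python
import torch
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered parameter: shows up in model.parameters()
        self.myvar = nn.Parameter(torch.ones(1))

    def forward(self, x):
        return self.myvar * x

model = MyModule()
model(torch.tensor([3.0])).backward()
print(model.myvar.grad)  # tensor([3.])

# set_to_none=False keeps the .grad tensor and fills it with zeros;
# note that newer PyTorch versions default to setting .grad to None instead.
model.zero_grad(set_to_none=False)
print(model.myvar.grad)  # tensor([0.])
```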
