Collecting gradients for multiple losses?

I am trying to do something like this (simplified version of my code):

for x in range(1, 1000):
  output = model(data)
  # Change data
  loss = loss + F.nll_loss(output, target)

# Calculate gradients of model in backward pass
loss.backward()

# Collect gradients
final_result = final_result + myvar.grad.data

The problem is that a significant number of temporary variables are causing me to run out of GPU memory. Hence, is this next piece of code logically equivalent?

for x in range(1, 1000):
  output = model(data)
  loss = F.nll_loss(output, target)
  # Change data
  # Calculate gradients of model in backward pass
  loss.backward(retain_graph=True)

  # Collect gradients
  final_result = final_result + myvar.grad.data
  del loss
  del other_variables

If I understand how .backward and .grad.data work correctly, then the two versions should be equivalent. However, this is not the case for me, and I'm currently looking for the bug.

The addition of final_result = final_result + myvar.grad.data won't work if you don't zero out the gradients in each iteration.
Currently you are accumulating:

final_result = (grad0) + (grad0+grad1) + (grad0+grad1+grad2) + ...

since loss.backward() will already accumulate the gradients.
Another approach would be to let loss.backward() accumulate the gradients automatically and just assign final_result after the loop.
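To make the double-counting concrete, here is a minimal, self-contained sketch (a single hypothetical parameter w with hand-picked per-step gradients of 1, 2, 3, standing in for your model and myvar):

```python
import torch
import torch.nn as nn

# Sketch of the accumulation issue: backward() *sums* into .grad,
# so adding .grad after every iteration double-counts earlier gradients.
w = nn.Parameter(torch.ones(1))
final_result = torch.zeros(1)
for i in range(3):
    loss = (i + 1.0) * w        # d(loss)/dw = i + 1
    loss.backward()             # w.grad is now grad0 + ... + grad_i
    final_result = final_result + w.grad

# final_result = 1 + (1+2) + (1+2+3) = 10,
# while the true sum of per-step gradients is 1 + 2 + 3 = 6 (= w.grad)
print(final_result)  # tensor([10.])
print(w.grad)        # tensor([6.])
```

Note that w.grad itself ends up holding exactly the sum you want, which is why simply reading it once after the loop works.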


import torch
import torch.nn as nn
import torch.nn.functional as F

# 1
torch.manual_seed(2809)
model = nn.Linear(1, 2, bias=False)
loss = 0.
for _ in range(1000):
    x = torch.randn(1, 1)
    target = torch.randint(0, 2, (1,))
    output = model(x)
    loss = loss + F.nll_loss(output, target)
    
loss.backward()
final_grad1 = model.weight.grad

# 2
torch.manual_seed(2809)
model = nn.Linear(1, 2, bias=False)
for _ in range(1000):
    x = torch.randn(1, 1)
    target = torch.randint(0, 2, (1,))
    output = model(x)
    loss = F.nll_loss(output, target)
    loss.backward()
    
final_grad2 = model.weight.grad

print(torch.allclose(final_grad1, final_grad2))
> True

Is it still true that loss.backward was accumulating the gradients even though I was doing del loss in each loop iteration?

The deletion of the loss shouldn’t make a difference, as the gradients were already accumulated.
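A quick sketch to verify this (again with a hypothetical one-element parameter w): deleting the loss tensor frees the loss and its graph, but the values already summed into .grad stay put.

```python
import torch
import torch.nn as nn

# Deleting the loss does not touch the already-accumulated .grad.
w = nn.Parameter(torch.ones(1))
loss = 2.0 * w
loss.backward()
del loss           # frees the loss tensor, w.grad is untouched
print(w.grad)      # tensor([2.])

loss = 3.0 * w
loss.backward()    # accumulates on top of the existing grad
print(w.grad)      # tensor([5.])
```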


Sorry, one thing I forgot to add: I am doing model.zero_grad() after each iteration. Does this zero out myvar.grad.data?

model.zero_grad will zero out the gradients of all internal parameters.
If you've registered self.myvar = nn.Parameter(...), it should also be zeroed out.
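A minimal sketch to confirm this, assuming a hypothetical module with a registered parameter named myvar (matching the names in this thread):

```python
import torch
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered parameter: shows up in model.parameters()
        self.myvar = nn.Parameter(torch.ones(1))

    def forward(self, x):
        return self.myvar * x

model = MyModule()
model(torch.tensor([3.0])).backward()
print(model.myvar.grad)  # tensor([3.])

# set_to_none=False keeps the .grad tensor and fills it with zeros;
# note that newer PyTorch versions default to setting .grad to None instead.
model.zero_grad(set_to_none=False)
print(model.myvar.grad)  # tensor([0.])
```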
