I am trying to do something like this (simplified version of my code):
for x in range(1, 1000):
    output = model(data)
    # change data
    loss = loss + F.nll_loss(output, target)
    # calculate gradients of the model in the backward pass
    loss.backward()
    # collect gradients
    final_result = final_result + myvar.grad.data
The problem is that a significant number of temporary variables are causing me to run out of GPU memory. Hence, is this next piece of code logically equivalent?
for x in range(1, 1000):
    output = model(data)
    loss = F.nll_loss(output, target)
    # change data
    # calculate gradients of the model in the backward pass
    loss.backward(retain_graph=True)
    # collect gradients
    final_result = final_result + myvar.grad.data
    del loss
    del other_variables
If I understand how .backward() and .grad.data work correctly, the two versions should be equivalent. However, they are not in my case, and I'm currently looking for the bug.
The addition final_result = final_result + myvar.grad.data won't work if you don't zero out the gradients in each iteration. Currently you are accumulating:

final_result = (grad0) + (grad0 + grad1) + (grad0 + grad1 + grad2) + ...

since loss.backward() will already accumulate the gradients.
Another approach would be to let loss.backward() accumulate the gradients automatically and just assign final_result after the loop.
# 1
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(2809)
model = nn.Linear(1, 2, bias=False)
loss = 0.
for _ in range(1000):
    x = torch.randn(1, 1)
    target = torch.randint(0, 2, (1,))
    output = model(x)
    loss = loss + F.nll_loss(output, target)
loss.backward()
final_grad1 = model.weight.grad
# 2
torch.manual_seed(2809)
model = nn.Linear(1, 2, bias=False)
for _ in range(1000):
    x = torch.randn(1, 1)
    target = torch.randint(0, 2, (1,))
    output = model(x)
    loss = F.nll_loss(output, target)
    loss.backward()
final_grad2 = model.weight.grad

print(torch.allclose(final_grad1, final_grad2))
> True
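For completeness, the first suggested fix (zeroing the gradients every iteration and summing them into final_result yourself) can be sketched like this, reusing the same toy model; the sum of the per-iteration gradients then matches the accumulated gradient from snippet #2:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(2809)
model = nn.Linear(1, 2, bias=False)

final_result = torch.zeros_like(model.weight)
for _ in range(1000):
    x = torch.randn(1, 1)
    target = torch.randint(0, 2, (1,))
    loss = F.nll_loss(model(x), target)
    loss.backward()
    final_result = final_result + model.weight.grad
    # Reset so the next backward() yields only this iteration's gradient
    model.zero_grad()
```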
Is it still true that loss.backward() was accumulating the gradients even though I did del loss in each loop iteration?
The deletion of the loss shouldn’t make a difference, as the gradients were already accumulated.
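A quick sanity check with a toy model (a minimal sketch, not the original code) shows that a parameter's .grad survives deleting the loss tensor:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(1, 2, bias=False)
x = torch.randn(1, 1)
target = torch.randint(0, 2, (1,))

loss = F.nll_loss(model(x), target)
loss.backward()
grad_before = model.weight.grad.clone()

del loss  # frees the loss tensor (and its graph), but not the accumulated .grad
```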
Sorry, one thing I forgot to add: I am calling model.zero_grad() after each iteration. Does this zero out myvar.grad.data?
model.zero_grad() will zero out the gradients of all internal parameters. If you've registered self.myvar = nn.Parameter(...), it should also be zeroed out.
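A small sketch of this, using a hypothetical MyModule that registers a custom parameter the same way:

```python
import torch
import torch.nn as nn

# Hypothetical module mirroring the self.myvar = nn.Parameter(...) registration
class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.myvar = nn.Parameter(torch.randn(3))

    def forward(self, x):
        return (x * self.myvar).sum()

model = MyModule()
model(torch.randn(3)).backward()
print(model.myvar.grad)  # populated by backward()

model.zero_grad()
# Depending on the PyTorch version, zero_grad() either fills .grad with
# zeros or resets it to None (set_to_none=True is the newer default).
print(model.myvar.grad)
```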