I am trying to do something like this (simplified version of my code):
for x in range(1, 1000):
    output = model(data)
    # Change data
    loss = loss + F.nll_loss(output, target)

# Calculate gradients of model in backward pass
loss.backward()

# Collect gradients
final_result = final_result + myvar.grad.data
The problem is that a significant number of temporary variables are causing me to run out of GPU memory. Hence, is this next piece of code logically equivalent?
for x in range(1, 1000):
    output = model(data)
    loss = F.nll_loss(output, target)
    # Change data

    # Calculate gradients of model in backward pass
    loss.backward(retain_graph=True)

    # Collect gradients
    final_result = final_result + myvar.grad.data
    del loss
    del other_variables
If I understand correctly how .backward() and .grad.data work, the two versions should be equivalent. However, this is not the case for me, and I’m currently looking for the bug.
The addition of final_result = final_result + myvar.grad.data won’t work if you don’t zero out the gradients in each iteration.
Currently you are accumulating:

final_result = grad_1 + (grad_1 + grad_2) + (grad_1 + grad_2 + grad_3) + ...

since loss.backward() will already accumulate the gradients in myvar.grad.
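As a quick sanity check, here is a minimal sketch of that fix, i.e. zeroing the gradient after reading it in each iteration (it reuses the dummy nn.Linear setup from the two examples below rather than your actual model and data):

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(2809)
model = nn.Linear(1, 2, bias=False)

final_result = torch.zeros_like(model.weight)
for _ in range(1000):
    x = torch.randn(1, 1)
    target = torch.randint(0, 2, (1,))
    output = model(x)
    loss = F.nll_loss(output, target)
    loss.backward()
    # .grad was zeroed at the end of the previous iteration, so it now
    # holds only the gradient of the current loss
    final_result = final_result + model.weight.grad
    # zero out the accumulated gradient before the next backward pass
    model.weight.grad.zero_()

With this pattern final_result should match the gradients computed by the two approaches below.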
Another approach would be to let loss.backward() accumulate the gradients automatically and just assign final_result once after the loop.
import torch
import torch.nn as nn
import torch.nn.functional as F

# 1: accumulate the loss and call backward() once after the loop
torch.manual_seed(2809)
model = nn.Linear(1, 2, bias=False)
loss = 0.
for _ in range(1000):
    x = torch.randn(1, 1)
    target = torch.randint(0, 2, (1,))
    output = model(x)
    loss = loss + F.nll_loss(output, target)
loss.backward()
final_grad1 = model.weight.grad

# 2: call backward() in every iteration and let .grad accumulate
torch.manual_seed(2809)
model = nn.Linear(1, 2, bias=False)
for _ in range(1000):
    x = torch.randn(1, 1)
    target = torch.randint(0, 2, (1,))
    output = model(x)
    loss = F.nll_loss(output, target)
    loss.backward()
final_grad2 = model.weight.grad

print(torch.allclose(final_grad1, final_grad2))
> True
model.zero_grad() will zero out the gradients of all parameters registered in the model.
If you’ve registered self.myvar = nn.Parameter(...), it should also be zeroed out.
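A minimal sketch to verify this behaviour (MyModule and its myvar attribute are just hypothetical stand-ins for your module):

import torch
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(1, 2, bias=False)
        # custom tensor registered as a parameter of the module
        self.myvar = nn.Parameter(torch.randn(2))

    def forward(self, x):
        return self.fc(x) + self.myvar

model = MyModule()
out = model(torch.randn(1, 1))
out.sum().backward()
print(model.myvar.grad)  # populated by the backward pass

model.zero_grad()
print(model.myvar.grad)  # zeroed, or None depending on the set_to_none default of your PyTorch version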