Hello Forum!
I tried a test where I expected to get “RuntimeError: Trying to backward
through the graph a second time,” but instead I get results that make no
sense to me.
The basic question: What should happen if you move part of the building
of the computation graph out of the optimization loop?
Here is the test script (with a couple of alternatives commented out):
import torch
print (torch.__version__)
a = torch.arange (9.0).reshape (3, 3)
a.requires_grad = True
opt = torch.optim.SGD ((a,), lr = 0.1)
b = a[:, 1].contiguous() # a is optimized (with stale gradient) but b remains unchanged
# b = a[:, 1] # works for first opt.step(), but then fails silently
for i in range (3):
    # b = a[:, 1] # works as expected inside the loop
    loss = (b[1] - 10)**2
    opt.zero_grad()
    loss.backward()
    opt.step()
    print ('a[1, 1]: ', a[1, 1])
    print ('b[1]: ', b[1])
And here is its output:
2.0.1
a[1, 1]: tensor(5.2000, grad_fn=<SelectBackward0>)
b[1]: tensor(4., grad_fn=<SelectBackward0>)
a[1, 1]: tensor(6.4000, grad_fn=<SelectBackward0>)
b[1]: tensor(4., grad_fn=<SelectBackward0>)
a[1, 1]: tensor(7.6000, grad_fn=<SelectBackward0>)
b[1]: tensor(4., grad_fn=<SelectBackward0>)
The first question is: why, when .backward() is called the second time through the loop, does
autograd not complain about part of the computation graph having been freed? I would have
thought that the first .backward() would have freed the connection between b and a (which is
not rebuilt inside of the optimization loop).
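For comparison, here is a minimal side sketch (separate from the test above) of the kind of
case where I do get the error I was expecting, because the second .backward() runs into saved
values that the first one freed:
import torch
x = torch.ones (3, requires_grad = True)
y = (x * x).sum()  # mul saves its inputs for the backward pass
y.backward()       # first backward frees the graph's saved intermediate values
y.backward()       # raises: RuntimeError: Trying to backward through the graph a second time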
Why do we get the results we do? It appears that opt.step() keeps using the gradient produced
by the first .backward(), and that the subsequent calls to .backward() don't update it (but
also don't issue an error).
(The second, commented-out alternative without the .contiguous() fails a little differently,
but also without an error message. In this case, a is updated just once (and that change is
reflected in b, which is a view into a) and then stays constant.)
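For reference, a minimal sketch of the pattern I would normally use, where b is rebuilt inside
the loop so that the whole graph from a to loss is fresh on every iteration; here things behave
as I would expect, with the loss shrinking and a[1, 1] moving toward 10:
import torch
a = torch.arange (9.0).reshape (3, 3)
a.requires_grad = True
opt = torch.optim.SGD ((a,), lr = 0.1)
for i in range (3):
    b = a[:, 1]                  # rebuilt every iteration
    loss = (b[1] - 10)**2
    opt.zero_grad()
    loss.backward()
    opt.step()
    print ('loss:', loss.item(), ' a[1, 1]:', a[1, 1].item())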
I see this in versions 2.0.1 and 1.11.0, so it’s not an obvious one-off regression.
[Edit: Adding another example script.]
I’ve added a modified version of the above script that highlights a particular
facet of the issue.
After loss is first computed, it no longer changes when a changes. This makes sense because b,
the link between a and loss, is not updated within the optimization loop; the line
b = a[:, 1].contiguous() is only executed once before the loop begins.
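To double-check that, here is a small side sketch (not part of the script below) that perturbs
a in place and re-evaluates the same expression built from the stale b; because .contiguous()
made b a copy, the value doesn't move:
import torch
a = torch.arange (9.0).reshape (3, 3)
a.requires_grad = True
b = a[:, 1].contiguous()          # b is a copy, not a view into a
print (((b[1] - 10)**2).item())   # 36.0
with torch.no_grad():
    a[1, 1] += 1.0                # change a in place
print (((b[1] - 10)**2).item())   # still 36.0; b didn't see the change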
Nonetheless, even though loss no longer depends on a inside the loop, a nonzero gradient flows
back from loss to a. (Note that a constant loss together with a nonzero gradient is clearly
mathematically incorrect.) It's as if the part of the computation graph that connects b to a
is still in operation, even though it should have been freed by the call to loss.backward().
Here is a script that prints out a.grad:
import torch
print (torch.__version__)
a = torch.arange (9.0).reshape (3, 3)
a.requires_grad = True
opt = torch.optim.SGD ((a,), lr = 0.1)
b = a[:, 1].contiguous() # a is optimized (with stale gradient) but b remains unchanged
for i in range (2):
    loss = (b[1] - 10)**2
    print ('i:', i, ' loss =', loss) # loss doesn't change (after first computed)
    print ('i:', i, 'before zero_grad()')
    print ('a = ...')
    print (a)
    print ('a.grad = ...')
    print (a.grad)
    opt.zero_grad()
    print ('i:', i, 'after zero_grad(), before backward()')
    print ('a = ...')
    print (a)
    print ('a.grad = ...')
    print (a.grad)
    loss.backward()
    print ('i:', i, 'after backward(), before step()')
    print ('a = ...')
    print (a)
    print ('a.grad = ...')
    print (a.grad) # grad is nonzero even though loss no longer depends on a
    opt.step()
    print ('i:', i, 'after step()')
    print ('a = ...')
    print (a) # a changes with every step
    print ('a.grad = ...')
    print (a.grad)
And here is its output:
2.0.1
i: 0 loss = tensor(36., grad_fn=<PowBackward0>)
i: 0 before zero_grad()
a = ...
tensor([[0., 1., 2.],
[3., 4., 5.],
[6., 7., 8.]], requires_grad=True)
a.grad = ...
None
i: 0 after zero_grad(), before backward()
a = ...
tensor([[0., 1., 2.],
[3., 4., 5.],
[6., 7., 8.]], requires_grad=True)
a.grad = ...
None
i: 0 after backward(), before step()
a = ...
tensor([[0., 1., 2.],
[3., 4., 5.],
[6., 7., 8.]], requires_grad=True)
a.grad = ...
tensor([[ 0., 0., 0.],
[ 0., -12., 0.],
[ 0., 0., 0.]])
i: 0 after step()
a = ...
tensor([[0.0000, 1.0000, 2.0000],
[3.0000, 5.2000, 5.0000],
[6.0000, 7.0000, 8.0000]], requires_grad=True)
a.grad = ...
tensor([[ 0., 0., 0.],
[ 0., -12., 0.],
[ 0., 0., 0.]])
i: 1 loss = tensor(36., grad_fn=<PowBackward0>)
i: 1 before zero_grad()
a = ...
tensor([[0.0000, 1.0000, 2.0000],
[3.0000, 5.2000, 5.0000],
[6.0000, 7.0000, 8.0000]], requires_grad=True)
a.grad = ...
tensor([[ 0., 0., 0.],
[ 0., -12., 0.],
[ 0., 0., 0.]])
i: 1 after zero_grad(), before backward()
a = ...
tensor([[0.0000, 1.0000, 2.0000],
[3.0000, 5.2000, 5.0000],
[6.0000, 7.0000, 8.0000]], requires_grad=True)
a.grad = ...
None
i: 1 after backward(), before step()
a = ...
tensor([[0.0000, 1.0000, 2.0000],
[3.0000, 5.2000, 5.0000],
[6.0000, 7.0000, 8.0000]], requires_grad=True)
a.grad = ...
tensor([[ 0., 0., 0.],
[ 0., -12., 0.],
[ 0., 0., 0.]])
i: 1 after step()
a = ...
tensor([[0.0000, 1.0000, 2.0000],
[3.0000, 6.4000, 5.0000],
[6.0000, 7.0000, 8.0000]], requires_grad=True)
a.grad = ...
tensor([[ 0., 0., 0.],
[ 0., -12., 0.],
[ 0., 0., 0.]])
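As a sanity check on those numbers, here is one more separate sketch (not part of the script
above) that compares the stale gradient left in a.grad with the gradient of a freshly rebuilt
loss after one step:
import torch
a = torch.arange (9.0).reshape (3, 3)
a.requires_grad = True
opt = torch.optim.SGD ((a,), lr = 0.1)
b = a[:, 1].contiguous()            # built once, as in the scripts above
loss = (b[1] - 10)**2
opt.zero_grad()
loss.backward()
opt.step()                          # a[1, 1] gets updated
print (a.grad[1, 1])                # stale gradient, tensor(-12.)
fresh_loss = (a[:, 1].contiguous()[1] - 10)**2   # rebuilt from the updated a
fresh_grad, = torch.autograd.grad (fresh_loss, (a,))
print (fresh_grad[1, 1])            # differs from -12., reflecting the updated a[1, 1]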
Context: I got confused / tripped up by this when exploring what I thought
would be a simple explanation / fix for the issue in the thread:
Thanks for any insight!
K. Frank