# Backpropagation for each dimension of output

Hello!

In PyTorch, when we call loss.backward() it performs backpropagation for the sample (in the stochastic case). Let’s say my output is 50-dimensional, and I have two loss components. The first one is an array of dimension 50. How can I run loss.backward() for each dimension separately? I also have another loss component which is a scalar, and I want to do normal backpropagation for this one on the whole output tensor, together with my first loss on each dimension of the output. Do I have to use a custom loss function?

Thanks a lot for your kind help!
Nil

Hi Nil!

I’m not sure that I understand your use case. In particular, what do you
mean by “run loss.backward() for each dimension separately?”

But, taking your question at face value, note that `loss.backward()`
accumulates the gradient of `loss` into the various `Parameter.grad`s.

So, for example:

```
loss_array[0].backward()   # possibly with retain_graph = True
loss_array[1].backward()
loss_array[2].backward()
```

gives the same final result for the relevant `.grad` values as does:

```
(loss_array[0] + loss_array[1] + loss_array[2]).backward()
```

and the latter will be more efficient.
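A tiny runnable sketch of this accumulation behavior (the one-parameter model and the `loss_array` values here are made up purely for illustration):

```python
import torch

# A single scalar parameter and three losses that share its graph.
w = torch.tensor(2.0, requires_grad=True)
loss_array = torch.stack([w * 1.0, w * 2.0, w * 3.0])

# Backward through each element; retain_graph keeps the shared graph alive.
loss_array[0].backward(retain_graph=True)
loss_array[1].backward(retain_graph=True)
loss_array[2].backward()
grad_three_calls = w.grad.clone()

# Same final .grad with a single backward over the sum.
w.grad = None
loss_array = torch.stack([w * 1.0, w * 2.0, w * 3.0])
(loss_array[0] + loss_array[1] + loss_array[2]).backward()
grad_one_call = w.grad.clone()

print(torch.allclose(grad_three_calls, grad_one_call))  # True
```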

Best.

K. Frank

Hi K. Frank,

Thanks for your response, and sorry for my late reply. My loss is an array of dimension 50. I know that in PyTorch the loss has to be a scalar value. Since my loss is an array, I can’t use the default backward() function. This is what I tried:

```
loss.backward()
optim.step()
```

This is the error I got:

```
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
```

I would really appreciate it if you could give me a direction on where I should look to make this special case work.

Thanks,
Nil

Hi Nil!

Presumably elements of your `loss` array share parts of the same
computation graph. That is, the computations of, for example, `loss[0]`
and `loss[1]` partially overlap. Calling `loss[0].backward()` deletes
`loss[0]`'s computation graph, including any parts of it that are shared
by `loss[1]`'s computation graph. So when you then call
`loss[1].backward()`, parts of `loss[1]`'s computation graph have
already been deleted, leading to the error you report.
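A minimal reproduction of this failure mode (the tensors here are made up; `shared` stands in for whatever part of the graph the loss elements have in common):

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
shared = w * w                        # part of the graph shared by both losses
loss = torch.stack([shared + 1.0, shared + 2.0])

loss[0].backward()                    # frees the shared graph's saved tensors
msg = None
try:
    loss[1].backward()                # walks the already-freed shared graph
except RuntimeError as err:
    msg = str(err)

print(msg is not None)  # True -- the second backward raises
```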

`retain_graph = True` tells autograd not to delete the computation graph,
so you could do something like:

```
optim.zero_grad()
for i in range (50):
    if i < 49:
        loss[i].backward (retain_graph = True)
    else:
        loss[i].backward()
optim.step()
```

(The final call to `loss[i].backward()` does not have `retain_graph = True`
because you do need to delete the computation graph at some point, typically
before calling `optim.step()` and / or performing the next forward pass.)

This is a perfectly reasonable way to use autograd and `.backward()`.
However, it’s likely to be inefficient, because you repeat (the shared part
of) the backward pass fifty times.

`loss.backward()` computes the gradient of `loss` with respect to the
parameters on which `loss` depends and accumulates that gradient into
those parameters’ `.grad` properties. But computing the gradient is a linear
operation (so that `grad_of (a + b) = grad_of (a) + grad_of (b)`).

So you are likely better off with:

```
optim.zero_grad()
loss_total = 0
for i in range (50):
    loss_total = loss_total + loss[i]
loss_total.backward()
optim.step()
```

This only performs a single backward pass (rather than fifty) and, up to
numerical round-off error, computes the same final gradient (as stored in
the various parameters’ `.grad` properties) as does the version that called
`.backward()` fifty times.
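As a sanity check of this equivalence, here is a sketch with a made-up `nn.Linear` model producing a length-50 loss array (the model, input, and squared-error losses are illustrative, not Nil's actual setup):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 50)
x = torch.randn(10)

# Version 1: fifty separate backward passes over the per-dimension losses.
loss = model(x) ** 2                           # length-50 loss "array"
for i in range(50):
    loss[i].backward(retain_graph=(i < 49))    # keep graph until the last pass
grads_loop = [p.grad.clone() for p in model.parameters()]

# Version 2: a single backward pass over the summed loss.
model.zero_grad()
loss_total = (model(x) ** 2).sum()
loss_total.backward()
grads_sum = [p.grad.clone() for p in model.parameters()]

for g_loop, g_sum in zip(grads_loop, grads_sum):
    print(torch.allclose(g_loop, g_sum, atol=1e-6))  # True for each parameter
```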

As an aside, you will probably also achieve additional efficiency (and code
cleanliness) if you can arrange your computation so that `gradient` is a
single one-dimensional pytorch tensor of length fifty that is computed all at
once with pytorch tensor operations rather than an array of fifty length-one
pytorch tensors that is computed entry by entry.
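For instance, assuming (purely for illustration) that each per-dimension loss is a squared error against some target:

```python
import torch

pred = torch.randn(50, requires_grad=True)
target = torch.randn(50)

# Entry by entry: a Python list of fifty scalar tensors (slow and clunky).
loss_list = [(pred[i] - target[i]) ** 2 for i in range(50)]

# All at once: a single length-50 tensor from one vectorized expression.
loss_vec = (pred - target) ** 2

print(torch.allclose(torch.stack(loss_list), loss_vec))  # True
```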

Best.

K. Frank

Thank you so much, Mr. Frank. This is really helpful. I am working on it and will update with what I find.
I found a mistake in my problem formulation. The gradient I mentioned is the gradient of the loss, not the loss itself. Is there a way to directly use the gradient (which is an array of dimension 50) in PyTorch?

Hi Nil!