I am trying to implement the gradient descent algorithm in a neural network. The network accumulates the loss over all the samples in the training set and then updates the model once, as in plain gradient descent. However, I am getting a RuntimeError that says:

“RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.”

for epoch in range(num_epochs):
    totallossg = 0
    totallossd = 0
    disc.zero_grad()
    gen.zero_grad()
    for iteration in range(4):
        for face in faces:
            noise = torch.randn(z_dim, z_dim)
            fake = gen(noise)
            disc_real = disc(face.to(torch.float32))
            disc_fake = disc(fake)
            lossD_real = criterion(disc_real, torch.ones_like(disc_real))
            lossD_fake = criterion(disc_fake, torch.zeros_like(disc_fake))
            lossD = (lossD_real + lossD_fake) / 2
            totallossd = totallossd + lossD
        totallossd = torch.div(totallossd, 5)
        disc.zero_grad()
        totallossd.backward()  # This line is generating the above said error
        opt_disc.step()

I have tried passing retain_graph=True to backward(). That changes the error to:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [4, 1]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True)

Am I doing something wrong here? Why am I getting these errors when it is a simple gradient descent implementation?

Based on the indentation of the code you posted, totallossd.backward()
is inside the for iteration in range(4): loop.

Consider:

In the first iteration of this loop you compute lossD (let’s call this lossD-1). You then accumulate lossD-1 into totallossd and call totallossd.backward(). This backpropagates through lossD-1’s computation graph and then frees the graph. But lossD-1 is still part of totallossd.

In the second iteration, you calculate a new lossD (let’s call it lossD-2) and create its computation graph. You then accumulate lossD-2 into totallossd and call totallossd.backward(). This backpropagates through lossD-2’s computation graph, which is fine, but because totallossd still contains lossD-1, the .backward() call tries to backpropagate a second time through lossD-1’s computation graph, which has already been freed. Hence the error.
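The sequence described above can be reproduced in a few lines. This is only a minimal sketch: `model` is a hypothetical stand-in for the discriminator and `total` plays the role of totallossd; neither comes from the original post.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for the discriminator: a single linear layer.
model = torch.nn.Linear(3, 1)

total = 0
error_message = None
for step in range(2):
    x = torch.randn(4, 3)
    loss = model(x).pow(2).mean()  # builds a fresh graph for this step
    total = total + loss           # total now references every step's graph
    try:
        # Walks all graphs reachable from total and frees their saved tensors.
        total.backward()
    except RuntimeError as exc:
        error_message = str(exc)
        break

print(step)           # index of the step at which backward() failed
print(error_message)
```

The first call succeeds because only one graph is involved. The second call fails because `total` still points at the first step's graph, whose saved tensors were released by the first call.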

If I understand your use case correctly, you want to move all of:

totallossd=torch.div(totallossd,5)
disc.zero_grad()
totallossd.backward() #This line is generating the above said error
opt_disc.step()

out of the for iteration in range(4): loop, which you can do by
reducing the indentation of that section of code by one unit.
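A sketch of the loop with the update moved out might look as follows. The models, sizes, optimizer, and the `faces` list are toy placeholders invented for illustration, and `num_terms` is a hypothetical counter used here to average over however many losses were actually accumulated; adjust the divisor to whatever averaging you intend.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy placeholders (assumptions, not the original models or data).
z_dim, face_dim = 4, 6
gen = nn.Linear(z_dim, face_dim)
disc = nn.Sequential(nn.Linear(face_dim, 1), nn.Sigmoid())
criterion = nn.BCELoss()
opt_disc = torch.optim.SGD(disc.parameters(), lr=0.1)
faces = [torch.randn(z_dim, face_dim) for _ in range(3)]

# Snapshot of the parameters, only used to confirm that the step changed them.
before = [p.detach().clone() for p in disc.parameters()]

totallossd = 0
num_terms = 0
for iteration in range(4):
    for face in faces:
        noise = torch.randn(z_dim, z_dim)
        fake = gen(noise)
        disc_real = disc(face.to(torch.float32))
        disc_fake = disc(fake)
        lossD_real = criterion(disc_real, torch.ones_like(disc_real))
        lossD_fake = criterion(disc_fake, torch.zeros_like(disc_fake))
        lossD = (lossD_real + lossD_fake) / 2
        totallossd = totallossd + lossD
        num_terms += 1

# One backward pass and one optimizer step, after all losses are accumulated.
totallossd = torch.div(totallossd, num_terms)
disc.zero_grad()
totallossd.backward()
opt_disc.step()

changed = any(
    not torch.equal(b, p.detach()) for b, p in zip(before, disc.parameters())
)
print(changed)
```

Because backward() is called exactly once, every per-sample graph is traversed a single time and no retain_graph=True is needed. The trade-off is that all of those graphs stay in memory until that one call.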

As an aside, I would think that in totallossd=torch.div(totallossd,5)
you would actually want to divide by 4, as totallossd has four quantities
summed into it in the for iteration in range(4): loop, rather than five.

Hello Jeet, @KFrank has excellently answered and solved your query.

I’d like to add some more details that’ll help you understand a bit more about PyTorch’s autograd engine.
In the first iteration of for iteration in range(4):, totallossd is composed of a single lossD value, which K. Frank refers to as lossD-1.

In this first iteration, right after totallossd.backward() backpropagates gradients, the references to saved tensors (essentially, references to the tensors needed to calculate the gradient of totallossd) are freed, but the graph itself still remains in memory.

In the second iteration, lossD-2 is formed and becomes a part of totallossd. When totallossd.backward() is called now, an error is thrown because autograd also tries to backprop through lossD-1’s graph, which is still in memory but no longer holds references to its saved tensors; that is, the tensors needed to compute the gradient have already been freed.

When you passed retain_graph=True, no saved tensors were freed. Now, note that in the second iteration those same tensors, viz. disc_real, disc_fake, lossD_real, lossD etc., are being used to eventually create totallossd, the tensor you are calling backward on.

Since the references to these saved tensors weren’t freed in the first iteration, autograd can detect that their values were changed in place before the second backward call.
This is tracked through a version counter on each tensor (exposed as the _version attribute), which is incremented every time the tensor is modified in place.

Hence, when the second backward call is made and autograd tries to backpropagate gradients through the graph, these tensors are at version 2, which differs from the version (1) that was recorded when they were saved during the forward pass.
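The version check can be seen in isolation with a single tensor. This is a minimal sketch, not the original training loop: `w` is a hypothetical parameter, and `_version` is a private attribute, so treat it as an implementation detail used here only for inspection.

```python
import torch

w = torch.ones(3, requires_grad=True)
y = (w * w).sum()                  # the multiplication saves w for backward
version_at_forward = w._version

with torch.no_grad():
    w.add_(1.0)                    # in-place update of a tensor the graph saved
version_after_update = w._version

error_message = None
try:
    y.backward()                   # autograd compares saved and current versions
except RuntimeError as exc:
    error_message = str(exc)

print(version_at_forward, version_after_update)
print(error_message)
```

In the loop from the question, opt_disc.step() is one operation that updates the discriminator’s parameters in place, so it is a likely candidate for the version bump once a graph from an earlier iteration is kept alive with retain_graph=True.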