backward() fails on the second iteration of my training loop

I run a forward and backward pass through my model inside a loop, but on the second iteration I get the following error:

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
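For reference, the usual way to trigger exactly this message is to call backward() twice on the same graph without retain_graph=True; here is a minimal standalone example (not my code, just to show what the message refers to):

import torch

net = torch.nn.Linear(4, 1)
criterion = torch.nn.MSELoss()

out = net(torch.randn(2, 4))
loss = criterion(out, torch.zeros(2, 1))

loss.backward()   # the first backward frees the graph's saved intermediate tensors
loss.backward()   # raises the RuntimeError quoted above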

I have looked into various suggestions online and made sure I don’t backpropagate twice within a single iteration, and I can’t see any point where the graph should need to be retained or released.

I have checked the error as carefully as I can, but I still have no solution. The relevant code is below (not everything is shown, since I don’t think the rest is necessary):

import queue
from threading import Thread

import torch

criterion = torch.nn.MSELoss().type(data_type)
optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)

image_queue = queue.Queue()
# a separate thread computes a denoised value (non_local_means) that the main thread uses later
denoise_thread = Thread(target=lambda q, f, p: q.put(f(*p)),  # q -> queue, f -> func, p -> params
                        args=(image_queue, non_local_means,
                              [benchmark_image.clone().squeeze().cpu().detach().numpy(), 3]))
denoise_thread.start()

temp_benchmark = benchmark_image.clone()  #  copy of the benchmark for intermediate calculations
for i in range(num_iter + 1):

    optimizer.zero_grad()

    out = net(net_input)
    temp = benchmark_image - lagrange_multiplier
    loss_net = criterion(out, decrease_image)     # this loss backward normally
    loss_red = criterion(out, temp)               # this is where things go wrong

    total_loss = loss_net + mu * loss_red
    total_loss.backward()                         # FAIL backward!!!

    # refresh the denoised benchmark every 30 iterations
    if i % 30 == 0:
        denoise_thread.join()
        temp_benchmark = image_queue.get()
        temp_benchmark = torch.from_numpy(temp_benchmark)[None, :].cuda()
        temp_benchmark.requires_grad_()
            
        # as before, launch a new thread to compute the next denoised value for the main thread
        denoise_thread = Thread(target=lambda q, f, p: q.put(f(*p)),
                                args=(image_queue, non_local_means,
                                      [benchmark_image.clone().squeeze().cpu().detach().numpy(), 3]))
        denoise_thread.start()

    benchmark_image = 1 / (beta + mu) * (beta * temp_benchmark + mu * (out + lagrange_multiplier))
    lagrange_multiplier = lagrange_multiplier + out - benchmark_image

    optimizer.step()
       
if denoise_thread.is_alive():
    denoise_thread.join()

Can you explain a bit more about what your training procedure is, or are you facing this error during the training loop? In any case, here are some of the common reasons you may not have a computation graph (a quick check is sketched below):

  1. you are computing inside a torch.no_grad() context.
  2. you have set the whole model to eval mode.
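For example, this is a quick way to see whether a graph is attached (a minimal standalone sketch, not your code):

import torch

x = torch.randn(3, requires_grad=True)

with torch.no_grad():
    y = (x * 2).sum()
print(y.requires_grad, y.grad_fn)   # False None -> no graph, y.backward() would fail

z = (x * 2).sum()
print(z.requires_grad, z.grad_fn)   # True <SumBackward0 ...> -> graph is attached

In your loop you could print total_loss.grad_fn right before total_loss.backward() to confirm the graph is still there.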

Hi, two things you can try that immediately jump out at me:
1) update your benchmark after you have done the optimizer.step().
2) set temp_benchmark.requires_grad = False and then calculate, if that's what you are going for.

What I assume is going on is that there must be some kind of disconnect in the computation graph during the benchmark-update step.
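Roughly what I mean, as a sketch against the names in your loop (a fragment, not runnable on its own; just illustrating the two suggestions):

# 1) call optimizer.step() first, then rebuild the benchmark
optimizer.step()
benchmark_image = 1 / (beta + mu) * (beta * temp_benchmark + mu * (out + lagrange_multiplier))
lagrange_multiplier = lagrange_multiplier + out - benchmark_image

# 2) keep the denoised benchmark out of autograd entirely
temp_benchmark = torch.from_numpy(image_queue.get())[None, :].cuda()
temp_benchmark.requires_grad_(False)   # instead of requires_grad_()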

I have tried both of the methods you mentioned, but unfortunately the problem still exists and nothing has changed.

Before calling optimizer.zero_grad(), set the model to training mode with net.train() at the start of every epoch.
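i.e., something like this at the top of the training loop (a minimal sketch using the names from the snippet above):

for i in range(num_iter + 1):
    net.train()              # make sure the model is in training mode before each pass
    optimizer.zero_grad()
    out = net(net_input)
    ...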

Unfortunately, the problem still exists.

Thanks to all the kind people who offered help. After repeatedly modifying the code, I found a way to solve the problem, which is to detach the loss's target tensor from the graph, as shown below:

loss_red = criterion(out, temp.detach_())  # What a simple but wonderful solution

Although this has solved the problem, I still don’t fully understand why it occurs; my rough guess is sketched below, but perhaps someone can confirm or explain the real reason.
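Here is a standalone toy loop with the same structure as mine (net, x, and target are placeholders, not my real code): the target is rebuilt from out at the end of each iteration, so on the next iteration the loss target still references the previous, already freed graph, and backward() tries to walk it a second time. Detaching the target cuts that link, which seems to be why the fix above works.

import torch

net = torch.nn.Linear(4, 4)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

x = torch.randn(1, 4)
target = torch.zeros(1, 4)        # plays the role of benchmark_image

for i in range(3):
    optimizer.zero_grad()
    out = net(x)
    # from the second iteration on, `target` still carries the previous (freed) graph,
    # so this backward() raises the error; criterion(out, target.detach()) fixes it
    loss = criterion(out, target)
    loss.backward()
    optimizer.step()
    target = 0.5 * (out + target)  # rebuilt from `out`, like benchmark_image in my loop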

Anyway, it is a pleasure to have this trouble solved. LOL
