Trying to debug RuntimeError: Trying to backward through the graph a second time

Hello, I am trying to implement an architecture from the Cam2BEV paper (https://arxiv.org/pdf/2005.04078.pdf), and I have been hitting the following error for days without being able to figure out its origin:

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed).

As a brief description: the input is a list of segmented images, all of the same size, and the target is a segmented image of the same size. The issue seems to lie in the way I wrote different parts of my model rather than in the training loop, but here is a snippet of the training code anyway:

for epoch in range(num_epochs):
    model.train()
    
    running_loss = 0.0
    running_acc = 0.0
    for images, labels in train_loader:

        labels = labels.max(dim=1)[1]  # argmax over the class dimension: one-hot masks -> class indices
        
        optimizer.zero_grad()
        
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        batch_acc = metric(outputs.detach(), labels.detach()).item()
        running_acc += batch_acc
        
    train_loss = running_loss/len(train_loader)
    train_acc = running_acc/len(train_loader)
    
    print(f"Epoch {epoch+1}/{num_epochs} - Training Loss: {train_loss:.4f} - Training Accuracy: {train_accuracy:.4f}")

The models can be found here: Cam2EBV/models.py at main · AlaaBenZekri/Cam2EBV · GitHub

Here is a more detailed error message:

C:\Users\alaae\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\__init__.py:200: UserWarning: Error detected in CudnnBatchNormBackward0. Traceback of forward call that caused the error:
  File "C:\Users\alaae\Desktop\PFE\Code\train.py", line 58, in <module>
    outputs = model(images)
  File "C:\Users\alaae\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\alaae\Desktop\PFE\Code\models.py", line 211, in forward
    joiner_output = self.joiner(encoder_outputs)
  File "C:\Users\alaae\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\alaae\Desktop\PFE\Code\models.py", line 148, in forward
    t = self.joiner_layers[d](t)
  File "C:\Users\alaae\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\alaae\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\container.py", line 217, in forward
    input = module(input)
  File "C:\Users\alaae\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\alaae\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\batchnorm.py", line 171, in forward
    return F.batch_norm(
  File "C:\Users\alaae\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\functional.py", line 2450, in batch_norm
    return torch.batch_norm(
 (Triggered internally at ..\torch\csrc\autograd\python_anomaly_mode.cpp:119.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "C:\Users\alaae\Desktop\PFE\Code\train.py", line 61, in <module>
    loss.backward()
  File "C:\Users\alaae\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "C:\Users\alaae\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

I’ve been stuck at this issue for a week and I would really appreciate any kind of help.

I’m not familiar with the model architecture, but my guess is that you would want to clear self.joiner_outputs before the loop in the Joiner’s forward pass (models.py line 148 in your traceback).

As currently written, it looks like the Joiner module keeps its output alive by appending it to a list on each iteration. But after the first backward call, the graph for that output is freed, even though the output is reused in the next iteration as an input to the decoder. This looks strange, and if it is your actual intent you would either want to append a clone of the output instead (likely with .detach() as well) or call backward with retain_graph=True. The second option applies if you want gradients to keep flowing through the previously computed outputs on each iteration, but it increases memory usage.
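
To make the failure mode concrete, here is a minimal, self-contained sketch of that pattern (the module and layer names are made up for illustration, they are not taken from your models.py): a member list keeps graph-attached outputs alive across iterations, and the second backward() raises the same error as in your traceback.

import torch
import torch.nn as nn

class ToyJoiner(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(4, 4)
        self.joiner_outputs = []  # member list, never cleared between forward calls

    def forward(self, x):
        out = self.lin(x)
        self.joiner_outputs.append(out)  # keeps the previous iteration's graph alive
        return torch.stack(self.joiner_outputs).sum(dim=0)

model = ToyJoiner()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(2):
    optimizer.zero_grad()
    loss = model(torch.randn(2, 4)).sum()
    loss.backward()  # iteration 2 backprops through the already-freed graph of iteration 1
    optimizer.step()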

I suspect the issue is that joiner_outputs should be recomputed from scratch every time and shouldn’t be a member variable that is kept alive across every iteration.
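
If it helps, here is the same toy sketch with the member list reset at the top of forward(), which is the kind of fix I mean (again, the names are illustrative, not your actual code):

import torch
import torch.nn as nn

class ToyJoinerFixed(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(4, 4)
        self.joiner_outputs = []

    def forward(self, x):
        self.joiner_outputs = []  # rebuilt from scratch on every forward call
        out = self.lin(x)
        self.joiner_outputs.append(out)
        return torch.stack(self.joiner_outputs).sum(dim=0)

model = ToyJoinerFixed()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(2):
    optimizer.zero_grad()
    loss = model(torch.randn(2, 4)).sum()
    loss.backward()  # runs cleanly: each iteration backprops through a fresh graph
    optimizer.step()

A local list inside forward() would work just as well; the point is that nothing graph-attached survives past the backward() call of its own iteration.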


Actually yes, my mistake was that I didn’t clear the output of the joiner model at each iteration; kinda stupid on my part :sweat_smile: