Conv2d backward always results in NaN

I’ve only started using Torch very recently, so please forgive me if I’ve made some obvious error. I have been trying to convert an old TensorFlow notebook of mine to Torch, but I am currently running into what I see as an odd issue: calling backward() on my network’s loss always produces NaN gradients (which then causes the weights to become NaN after a single optimizer step). The forward pass works fine, and from what I can tell from stepping through the process manually, the NaNs originate solely in the Conv2d layer. Any help figuring out what the problem is would be greatly appreciated.

Because it’s practically impossible to debug something by description alone, here’s a tiny reproduction of the problematic section of my code.

device = "cuda" # Fails for both "cuda" and "cpu" on a Colab environment

binary_crossentropy = nn.BCELoss()

def discriminator_loss(real_guess, fake_guess):
  loss_for_target = binary_crossentropy(torch.ones_like(real_guess), real_guess)
  loss_for_predicted = binary_crossentropy(torch.zeros_like(fake_guess), fake_guess)

  return loss_for_target + loss_for_predicted

with torch.autograd.detect_anomaly():
  class MiniModel(nn.Module):
    def __init__(self):
      super(MiniModel, self).__init__()
      self.last = nn.Conv2d(512, 1, 4, 1)
    def forward(self, x):
      return self.last(x)
  
  mm = MiniModel().to(device)
  opt = optim.Adam(mm.parameters())
  inp1 = torch.rand((16, 512, 14, 14)).to(device)
  inp2 = torch.rand((16, 512, 14, 14)).to(device)
  
  out = discriminator_loss(mm.last(inp1), mm.last(inp2))
  print(out)

  grads = out.backward()
  opt.step()
  print(list(mm.parameters()))

The results:

tensor(100.1079, device='cuda:0', grad_fn=<AddBackward0>)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-52-e77288483237> in <module>()
     25   print(out)
     26 
---> 27   grads = out.backward()
     28   opt.step()
     29   print(list(mm.parameters()))

/usr/local/lib/python3.7/dist-packages/torch/_tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    305                 create_graph=create_graph,
    306                 inputs=inputs)
--> 307         torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    308 
    309     def register_hook(self, hook):

/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    154     Variable._execution_engine.run_backward(
    155         tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 156         allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
    157 
    158 

RuntimeError: Function 'CudnnConvolutionBackward0' returned nan values in its 1th output.

Interestingly (at least for me), when I use the cpu device instead of cuda, detect_anomaly() does not raise a RuntimeError, but the weights are still updated to NaN:

[Parameter containing:
tensor([[[[nan, nan, nan, nan],
          [nan, nan, nan, nan],
          [nan, nan, nan, nan],
...
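Stepping through manually, I also looked at the parameters and their gradients right after backward() but before opt.step(); a rough sketch of that check (names refer to the snippet above):

# Quick sanity check: do any parameters or their gradients contain NaN?
for name, p in mm.named_parameters():
  weight_has_nan = torch.isnan(p).any().item()
  grad_has_nan = p.grad is not None and torch.isnan(p.grad).any().item()
  print(name, "weight NaN:", weight_has_nan, "grad NaN:", grad_has_nan)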

As a final note, I have verified that the weights are not initially NaNs; I just didn’t think it necessary to include the code showing that here.

Again, any help would be greatly appreciated; I’m really liking Torch’s API so far, but perhaps I should have started with a simpler model.

Try making the following changes:

inside your loss fn:

  loss_for_target = binary_crossentropy(real_guess, torch.ones_like(real_guess).to(device))  # !! invert order of input and target
  loss_for_predicted = binary_crossentropy(fake_guess, torch.zeros_like(fake_guess).to(device))  # !! invert order of input and target

inside your model, make sure you return values between 0 and 1 (else we can’t compute BCELoss). something like this:
return nn.Sigmoid()(self.last(x)) # !! force values to be in (0, 1) interval

finally, make sure you apply your entire model, not just one layer:
out = discriminator_loss(mm(inp1), mm(inp2)) # !! applying the entire model, not just the "last"

With these changes, the code runs for me.
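Putting it all together, here is a sketch of the corrected snippet (I’ve used torch.sigmoid in forward, which is equivalent to nn.Sigmoid() here):

import torch
import torch.nn as nn
import torch.optim as optim

device = "cuda" if torch.cuda.is_available() else "cpu"
binary_crossentropy = nn.BCELoss()

def discriminator_loss(real_guess, fake_guess):
  # input first, target second
  loss_for_target = binary_crossentropy(real_guess, torch.ones_like(real_guess))
  loss_for_predicted = binary_crossentropy(fake_guess, torch.zeros_like(fake_guess))
  return loss_for_target + loss_for_predicted

class MiniModel(nn.Module):
  def __init__(self):
    super().__init__()
    self.last = nn.Conv2d(512, 1, 4, 1)
  def forward(self, x):
    return torch.sigmoid(self.last(x))  # keep outputs in (0, 1) for BCELoss

mm = MiniModel().to(device)
opt = optim.Adam(mm.parameters())
inp1 = torch.rand((16, 512, 14, 14), device=device)
inp2 = torch.rand((16, 512, 14, 14), device=device)

out = discriminator_loss(mm(inp1), mm(inp2))  # apply the whole model
out.backward()
opt.step()
print(list(mm.parameters()))  # weights stay finite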


You are passing the arguments to the criterion in the wrong order and should also use nn.BCEWithLogitsLoss based on the model output.
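I.e. something along these lines (a sketch): keep the raw conv output in forward (no sigmoid) and let the criterion handle it internally:

binary_crossentropy = nn.BCEWithLogitsLoss()  # applies the sigmoid internally; numerically more stable

def discriminator_loss(real_guess, fake_guess):
  # real_guess / fake_guess are raw logits; input comes first, target second
  loss_for_target = binary_crossentropy(real_guess, torch.ones_like(real_guess))
  loss_for_predicted = binary_crossentropy(fake_guess, torch.zeros_like(fake_guess))
  return loss_for_target + loss_for_predicted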

@Andrei_Cristea was faster! 🙂


Ah, thank you both @Andrei_Cristea and @ptrblck! This was definitely an issue of converting from TensorFlow without fully understanding the differences; TF has a from_logits argument in its BinaryCrossentropy class, while Torch provides two separate classes (nn.BCELoss and nn.BCEWithLogitsLoss). Similarly, the two APIs expect the arguments in opposite orders: TF’s losses take (y_true, y_pred), while Torch’s criteria take (input, target). It seems I should read the documentation more carefully rather than assuming that similarly named classes have the same calling conventions across the two APIs.
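For my own future reference, the correspondence boils down to this (a small sketch; note that the prediction comes first in Torch):

logits = torch.randn(4, 1)   # raw, unbounded model outputs
targets = torch.ones(4, 1)

# TF's BinaryCrossentropy(from_logits=True)(y_true, y_pred) corresponds to:
with_logits = nn.BCEWithLogitsLoss()(logits, targets)

# TF's BinaryCrossentropy(from_logits=False)(y_true, y_pred) corresponds to:
plain = nn.BCELoss()(torch.sigmoid(logits), targets)

print(torch.allclose(with_logits, plain))  # True, up to numerical precision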