I’ve only started using Torch very recently, so please forgive me if I’ve made some obvious error. I have been trying to convert an old TensorFlow notebook of mine to Torch, but I am running into what looks like an odd issue: when I call backward() on my network’s loss, it always produces NaN gradients, so the weights become NaN after a single optimization step. The forward pass itself works fine, and from stepping through the process manually, the NaN weights appear to originate solely in the Conv2d layer. Any help figuring out what the problem is would be greatly appreciated.
Because it’s practically impossible to debug something by description alone, here’s a tiny reproduction of the problematic section of my code.
import torch
import torch.nn as nn
import torch.optim as optim

device = "cuda"  # Fails for both "cuda" and "cpu" on a Colab environment
binary_crossentropy = nn.BCELoss()

def discriminator_loss(real_guess, fake_guess):
    loss_for_target = binary_crossentropy(torch.ones_like(real_guess), real_guess)
    loss_for_predicted = binary_crossentropy(torch.zeros_like(fake_guess), fake_guess)
    return loss_for_target + loss_for_predicted

with torch.autograd.detect_anomaly():
    class MiniModel(nn.Module):
        def __init__(self):
            super(MiniModel, self).__init__()
            # Single conv layer: 512 input channels, 1 output channel, 4x4 kernel, stride 1
            self.last = nn.Conv2d(512, 1, 4, 1)

        def forward(self, x):
            return self.last(x)

    mm = MiniModel().to(device)
    opt = optim.Adam(mm.parameters())

    # Two random input batches standing in for the real and fake samples
    inp1 = torch.rand((16, 512, 14, 14)).to(device)
    inp2 = torch.rand((16, 512, 14, 14)).to(device)

    out = discriminator_loss(mm.last(inp1), mm.last(inp2))
    print(out)

    grads = out.backward()
    opt.step()
    print(list(mm.parameters()))
tensor(100.1079, device='cuda:0', grad_fn=<AddBackward0>)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-52-e77288483237> in <module>()
     25     print(out)
     26
---> 27     grads = out.backward()
     28     opt.step()
     29     print(list(mm.parameters()))

/usr/local/lib/python3.7/dist-packages/torch/_tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    305                 create_graph=create_graph,
    306                 inputs=inputs)
--> 307         torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    308
    309     def register_hook(self, hook):

/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    154     Variable._execution_engine.run_backward(
    155         tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 156         allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
    157
    158

RuntimeError: Function 'CudnnConvolutionBackward0' returned nan values in its 1th output.
Interestingly (at least for me), when I use the cpu instead of the cuda device, detect_anomaly() does not trigger a RuntimeError, but the weights are still readjusted to NaNs:
[Parameter containing: tensor([[[[nan, nan, nan, nan], [nan, nan, nan, nan], [nan, nan, nan, nan], ...
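Since detect_anomaly() stays silent on the cpu, one way to see which parameters pick up NaN gradients is to inspect .grad after backward() and before opt.step(). Below is a minimal sketch of that idea; the helper name nan_grad_report is hypothetical, and the small layer and the mean() loss are stand-ins chosen only to exercise the helper, not my actual model or loss.

```python
import torch
import torch.nn as nn

def nan_grad_report(module: nn.Module) -> dict:
    """Map each parameter name to True if its .grad contains a NaN.

    Hypothetical helper for illustration; parameters without a
    gradient yet (p.grad is None) are skipped.
    """
    return {
        name: bool(torch.isnan(p.grad).any())
        for name, p in module.named_parameters()
        if p.grad is not None
    }

# Small stand-in layer and a benign loss, run on the CPU.
layer = nn.Conv2d(8, 1, 4, 1)
loss = layer(torch.rand(2, 8, 14, 14)).mean()
loss.backward()

# Inspect the gradients before any optimizer step is taken.
report = nan_grad_report(layer)
print(report)
```

Calling the same helper on mm right after out.backward() in the reproduction should show whether the weight, the bias, or both already hold NaN gradients before Adam touches them.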
As a final note, I have verified that the weights are not initially NaNs; I just didn’t think it necessary to include the code showing that here.
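For completeness, a check of that kind can be sketched as follows. This is an illustration rather than the exact code I ran: the helper name has_nan is hypothetical, and the layer is built with the same arguments as MiniModel.last.

```python
import torch
import torch.nn as nn

def has_nan(module: nn.Module) -> bool:
    """Return True if any parameter of `module` contains a NaN.

    Hypothetical helper for illustration only.
    """
    return any(torch.isnan(p).any().item() for p in module.parameters())

# Same constructor arguments as MiniModel.last in the reproduction.
conv = nn.Conv2d(512, 1, 4, 1)

# Check the freshly initialized weights, before any backward() or step().
print(has_nan(conv))
```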
Again, any help would be greatly appreciated; I’m really liking Torch’s API so far, but perhaps I should have started on a simpler model.