So if I understand you correctly, even perfectly valid inputs and models can result in Inf/NaN problems, meaning this does not necessarily indicate a bug in my code.
Right, except that I’ve run into a similar error in a bigger project that already uses torch.float32, so I’d like to figure out how to clamp the output to protect against such failures. Given:
def training_step(self, batch, batch_idx) -> STEP_OUTPUT:
    input, expected = batch
    actual = self(input)
    loss = self.loss_function(actual, expected)
    return loss
Is this the right way to go about clamping the output?
def training_step(self, batch, batch_idx) -> STEP_OUTPUT:
    input, expected = batch
    actual = self(input)
    loss = self.loss_function(actual, expected)
    limits = torch.finfo(torch.float16)  # representable float16 range: roughly ±65504
    return torch.clamp(loss, limits.min, limits.max)  # <--- clamp added
I tried this in the testcase but the problem remains.
You’re probably right, but I had to step deep into PyTorch’s code to figure out that the loss was being scaled up from 100 to ~600,000. Is there a hook I can register which will print out the layer values so I don’t have to do this in the future? The goal is to see all of the weight and gradient values from the debugging hooks.
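Something along these lines is what I have in mind, just to make it concrete (an untested sketch on my part; the helper name `attach_debug_hooks` is mine, and I'm assuming `register_forward_hook` on the leaf modules plus `Tensor.register_hook` on the parameters is the right combination):

```python
import torch


def attach_debug_hooks(model: torch.nn.Module) -> None:
    """Print per-layer output ranges on forward and weight/grad ranges on backward."""

    def make_forward_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                print(f"[forward] {name}: out min={output.min().item():.4g} "
                      f"max={output.max().item():.4g}")
        return hook

    def make_grad_hook(name, param):
        def hook(grad):
            print(f"[backward] {name}: weight min={param.data.min().item():.4g} "
                  f"max={param.data.max().item():.4g} | grad min={grad.min().item():.4g} "
                  f"max={grad.max().item():.4g}")
        return hook

    # Forward hooks on leaf modules report each layer's output values.
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:
            module.register_forward_hook(make_forward_hook(name))

    # Tensor hooks on the parameters fire as each gradient is computed during
    # backward(). Under native AMP these gradients would still be multiplied by
    # the GradScaler's scale factor at this point, if I understand the flow.
    for name, param in model.named_parameters():
        if param.requires_grad:
            param.register_hook(make_grad_hook(name, param))
```

I'd call `attach_debug_hooks(self)` once from the LightningModule's `__init__` and rip it out after debugging, since the printing is obviously very noisy. If there's a more idiomatic place in Lightning for this (`on_after_backward`, maybe?), I'm happy to use that instead.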
- Walking through the clamped testcase code, I see the loss is equal to `1.4138` at the end of the forward pass and somehow gets scaled up to `92655.7891` right before `backward()` gets invoked. Something smells wrong here. Why is such a small loss value (which is being clamped, no less!) being scaled up to such a large value that is out of range? I tracked this down to `native_amp.py` in PyTorch Lightning, where I see them scaling the loss up by a factor of 65536. From the looks of things, the value will always overflow… Any idea what could be going on?
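For concreteness, here is the arithmetic as I understand it (a rough sketch on my side, assuming Lightning is just deferring to `torch.cuda.amp.GradScaler`, whose default `init_scale` is 65536.0, i.e. 2**16):

```python
import torch

scale = 65536.0                          # the factor I saw applied in native_amp.py
loss = torch.tensor(1.4138)              # roughly the clamped loss at the end of the forward pass
print(loss * scale)                      # ~92655 -- about the value I see right before backward()
print(torch.finfo(torch.float16).max)    # 65504.0 -- the upper bound my clamp uses
```

So even though I clamp to the float16 maximum (65504), the value handed to `backward()` ends up above it anyway, which is what made me suspect the clamp can't help here.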
Thanks.