LayerNorm's grads become NaN after first epoch

So if I understand you correctly, even perfectly valid inputs and models can produce Inf/NaN values, meaning this does not necessarily indicate a bug in my code.

Right, except that I’ve run into a similar error in a bigger project that already uses torch.float32, so I’d like to figure out how to clamp the output to protect against such failures. Given:

def training_step(self, batch, batch_idx) -> STEP_OUTPUT:
    input, expected = batch

    actual = self(input)
    loss = self.loss_function(actual, expected)
    return loss

Is this the right way to go about clamping the output?

def training_step(self, batch, batch_idx) -> STEP_OUTPUT:
    input, expected = batch

    actual = self(input)
    loss = self.loss_function(actual, expected)
    limits = torch.finfo(torch.float16)
    return torch.clamp(loss, limits.min, limits.max) # <--- clamp added

I tried this in the test case, but the problem remains.

You’re probably right, but I had to step deep into PyTorch’s code to figure out that the loss was being scaled up from 100 to ~600,000. Is there a hook I can register that will print out the layer values so I don’t have to do this in the future? The goal is to see all of the weight and gradient values from the debugging hooks.
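
Something along these lines is what I have in mind (just a sketch on my part; attach_debug_hooks is a name I made up, and the hook points and print formatting are guesses):

import torch
from torch import nn

def attach_debug_hooks(model: nn.Module) -> None:
    """Print per-layer outputs and gradients during forward/backward (debugging only)."""
    for name, module in model.named_modules():
        if list(module.children()):
            continue  # only attach to leaf modules

        def forward_hook(mod, inputs, output, name=name):
            # Fires after the module's forward pass; shows the range of its output.
            if isinstance(output, torch.Tensor):
                print(f"[fwd]  {name}: min={output.min().item():.4g} max={output.max().item():.4g}")

        def backward_hook(mod, grad_input, grad_output, name=name):
            # Fires during backward; shows the range of the gradients flowing into the module.
            for g in grad_output:
                if g is not None:
                    print(f"[bwd]  {name}: grad min={g.min().item():.4g} max={g.max().item():.4g}")

        module.register_forward_hook(forward_hook)
        module.register_full_backward_hook(backward_hook)

    # Parameter-level hooks fire when the gradient for that specific weight is computed.
    for name, param in model.named_parameters():
        param.register_hook(
            lambda grad, name=name: print(f"[grad] {name}: norm={grad.norm().item():.4g}")
        )

I would then call attach_debug_hooks(self) once, e.g. at the start of training, and remove it after the issue is found. Does that sound like a reasonable approach, or is there a built-in mechanism for this?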

  1. Walking through the clamped test case, I see the loss is 1.4138 at the end of the forward pass, yet it gets scaled up to 92655.7891 right before backward() is invoked. Something smells wrong here: why is such a small loss value (which is being clamped, no less!) scaled up to a value that is out of range? I tracked this down to native_amp.py in PyTorch Lightning, where the loss is scaled up by a factor of 65536. From the looks of things, the value will always overflow… Any idea what could be going on?
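
For reference, here is the arithmetic as I understand it, assuming the 65536 factor comes from torch.cuda.amp.GradScaler's default init_scale of 2**16 (that assumption is mine):

import torch

init_scale = 2.0 ** 16                       # GradScaler's default init_scale (65536)
loss = torch.tensor(1.4138, dtype=torch.float32)

scaled_loss = loss * init_scale
print(scaled_loss.item())                    # ~92654.8, same order as the 92655.7891 I observed;
                                             # perfectly representable in float32

fp16_max = torch.finfo(torch.float16).max    # 65504.0
print(scaled_loss.item() > fp16_max)         # True: the scaled value no longer fits in float16

So if the scaled loss ever has to be represented in float16, any loss above roughly 65504 / 65536 ≈ 1.0 already overflows, which matches what I am seeing.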

Thanks.