Exploding NLLLoss

I am using the Transformer module provided by PyTorch to train a text-generation model, with NLLLoss() measuring the quality of the reconstruction. After a certain number of iterations, the loss explodes and all weights become nan. Here is the log generated by the training program:

root - WARNING - Loss: 203.81146240234375
root - WARNING - Loss: 124.32596588134766
root - WARNING - Loss: 52.12908172607422
root - WARNING - Loss: 47.63726043701172
root - WARNING - Loss: 39.1783561706543
root - WARNING - Loss: 30.14274024963379
root - WARNING - Loss: 33.377098083496094
root - WARNING - Loss: 9.493173539216099e+24

As you can see, the loss decreases for a while, as it should, and then spikes. I tried gradient clipping to mitigate the issue, but it did not solve the problem.

criterion_1 = nn.NLLLoss()
y_hat = model(X_train)
y_hat = y_hat.transpose(0, 1)
mask = (tgt != pad_idx).bool()
y_hat = nn.functional.log_softmax(y_hat, dim=-1)
cel = criterion_1(y_hat.reshape(-1, vocab_size), tgt.reshape(-1))
loss = cel.masked_select(mask.reshape(-1)).sum()
torch.nn.utils.clip_grad_value_(model.parameters(), 100)

The code above is what I use to calculate the loss.
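One thing worth checking in the snippet above: nn.NLLLoss() defaults to reduction="mean", so cel is already a scalar averaged over all tokens (padding included), and masked_select only broadcasts that scalar against the mask. Gradient clipping also only has an effect when it runs after loss.backward() and before optimizer.step(). This may or may not be related to the spike. Below is a minimal sketch of a per-token masked loss; the toy sizes, the stand-in model, and the use of clip_grad_norm_ with max_norm=1.0 are assumptions for illustration, not the original setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes, chosen only for illustration.
batch_size, max_len, vocab_size, pad_idx = 4, 6, 11, 0

# Stand-in for the Transformer: any module producing (batch, len, vocab) logits.
model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X_train = torch.randint(1, vocab_size, (batch_size, max_len))
tgt = X_train.clone()
tgt[:, -2:] = pad_idx  # pretend the last two positions are padding

# reduction="none" keeps one loss value per token so the mask can be applied.
criterion = nn.NLLLoss(reduction="none")

log_probs = torch.log_softmax(model(X_train), dim=-1)
per_token = criterion(log_probs.reshape(-1, vocab_size), tgt.reshape(-1))
mask = (tgt != pad_idx).reshape(-1)

# Average over real tokens only, so the scale does not depend on padding.
loss = per_token.masked_select(mask).mean()

optimizer.zero_grad()
loss.backward()
# Clip after backward() (gradients exist now) and before step().
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Passing ignore_index=pad_idx to nn.NLLLoss with the default mean reduction should give the same average without an explicit mask.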

Perhaps perfect predictors exist and training reaches a (1,0,0,…) state, i.e. an almost one-hot output distribution. y_hat = y_hat.clamp(-b, b), applied before the softmax with b around 10…20, should solve that.

For some reason, clamping the predictions causes the loss to increase after a certain point. This continues until some of the model weights become nan.

Actually, I suggested early clamping, and that’s tricky with log_softmax. Clamping after log_softmax (to the range (-20., -1e-6)) or an additional loss mask may work instead. Or it is something else; I’d place a breakpoint and inspect the problematic network output.
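To make the post-log_softmax clamping concrete, here is a small sketch with made-up logits in which one class dominates and the target is a class the model considers nearly impossible. The tensor values are hypothetical; only the clamp range comes from the suggestion above.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits where one class dominates, as in the (1,0,0,…) case.
logits = torch.tensor([[200.0, 0.0, -200.0]])
tgt = torch.tensor([2])  # the class the model considers nearly impossible

log_probs = torch.log_softmax(logits, dim=-1)
# Bound the log-probabilities to [-20, -1e-6].
clamped = log_probs.clamp(min=-20.0, max=-1e-6)

raw_loss = F.nll_loss(log_probs, tgt)
clamped_loss = F.nll_loss(clamped, tgt)
print(raw_loss.item(), clamped_loss.item())
```

Here the raw loss is 400 while the clamped loss is capped at 20. Note that clamp passes zero gradient for any element outside the range, so a clamped token stops contributing to the update; this bounds the loss value but does not by itself explain or fix a NaN that originates elsewhere.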

A few things to check before trying gradient clipping:

  1. What does your input data look like? Make sure it’s in the form you would expect. Sometimes unnormalized input can cause huge loss values.
  2. What optimizer and lr are you using?
  3. I’m not sure it’s a good idea to sum the loss values before loss.backward().

  1. The input data has shape (batch_size, max_len) and the output has shape (batch_size, max_len, vocab_size).
  2. I am using the Adam optimizer with an lr of 0.001.
  3. I tried training the model with the mean loss instead of the sum, and I still get a spike in the loss.

I tried clamping the output after log_softmax. This is the log generated:

root - WARNING - Loss: 7753.49169921875
root - WARNING - Loss: 6842.0830078125
root - WARNING - Loss: 13360.0
root - WARNING - Loss: 13180.0
root - WARNING - Loss: nan

The NLLLoss becomes nan after a few batches.

A few things you can check:

  1. Ensure that this output looks like what you would expect (i.e. its scale is the same as tgt).
  2. I’m not sure what the masked_select part is doing.

Could the variable loss be removed and cel.backward() simply be used instead?

I usually go for criterion = nn.CrossEntropyLoss() to avoid confusion.
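For reference, nn.CrossEntropyLoss applied to raw logits computes the same quantity as log_softmax followed by nn.NLLLoss, so it is also a negative log-likelihood. The sketch below checks this on random data; the sizes and pad_idx value are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, pad_idx = 7, 0              # hypothetical values
logits = torch.randn(10, vocab_size)    # raw scores, no softmax applied
tgt = torch.randint(0, vocab_size, (10,))
tgt[0] = 1         # make sure at least one target is not padding
tgt[-1] = pad_idx  # make sure at least one target is padding

# CrossEntropyLoss takes logits directly and skips pad targets via ignore_index.
ce = nn.CrossEntropyLoss(ignore_index=pad_idx)(logits, tgt)

# The same loss written as explicit log_softmax + NLLLoss.
nll = nn.NLLLoss(ignore_index=pad_idx)(torch.log_softmax(logits, dim=-1), tgt)

print(ce.item(), nll.item())
```

Both values should agree up to floating-point error, so the choice between the two is mostly about whether the log-probabilities are needed elsewhere in the model.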

masked_select removes the loss terms corresponding to the <pad> token. I’m building an architecture similar to a variational autoencoder, which uses the log-likelihood as its loss; that is why I used NLLLoss rather than CrossEntropyLoss.

Then it is something else, probably. If you use sampling from trainable distributions, the issue can be there.

Generally, torch.autograd.set_detect_anomaly(True) should show the problematic (NaN-inducing) part. As it slows down training, it is better to enable it late.
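As a rough illustration of what anomaly detection reports, the sketch below builds a tiny computation whose backward pass produces NaN (the gradient of sqrt at 0 multiplied by a zero upstream gradient). The computation is a made-up example, not the model from this thread.

```python
import torch

x = torch.tensor([0.0], requires_grad=True)

torch.autograd.set_detect_anomaly(True)
caught = False
try:
    # Forward is fine (result is 0), but backward evaluates 0 / (2 * sqrt(0)) = nan.
    y = (torch.sqrt(x) * 0.0).sum()
    y.backward()
except RuntimeError as err:
    caught = True
    print("anomaly detected:", err)
finally:
    # Turn it off again, since it slows down every backward pass.
    torch.autograd.set_detect_anomaly(False)
```

With anomaly detection on, backward() raises a RuntimeError naming the backward function that returned NaN, along with a traceback of the forward call that created it. In a real training loop, one option is to enable it only once the loss starts to look suspicious.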