CTC loss produced 'Nan' when using spatial features from AlexNet

tom · June 13, 2020, 11:46am

@ptrblck's suggestion is probably the first thing to check. If that is the cause, it means that for some of your data, the output sequence of your AlexNet is too short to produce the target sequence.

If that doesn’t help by itself:

Be sure to use the latest PyTorch version.
If you want to find out what is going on, you can do the following:
- at the end of each training step (after optimizer.step()), check whether any weight contains a NaN,
- when a weight has turned NaN (which means that the gradient was NaN), look at the last inputs into the CTC loss,
- check whether it happens on CPU as well.

Note that you need to check the weights for NaN, not the loss. Very likely the gradients go NaN first, the optimizer step then writes NaN into the weights, and you only see the loss turn NaN in the next iteration.
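The weight check described above could look roughly like the sketch below. The helper name `find_nan_parameters` and the small `nn.Linear` stand-in model are made up for illustration; in your code you would call the helper on your own model right after optimizer.step() and, if it reports anything, save or print the batch that was just fed into the CTC loss.

```python
from typing import List

import torch
from torch import nn


def find_nan_parameters(model: nn.Module) -> List[str]:
    """Return the names of all parameters that contain at least one NaN."""
    return [
        name
        for name, param in model.named_parameters()
        if torch.isnan(param).any()
    ]


# Stand-in model, only here to make the sketch runnable on its own.
model = nn.Linear(4, 2)
print(find_nan_parameters(model))  # no NaN in a freshly initialised layer

# Simulate what a NaN gradient followed by optimizer.step() would do.
with torch.no_grad():
    model.weight[0, 0] = float("nan")
print(find_nan_parameters(model))  # now reports the affected parameter
```

Running the same check on both the CPU and the GPU version of the training loop also tells you whether the problem is device specific.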

Things to look out for:

- targets must not contain the blank label,
- input length should be larger than target length + number of repeat positions (+1? I can’t remember).
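A minimal sketch of both checks for a single sample, in plain Python. The function names and the default `blank=0` are assumptions for illustration. As far as I understand CTC, each label needs one time step and each pair of equal neighbouring labels needs one extra blank step between them, so the sketch uses target length + number of repeats as the minimum input length, without an additional +1.

```python
from typing import List, Sequence


def min_ctc_input_length(target: Sequence[int]) -> int:
    """Shortest input length for which CTC can still emit `target`."""
    # Count positions where a label equals its predecessor; each of these
    # needs a blank in between, i.e. one additional time step.
    repeats = sum(1 for a, b in zip(target, target[1:]) if a == b)
    return len(target) + repeats


def check_ctc_sample(
    target: Sequence[int], input_length: int, blank: int = 0
) -> List[str]:
    """Return a list of human-readable problems found for one sample."""
    problems = []
    if blank in target:
        problems.append(f"target contains the blank label {blank}")
    needed = min_ctc_input_length(target)
    if input_length < needed:
        problems.append(
            f"input length {input_length} is shorter than the required {needed}"
        )
    return problems


print(check_ctc_sample([3, 3, 5], input_length=3))  # too short: one repeat
print(check_ctc_sample([3, 3, 5], input_length=4))  # no problems
```

With padded tensor targets you would first cut each row to its target length and convert it with `.tolist()` before passing it in, and use the per-sample length of the network output as `input_length`.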

Best regards

Thomas