For some reason the output contains NaN values. Could you break out of the training loop after you’ve encountered the first NaN and check all parameters of the model?
E.g. you could print their abs().max() via:

```python
for name, param in model.named_parameters():
    print(name, param.abs().max())
```
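For reference, a minimal sketch of what breaking out at the first NaN could look like (the model, optimizer, criterion, and data below are placeholders for illustration; substitute your own training setup):

```python
import torch
import torch.nn as nn

# Placeholder setup for illustration only
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
data = torch.randn(32, 10)
target = torch.randn(32, 2)

for epoch in range(100):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    # Stop at the first NaN/Inf loss so the current state can be inspected
    if not torch.isfinite(loss):
        print(f"non-finite loss in epoch {epoch}: {loss.item()}")
        break
    loss.backward()
    optimizer.step()
```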
If this looks alright, you could repeat the last forward iteration (since the input contains valid values) and check all intermediate activations to narrow down which layer creates the NaN outputs, using forward hooks as described here.
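A minimal forward-hook sketch might look like this (the model is a placeholder; the isinstance check is there because some modules return tuples rather than tensors, which would need extra handling):

```python
import torch
import torch.nn as nn

# Placeholder model for illustration
model = nn.Sequential(
    nn.Linear(10, 10),
    nn.ReLU(),
    nn.Linear(10, 2),
)

def make_nan_hook(name):
    def hook(module, inputs, output):
        # Report modules whose output contains NaNs
        if isinstance(output, torch.Tensor) and torch.isnan(output).any():
            print(f"NaN in output of {name} ({module.__class__.__name__})")
    return hook

# Register a hook on every submodule so each layer's output is checked
for name, module in model.named_modules():
    module.register_forward_hook(make_nan_hook(name))

# Replay the last (valid) input to see which layer first produces NaNs
x = torch.randn(1, 10)
out = model(x)
```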