FP16 gives NaN loss when using pre-trained model

You could register forward hooks for each module and check the outputs for invalid values. Once you have isolated the first layer that produces them, you can inspect its inputs and parameters to narrow down the cause further.
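
A minimal sketch of such a hook setup, assuming a generic model (the small `nn.Sequential` used below is just a placeholder for your pre-trained network):

```python
import torch
import torch.nn as nn

def register_nan_hooks(model):
    """Attach a forward hook to every submodule that raises as soon as
    a module's output contains NaN or Inf values."""
    def make_hook(name):
        def hook(module, inputs, output):
            # Some modules return tuples; check every tensor output.
            outs = output if isinstance(output, tuple) else (output,)
            for out in outs:
                if isinstance(out, torch.Tensor) and not torch.isfinite(out).all():
                    raise RuntimeError(
                        f"Non-finite output in '{name}' ({module.__class__.__name__})"
                    )
        return hook

    handles = []
    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles

# Placeholder model in FP16 to illustrate the workflow
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1)).half()
handles = register_nan_hooks(model)

x = torch.randn(4, 8).half()
out = model(x)  # raises inside the first layer that produces NaN/Inf

# Remove the hooks once you are done debugging
for h in handles:
    h.remove()
```

Once the error points at a specific module, you can check `torch.isfinite(x).all()` on its inputs and on `module.weight` / `module.bias` to see whether the invalid values come from the activations or from the parameters themselves.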