DNN performance suddenly crashes

It goes from 60% training acc to near 0. I suspect some kind of floating point error occurred. Is there a way to check this / prevent this?

You may want to inspect the model outputs and intermediate activations for NaN values, e.g. via torch.isnan (torch.isnan — PyTorch 1.11.0 documentation).
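Here is a minimal sketch of how such a check could look; the model, layer sizes, and hook function are placeholders rather than your actual network:

```python
import torch
import torch.nn as nn

def nan_hook(module, inputs, output):
    # Forward hook: flag non-finite activations as soon as they appear in a layer.
    if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
        raise RuntimeError(f"Non-finite activation in {module.__class__.__name__}")

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))  # placeholder model
for layer in model.modules():
    layer.register_forward_hook(nan_hook)

x = torch.randn(8, 16)          # placeholder batch
out = model(x)
print(torch.isnan(out).any())   # quick check on the final outputs
```

If the NaNs originate in the backward pass rather than the forward pass, wrapping the training step in torch.autograd.set_detect_anomaly(True) can also help pinpoint the offending operation, at the cost of slower training.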

You might also want to log your loss function - in my experience just logging the epoch-wise loss helps. One simple case where this may happen is if you choose a really high learning rate: the loss explodes to inf, which then turns into NaN in subsequent operations (e.g. inf - inf or inf * 0), and once NaNs propagate into the weights the accuracy collapses.
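A rough sketch of an epoch loop that logs the mean loss and aborts on a non-finite value; the model, loader, criterion, and optimizer are whatever you already use, the names here are just placeholders:

```python
import torch

def train_one_epoch(model, loader, criterion, optimizer, epoch):
    running_loss, n_batches = 0.0, 0
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        if not torch.isfinite(loss):
            # Stop immediately instead of letting inf/NaN propagate into the weights.
            raise RuntimeError(f"Loss became {loss.item()} at epoch {epoch}")
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        n_batches += 1
    print(f"epoch {epoch}: mean loss = {running_loss / n_batches:.4f}")
```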

Try decreasing the learning rate when the loss stops improving for a few epochs, e.g. with torch.optim.lr_scheduler.ReduceLROnPlateau (ReduceLROnPlateau — PyTorch 1.11.0 documentation).
You can also stop the training early once the validation loss stops improving.
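Something along these lines, where the model and the validate function are only stand-ins for your own code:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 10)                                # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5
)

def validate(model):
    # Placeholder: compute and return your real validation loss here.
    return torch.rand(1).item()

best_val, stale_epochs = float("inf"), 0
for epoch in range(100):
    # ... training step for this epoch would go here ...
    val_loss = validate(model)
    scheduler.step(val_loss)            # lowers the LR when val_loss plateaus
    if val_loss < best_val:
        best_val, stale_epochs = val_loss, 0
    else:
        stale_epochs += 1
        if stale_epochs >= 10:          # simple early stopping
            break
```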