DNN performance suddenly crashes

It goes from 60% training acc to near 0. I suspect some kind of floating point error occurred. Is there a way to check this / prevent this?

You may want to inspect the model outputs and intermediate activations for NaN values, e.g. via torch.isnan (torch.isnan — PyTorch 1.11.0 documentation).
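Here is a minimal sketch of how such a check could look; the model, layer sizes, and hook function are placeholders rather than your actual network:

```python
import torch
import torch.nn as nn

def nan_hook(module, inputs, output):
    # Forward hook: flag non-finite activations as soon as they appear in a layer.
    if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
        raise RuntimeError(f"Non-finite activation in {module.__class__.__name__}")

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))  # placeholder model
for layer in model.modules():
    layer.register_forward_hook(nan_hook)

x = torch.randn(8, 16)          # placeholder batch
out = model(x)
print(torch.isnan(out).any())   # quick check on the final outputs
```

If the NaNs originate in the backward pass rather than the forward pass, wrapping the training step in torch.autograd.set_detect_anomaly(True) can also help pinpoint the offending operation, at the cost of slower training.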

You might also want to log your loss function - in my experience just logging the epoch-wise loss helps. One simple case where this may happen is if you choose a really high learning rate: the loss explodes to inf, which then turns into NaN in subsequent operations (e.g. inf - inf or inf * 0), and once NaNs propagate into the weights the accuracy collapses.
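A rough sketch of an epoch loop that logs the mean loss and aborts on a non-finite value; the model, loader, criterion, and optimizer are whatever you already use, the names here are just placeholders:

```python
import torch

def train_one_epoch(model, loader, criterion, optimizer, epoch):
    running_loss, n_batches = 0.0, 0
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        if not torch.isfinite(loss):
            # Stop immediately instead of letting inf/NaN propagate into the weights.
            raise RuntimeError(f"Loss became {loss.item()} at epoch {epoch}")
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        n_batches += 1
    print(f"epoch {epoch}: mean loss = {running_loss / n_batches:.4f}")
```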

Try decreasing the learning rate when the loss stops improving for a few epochs, e.g. with torch.optim.lr_scheduler.ReduceLROnPlateau (ReduceLROnPlateau — PyTorch 1.11.0 documentation).
You can also stop the training early once the validation loss stops improving.
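Something along these lines, where the model and the validate function are only stand-ins for your own code:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 10)                                # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5
)

def validate(model):
    # Placeholder: compute and return your real validation loss here.
    return torch.rand(1).item()

best_val, stale_epochs = float("inf"), 0
for epoch in range(100):
    # ... training step for this epoch would go here ...
    val_loss = validate(model)
    scheduler.step(val_loss)            # lowers the LR when val_loss plateaus
    if val_loss < best_val:
        best_val, stale_epochs = val_loss, 0
    else:
        stale_epochs += 1
        if stale_epochs >= 10:          # simple early stopping
            break
```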