Hi, I have the following code snippet to train my model… It works for some time and then the following error is thrown. I am unable to debug it. I have read all the solutions posted for the same error, and none of them help.
It starts to train … I’m showing the actual variables that go into the loss function (they are compatible in dimensions and in other respects, so I still don’t know why this error occurs).
The line that is causing the problem is the one that prints cord3; sometimes PyTorch doesn’t know how to properly turn it into a string. Maybe it contains NaNs, I don’t know.
I can only suggest that you remove that line and either everything else will work, or you will get nan loss, or you could get a more informative error.
That doesn’t help, it is still the same… I also tried checking whether cord3 (and the other variables) are NaN just before the crash, by doing cord3 != cord3, and they don’t become NaN.
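For reference, a minimal sketch of such a check. The comparison cord3 != cord3 only flags NaN, so this version also counts inf values. The helper name report_non_finite and the stand-in values for cord3 are hypothetical, not taken from the original script.

```python
import torch


def report_non_finite(name, tensor):
    """Count NaN and inf entries in a tensor; print a note if any are found."""
    # Copy to the CPU so the counts are ordinary Python ints.
    t = tensor.detach().float().cpu()
    nan_count = int(torch.isnan(t).sum())
    inf_count = int(torch.isinf(t).sum())
    if nan_count or inf_count:
        print(f"{name}: {nan_count} NaN, {inf_count} inf values")
    return nan_count, inf_count


# Stand-in values for illustration only; the real cord3 comes from the training loop.
cord3 = torch.tensor([0.5, float("nan"), float("inf")])
counts = report_non_finite("cord3", cord3)
```

Calling this on each tensor right before the loss function would show whether a non-finite value, rather than only a NaN, reaches the loss.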
Because CUDA calls are asynchronous, the stack trace sometimes points to the wrong line of code. Could you try running your script with the following command: CUDA_LAUNCH_BLOCKING=1 python script.py and post the stack trace again?
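If changing the launch command is inconvenient (for example in a notebook), the same variable can be set from Python. This is a sketch that assumes the assignment runs at the very top of the script, before anything initializes CUDA; setting it later is not expected to have an effect.

```python
import os

# Set this before CUDA is initialized; the simplest way is to do it
# before importing torch at all.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch (and build the model) only after this point
```

With blocking launches each kernel finishes before the next Python line runs, so the stack trace should point at the operation that actually failed. Training will be slower, so this is meant for debugging only.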