Hi, I have the following code snippet to train my model… It works for some time and then the following error is thrown. I am unable to debug it; I have read all the solutions for the same error, but none of them help.
It starts to train… I'm showing the actual variables that go into the loss function (they are compatible dimension-wise and in other respects; I still don't know why this error occurs).
Any clues what to do?
The line that is causing the problem is the one that prints
cord3; sometimes PyTorch doesn't know how to properly turn a tensor into a string. Maybe it contains NaNs, I don't know.
I can only suggest that you remove that line; then either everything else will work, or you will get a NaN loss, or you will get a more informative error.
Hey, thanks for the reply.
I tried removing it, but I am now getting a different (though similar) error.
It looks like the CUDA code is doing things out of order and not properly coping with the synchronisation.
You can try adding
torch.cuda.synchronize() before the training loop. I'm only guessing; I don't have a GPU, so I can't say whether it will help.
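A minimal sketch of where that call might go, assuming a standard training loop (the `model`, `loader`, `optimizer`, and `loss_fn` names here are hypothetical placeholders, not from your code):

```python
import torch

def train(model, loader, optimizer, loss_fn, epochs=1):
    """Hypothetical loop showing where the synchronize call would sit."""
    if torch.cuda.is_available():
        # Block until all queued CUDA kernels have finished before we start,
        # so any earlier asynchronous error surfaces here, not mid-loop.
        torch.cuda.synchronize()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
```

On a CPU-only machine the guard simply skips the call, so the sketch is safe to run anywhere.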
doesn't help, still the same… I also tried checking whether cord3 (and the other variables) become NaN just before the crash (using
cord3 != cord3), and it doesn't become NaN.
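Here's a minimal version of the check I'm doing. It relies on NaN being the only value that compares unequal to itself; for the actual tensor I reduce the elementwise result with `.any()` (e.g. `(cord3 != cord3).any()`, where `cord3` is the tensor from my code):

```python
import math

x = float("nan")
# NaN is the only float that is not equal to itself,
# so `x != x` is an idiomatic NaN test.
assert x != x
assert math.isnan(x)

y = 1.0
assert not (y != y)  # ordinary values compare equal to themselves
```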
Because of the asynchronous CUDA calls, the stack trace sometimes points to the wrong line of code. Could you try running your script with the following command and post the stack trace again?
CUDA_LAUNCH_BLOCKING=1 python script.py
Is there a way to run it in a notebook, or do I have to make it a script and run that?
Oh, I usually don’t run notebooks, so just try to export it as a script. Would that work?
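In case it helps, another option (again a guess on my part, I haven't verified it on a GPU) is to set the variable from Python in the very first notebook cell, before anything touches CUDA:

```python
import os

# Hypothetical first notebook cell: the variable must be set before the
# first CUDA call (i.e. before `import torch` and any GPU work in later
# cells), otherwise it has no effect on the already-created CUDA context.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```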
Thanks, I was able to debug it. Thank you!