Sometimes it works, sometimes it doesn't

uddeshya_upadhyay · March 17, 2018, 4:09pm

Hi, I have the following code snippet to train my model… It works for some time and then the following error is thrown. I am unable to debug it. I have read all the solutions for the same error, but they don’t help.

It starts to train… I’m showing the actual variables that go into the loss function (they are compatible dimension-wise and in other respects, so I still don’t know why this error occurs).

Any clues what to do?

jpeg729 · March 17, 2018, 5:41pm

The line that is causing the problem is the one that prints cord3. Sometimes PyTorch doesn’t know how to properly turn it into a string. Maybe it contains NaNs, I don’t know.

I can only suggest that you remove that line. Then either everything else will work, or you will get a NaN loss, or you may get a more informative error.

uddeshya_upadhyay · March 17, 2018, 5:56pm

Hey, thanks for the reply.
I tried removing it, but now I am getting a different (though similar) error.

Any clues?

jpeg729 · March 17, 2018, 6:02pm

It looks like the CUDA code is running operations out of order and not coping properly with the synchronisation.

You can try adding torch.cuda.synchronize() before the training loop. I’m only guessing, I don’t have a GPU so I can’t say whether it will help.

uddeshya_upadhyay · March 17, 2018, 6:25pm

That doesn’t help, it’s still the same… I also tried checking whether cord3 (and the other variables) is NaN just before the crash by evaluating cord3 != cord3, and it doesn’t become NaN.
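For readers unfamiliar with the self-comparison check mentioned above: it relies on NaN being the only floating-point value that is not equal to itself. Here is a minimal sketch in plain Python. The helper name contains_nan is hypothetical and not from the original code; for a tensor, (cord3 != cord3).any() or torch.isnan(cord3).any() would play the same role.

```python
def contains_nan(values):
    """Return True if any float in `values` is NaN.

    Uses the `v != v` trick: NaN is the only float that compares
    unequal to itself, so the comparison is True only for NaN.
    """
    return any(v != v for v in values)


# Hypothetical sample data standing in for the values being checked.
clean = [0.5, 1.0, float("inf")]
dirty = [0.5, float("nan"), 1.0]

print(contains_nan(clean))
print(contains_nan(dirty))
```

Note that infinities pass this check, so it only rules out NaN, not other invalid values.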

ptrblck · March 17, 2018, 7:44pm

Because CUDA calls are asynchronous, the stack trace sometimes points to the wrong line of code. Could you try running your script with the following command and post the stack trace again?

CUDA_LAUNCH_BLOCKING=1 python script.py
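The VAR=value prefix in that command sets the variable only for the process being launched. If the script has to be started from Python rather than from a shell, roughly the same effect can be had with subprocess. This is only a sketch: the child command below is a stand-in that echoes the variable, and in practice it would be replaced by the real script path.

```python
import os
import subprocess
import sys

# Copy the current environment and add the flag for the child process only.
env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")

# Stand-in for `python script.py`: the child just reports the value it sees.
child_code = "import os; print(os.environ.get('CUDA_LAUNCH_BLOCKING'))"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env,
    capture_output=True,  # requires Python 3.7+
    text=True,
    check=True,
)

seen_by_child = result.stdout.strip()
print(seen_by_child)
```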

uddeshya_upadhyay · March 17, 2018, 7:45pm

Is there a way to run it in a notebook, or do I have to make a script and run it?

ptrblck · March 17, 2018, 7:46pm

Oh, I usually don’t run notebooks, so just try to export it as a script. Would that work?
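For anyone who wants to stay in the notebook, one commonly used alternative is to set the variable from Python in the very first cell. This is a sketch under the assumption that the kernel has just been restarted and nothing has touched CUDA yet; the variable is read when CUDA is initialised, so setting it later is likely to have no effect.

```python
import os

# Set the flag before torch is imported and before any CUDA work happens,
# so that kernel launches run synchronously and errors surface at the
# line that actually caused them.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import only after the variable has been set

print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Synchronous launches slow training down, so it is worth removing the line again once the failing operation has been found.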

uddeshya_upadhyay · March 17, 2018, 8:22pm

Thanks, I was able to debug it. Thank you!