Sometimes it works, sometimes it doesn't

Hi, I have the following code snippet to train my model. It works for some time and then the following error is thrown. I am unable to debug it; I have read all the solutions for the same error, and none of them help.

It starts to train. I’m showing the actual variables that go into the loss function (they are compatible dimension-wise and in other respects, so I still don’t know why this error occurs).

Any clues what to do?

The line causing the problem is the one that prints cord3; sometimes PyTorch doesn’t know how to properly turn it into a string. Maybe it contains NaNs, I don’t know.

I can only suggest that you remove that line. Then either everything else will work, or you will get a NaN loss, or you might get a more informative error.

Hey, thanks for the reply.
I tried removing it, but I am still getting an error, a different (but similar) one now.

Any clues?

It looks like the CUDA code is running things out of order and not coping properly with the synchronisation.

You can try adding torch.cuda.synchronize() before the training loop. I’m only guessing, I don’t have a GPU so I can’t say whether it will help.

Doesn’t help, still the same. I also tried checking whether cord3 (and the other variables) are NaN just before the crash by doing cord3 != cord3, and none of them become NaN.
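For anyone repeating this check, here is a minimal sketch of the NaN test on a tensor. The values in cord3 below are made up for illustration (the real tensor comes from the training code, which isn't shown here), and it runs on the CPU. Note that cord3 != cord3 gives an elementwise result, so it needs .any() to become a single yes/no answer, and torch.isfinite additionally catches inf values that the != trick misses.

```python
import torch

# Hypothetical stand-in for cord3; the real tensor comes from the model.
cord3 = torch.tensor([0.5, float("nan"), 2.0, float("inf")])

# NaN is the only value that is not equal to itself, so this mask marks NaNs.
has_nan_manual = bool((cord3 != cord3).any())

# Equivalent check using the built-in helper.
has_nan = bool(torch.isnan(cord3).any())

# isfinite is False for NaN and for +/-inf, so it is a slightly stricter check.
all_finite = bool(torch.isfinite(cord3).all())

print(has_nan_manual, has_nan, all_finite)
```

If both NaN checks come back False and the crash still happens, the values themselves are probably not the cause, which matches what was observed here.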

Because CUDA calls are asynchronous, the stack trace sometimes points to the wrong line of code. Could you try running your script with the following command: CUDA_LAUNCH_BLOCKING=1 python and post the stack trace again?


Is there a way to run it in a notebook, or do I have to make a script and run it?

Oh, I usually don’t run notebooks, so just try to export it as a script. Would that work?
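If exporting is inconvenient, one possible alternative inside a notebook is to set the environment variable from Python. This is only a sketch and rests on an assumption: the variable has to be set before CUDA is initialised in that process, so the cell should run first, right after restarting the kernel and before anything touches the GPU.

```python
import os

# Set this in the very first cell, after restarting the kernel and before
# importing torch or running any CUDA code; setting it later is assumed
# to have no effect on an already-initialised CUDA context.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Confirm the value that the rest of the process will see.
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

With blocking launches enabled, training runs slower, but the stack trace should point at the operation that actually failed. Remove the setting again once the bug is found.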


Thanks, I was able to debug it. Thank you.
