loss.backward() deadlocks for me quite frequently (using CPU, no distributed mode). Unfortunately, I don’t have root access or access to gdb. Is there any way to still debug or trace this?
Is it possible to enable any tracing / logging of autograd backward?
You could try setting the environment variable via export TORCH_SHOW_CPP_STACKTRACES=1, running your script until it hangs, and then killing it, e.g. by sending SIGHUP. You might be able to see the stack traces in the terminal, which could point to the hanging line of code. I haven’t tried this approach myself, as gdb is available in my setup, but it might be worth a try.
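If the C++ stack traces don't show up, another option that needs neither root nor gdb might be Python's built-in faulthandler module. The sketch below is only an illustration: the helper names install_hang_diagnostics and clear_hang_diagnostics are made up for this post, and the timeout is an arbitrary guess. It dumps the Python traceback of every Python thread either on demand (SIGUSR1, Unix only) or automatically once the timeout expires.

```python
import faulthandler
import signal
import sys


def install_hang_diagnostics(timeout_s=300.0, out=None):
    """Arrange for Python tracebacks to be dumped if the process hangs.

    Hypothetical helper: `timeout_s` and `out` are illustrative parameters.
    `out` must be a real file object (it needs a file descriptor) and must
    stay open for as long as the diagnostics are installed.
    """
    if out is None:
        out = sys.stderr
    # On-demand dump: run `kill -USR1 <pid>` from another shell (Unix only).
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=out, all_threads=True)
    # Watchdog: dump all thread tracebacks once if timeout_s seconds pass
    # before clear_hang_diagnostics() is called. The process keeps running.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=out, exit=False)


def clear_hang_diagnostics():
    """Cancel the watchdog and remove the SIGUSR1 handler."""
    faulthandler.cancel_dump_traceback_later()
    if hasattr(signal, "SIGUSR1"):
        faulthandler.unregister(signal.SIGUSR1)


# Intended usage around the suspicious call (not executed here):
#   install_hang_diagnostics(timeout_s=120.0)
#   loss.backward()
#   clear_hang_diagnostics()
```

Note that faulthandler only reports Python frames. If the hang sits inside the C++ autograd engine, the dump would at most confirm that the main thread is stuck in the loss.backward() call and show what any other Python threads are doing at that moment.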