Hello. I am training my model on a single GPU (on a remote server), and the script hangs and crashes at random without any error output. When I print the sizes of some tensors in the loss function, the script seems to run normally. GPT-4o says I can add torch.cuda.synchronize() after the CUDA operations and add flush=True in the print function, but it does not help. Besides, if I run the script with mprof run, it also runs normally. The script is just like Schrödinger's cat: if I try to observe it, the crash does not occur.
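For concreteness, the instrumentation described above might look like the following minimal sketch. The helper name checkpoint and the trace format are hypothetical and are not taken from the actual training script; the only PyTorch calls used are torch.cuda.is_available() and torch.cuda.synchronize(), and the import is guarded so the snippet also runs on a machine without PyTorch.

```python
try:
    import torch
except ImportError:  # assumption: lets the sketch run where PyTorch is absent
    torch = None


def checkpoint(tag, **shapes):
    """Wait for pending GPU work, then print a flushed trace line.

    `tag` and `shapes` are illustrative names, not part of any library API.
    """
    if torch is not None and torch.cuda.is_available():
        # Block until all queued CUDA kernels have finished, so the trace
        # line is printed only after the preceding GPU work has completed.
        torch.cuda.synchronize()
    line = f"[trace] {tag} " + " ".join(f"{k}={v}" for k, v in shapes.items())
    # flush=True pushes the line out of Python's stdout buffer immediately.
    print(line, flush=True)
    return line
```

Inside the loss function this would be called as, for example, checkpoint("loss", pred=tuple(pred.shape)), where pred stands for whichever tensor is being inspected.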
How can I find out what's wrong with my script?
PS: When the script gets stuck, the SSH client disconnects automatically and I cannot reconnect to the server for a short time.