The PyTorch script gets stuck and exits without any error

Smith_Jack · April 29, 2025, 7:48am

Hello. I am training my model on a single GPU (on a remote server), and the script randomly gets stuck and crashes without any error output. When I print the sizes of some tensors in the loss function, the script seems to run normally. GPT-4o suggested adding torch.cuda.synchronize() after the CUDA operations and flush=True to the print calls, but that does not help. Also, if I run the script with mprof run, it runs normally too. The script is just like Schrödinger's cat: if I try to observe it, the crash does not occur.
How can I find out what’s wrong with my script?

PS: When the script gets stuck, the SSH client exits automatically and I cannot reconnect to the server for a short time.

albanD · April 30, 2025, 7:00pm

Hey!

What you’re describing sounds like a machine that runs out of RAM. It hangs for a while trying to recover and then starts killing processes the hard way.
Can you check that RAM usage is ok on your machine?
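
One way to check this from inside the training script is to log memory usage to a file and force each sample to disk, so the last samples survive even if the process is killed abruptly. The following is only a rough sketch, assuming a Linux machine; the names log_memory, read_meminfo and mem_log.txt are made up for illustration and are not part of PyTorch.

```python
import os
import resource
import time


def read_meminfo(path="/proc/meminfo"):
    """Parse a Linux /proc/meminfo-style file into {field: int}.

    Most fields (MemTotal, MemAvailable, ...) are reported in kB.
    """
    info = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            parts = rest.split()
            if parts and parts[0].isdigit():
                info[key.strip()] = int(parts[0])
    return info


def log_memory(tag, log_path="mem_log.txt"):
    """Append one memory sample to log_path and force it to disk."""
    # Peak resident set size of this process (kilobytes on Linux).
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    try:
        available = read_meminfo().get("MemAvailable")
    except OSError:
        available = None  # /proc/meminfo not readable (e.g. not Linux)
    line = f"{time.time():.0f} {tag} peak_rss_kb={peak_rss} mem_available_kb={available}\n"
    with open(log_path, "a") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())  # make sure the sample survives a hard kill
    return peak_rss, available
```

Calling something like log_memory(f"step {i}") every few iterations of the training loop and reading the tail of the log after a crash should show whether available memory was dropping toward zero. If the kernel's out-of-memory killer was involved, the kernel log (for example via dmesg) usually contains an "Out of memory" message as well.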