I’m experiencing an intermittent issue with my GeForce RTX 3090. The issue was occurring with the nightly builds before 1.7 was released and with the 1.7 release itself. It was also occurring with NVIDIA driver 455.23 and after updating to 455.32.
For the release version, after installing 455.32, I installed PyTorch with:
Sometimes the Python process hangs after, say, the ten-thousandth iteration, or the five-millionth. The issue occurs regardless of the network I am training (GAN, CNN, etc.).
I don’t know that this is a PyTorch issue, but I am hoping for suggestions on how to investigate what might be causing the hang so that I can report the issue to the appropriate project. Any suggestions would be appreciated.
I would be happy to post a stack trace, but when I kill the process (Ctrl-C) it just quits. Once, Ctrl-C didn’t quit the process, so I used Ctrl-Z to suspend it and was then able to kill it with kill -9. But in either case, no stack trace was shown.
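One way to get a stack trace out of a hung process without killing it is Python’s standard-library `faulthandler` module. A minimal sketch (the signal choice and timeout are arbitrary assumptions, not anything from this thread):

```python
import faulthandler
import signal

# Dump Python stack traces for all threads when the process receives
# SIGUSR1, without terminating it. Put this at the top of the training
# script, then trigger it from another shell with: kill -USR1 <pid>
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Fallback watchdog: also dump traces automatically on a timer
# (here every 30 minutes, repeating) in case the hang recurs overnight.
faulthandler.dump_traceback_later(timeout=1800, repeat=True)
```

If the hang is inside native CUDA code rather than Python, the dumped trace will at least show which Python frame issued the blocking call.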
You can suspend the process via Ctrl-Z and then send a SIGHUP to the main process (if you are using DDP).
Do you see any Xid entries with an error identifier in the dmesg output?
The posted output doesn’t show any of these IDs.
I see the process still has memory allocated, but GPU utilization is 0%, and I don’t see any CPU activity going on either. Next time it happens I’ll take a screenshot of nvidia-smi and run dmesg looking for Xid entries.
I’m currently using nightly 1.8.0.dev20201022+cu110 with my 3090 on driver 455.23. On rare occasions I will get some kind of dataloader-related crash. However, I don’t have any hanging issues.
This seems like a multiprocessing issue, if you are using multiple workers.
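A quick way to test that hypothesis is to take the worker processes out of the picture entirely. A minimal sketch, assuming PyTorch is installed (the dataset and batch size here are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; substitute the real one from the hanging run.
dataset = TensorDataset(torch.randn(128, 3), torch.randint(0, 2, (128,)))

# num_workers=0 loads all data in the main process, with no worker
# subprocesses. If the intermittent hang disappears with this setting,
# the DataLoader's multiprocessing workers are the likely culprit.
loader = DataLoader(dataset, batch_size=32, num_workers=0)

for x, y in loader:
    pass  # substitute the real training step here
```

It will be slower, but since the hang takes thousands of iterations to show up anyway, even a partial run that survives past the usual failure point is informative.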
Try updating to the latest PyTorch version, and please create a new topic if it still doesn’t work.