GeForce 3090 Hanging

I’m experiencing an intermittent issue with my GeForce 3090. The issue occurred with the nightly builds before 1.7 was released and still occurs with the 1.7 release. It also occurred with NVIDIA driver 455.23 and persists after updating to 455.32.

For the release version, after installing 455.32, I installed PyTorch with:

conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch

Sometimes the Python process hangs after, say, the ten-thousandth iteration, or the five-millionth. The issue occurs regardless of the network I am training (GAN, CNN, etc.).

I don’t know whether this is a PyTorch issue, but I am hoping for suggestions on how to investigate what might be causing the hang so that I can report it to the appropriate project. Any suggestions would be appreciated.

Are you using a single GPU or multiple ones?
Could you kill the process when it hangs and post the stack trace here, please?

Hi ptrblck!

Just a single GPU.

I would be happy to post a stack trace, but when I kill the process (Ctrl-C) it just quits. Once, Ctrl-C didn’t quit the process, so I pressed Ctrl-Z to send it to the background and killed it with kill -9. In either case, no stack trace was shown.

Any suggestions on how to see the stack trace?

kill -9 wouldn’t show the stack trace.
Could you check for any XIDs in dmesg after the training is killed?

So like this?

dmesg | grep -i xid

I just ran dmesg and see a number of messages of this type:

[269351.243437] pcieport 0000:00:03.0: AER: Multiple Corrected error received: 0000:00:03.0
[269351.243442] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[269351.244286] pcieport 0000:00:03.0: AER:   device [8086:6f08] error status/mask=00000001/00002000
[269351.244710] pcieport 0000:00:03.0: AER:    [ 0] RxErr  

By the way, thanks for all your contributions to this forum! I must have read 30+ threads with your answers while searching for help on various issues.

Is there another way, besides kill or Ctrl-C, to make it show the stack trace?

You can send the process to the background via Ctrl-Z and send a SIGHUP to the main process (if you are using DDP).
Do you see any Xid entries with an error code directly? The posted output doesn’t show any of these IDs.

OK, thanks! I’m not sure what DDP is; I’ll look into it.

I don’t see any matches for Xid. The entire output of dmesg consists of messages similar to the ones I posted.

DDP is DistributedDataParallel, i.e. multi-GPU training, but since you’ve already mentioned you are using a single device, please skip this part.

If the code hangs, are you seeing any GPU utilization in nvidia-smi?

Thanks!

I see the process is still allocated, but the utilization is 0%, and I don’t see any CPU activity going on either. Next time it happens I’ll take a screenshot of nvidia-smi and check dmesg for Xid entries.
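
In the meantime, I could leave a small polling script running so the utilization around a hang gets logged. A rough sketch (the query fields are standard nvidia-smi ones; the 5-second interval is arbitrary):

# Poll nvidia-smi periodically and log GPU utilization and memory usage,
# so a hang (0% utilization with memory still allocated) shows up in the log.
import subprocess
import time

while True:
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=timestamp,utilization.gpu,memory.used",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout.strip()
    print(out, flush=True)
    time.sleep(5)  # arbitrary polling interval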

Is there anything else I could do that would be insightful?

I’ll try to use the signal method in this SO answer.

Might take a while, but when it happens again I’ll post the backtrace here.
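
For reference, here is roughly what I have in mind (a sketch of the signal approach; not the exact code from that answer): register a handler so the hung process dumps the Python stacks of all threads on SIGUSR1 instead of dying.

import faulthandler
import signal

# Dump the traceback of every thread to stderr when SIGUSR1 arrives
# (faulthandler.register is not available on Windows); the process keeps running.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# ... training loop runs here ...

Then, when it hangs, I should be able to run kill -USR1 <pid> from another shell and see the stack traces on stderr.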

I’m currently using the nightly 1.8.0.dev20201022+cu110 with my 3090 on driver 455.23. On rare occasions I get some kind of DataLoader-related crash, but I don’t have any hanging issues.

I am having sort of the same issue.
My GeForce 1080 is hanging between these two print statements:

print(f'test 3')
for batch_idx, (inputs, targets) in enumerate(trainloader):
    print(f'test 2')

Test 3 is printed, but test 2 is not.
After Ctrl+C, test 2 is printed and the net trains for a short moment.

Torch version 1.5.0

This seems like a multiprocessing issue, if you are using multiple workers.
Try updating to the latest PyTorch version and please create a new topic if it still doesn’t work.
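
As a quick check, you could also rule out the DataLoader workers by temporarily disabling multiprocessing. A minimal sketch (dataset and the batch size are placeholders for your own setup):

from torch.utils.data import DataLoader

trainloader = DataLoader(
    dataset,          # your existing Dataset
    batch_size=64,    # placeholder value
    shuffle=True,
    num_workers=0,    # load batches in the main process; no worker processes to hang
)

If the hang disappears with num_workers=0, the problem is in the workers (or the code they run) rather than in the model itself.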

I just want to follow up and report that I have not been able to capture a stack trace.

The last two times I’ve had any issue (after about a week of continuous use), instead of just hanging, the host became unresponsive.

I suspect that it is a system-stability issue, possibly related to thermal overload.