Our school’s Slurm server has V100 GPUs with CUDA 10.1. Since I don’t have root access, I cannot upgrade the NVIDIA driver. Code that runs successfully on my PC (Tesla P4 with CUDA 11.6) fails on the server. I have also tried various PyTorch and CUDA versions (<=10.1), but it always hangs when running `model.to(device)`.
Based on your output, 3 of the 4 GPUs seem to work, while GPU0 reports an uncorrectable ECC error, so you might want to check the RAM of this device.
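For reference, one quick way to read the per-GPU ECC counters from Python is to shell out to `nvidia-smi` (the query field below assumes a reasonably recent driver; the helper name is just for illustration):

```python
import subprocess

def query_ecc_counts():
    """Return nvidia-smi's per-GPU uncorrected ECC counts as CSV text,
    or None if nvidia-smi is not on PATH (e.g. no NVIDIA driver)."""
    try:
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=index,name,ecc.errors.uncorrected.volatile.total",
             "--format=csv"],
            capture_output=True, text=True, timeout=30,
        )
        return result.stdout
    except FileNotFoundError:
        return None

if __name__ == "__main__":
    print(query_ecc_counts())
```

A nonzero uncorrected count on one index while the others read 0 would match what your output shows for GPU0.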
Oh! I can’t believe I didn’t notice the ECC problem on this GPU; it caused me to spend days troubleshooting environment issues. Thanks very much.
Just to confirm my understanding: are you able to use the other GPUs in this PyTorch environment, and does it only fail on GPU0 (the one showing the ECC error)?
No, I cannot choose which GPU to use; it is assigned by the Slurm system.
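In that case it may help to know which physical GPUs Slurm actually handed to your job. Slurm exposes the granted devices through `CUDA_VISIBLE_DEVICES`, so inside the job `cuda:0` maps to the first entry of that list, not necessarily the physical GPU0 that `nvidia-smi` shows on the node. A minimal check:

```python
import os

# Slurm's GPU allocation is passed to the job via CUDA_VISIBLE_DEVICES;
# printing it shows which physical GPUs this job can see.
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")
print("CUDA_VISIBLE_DEVICES:", visible)
```

If the faulty GPU's index appears here, the hang would be expected; whether you can steer allocation away from it depends on your cluster's Slurm configuration.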
I think you are right, because I ran my code successfully in the same environment a few weeks ago.