Our school’s Slurm server has a V100 with CUDA 10.1. Since I don’t have root access, I cannot upgrade the NVIDIA driver. Code that runs successfully on my PC (Tesla P4 with CUDA 11.6) fails on the server. I have also tried various PyTorch and CUDA versions (<= 10.1), but it always hangs at model.to(device).
Just to confirm my understanding: are you able to use the other GPUs in this PyTorch environment, and does it fail only on GPU 0 (the one showing the ECC error)?
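One quick way to check this is to try a tiny allocation on each visible GPU and fall back to CPU if none responds. This is a hypothetical diagnostic sketch (the helper name `pick_device` is my own, not from the thread); note that a truly hung GPU may block rather than raise, so run it with a timeout or after hiding the suspect GPU:

```python
import torch

def pick_device(preferred: str = "cuda:0") -> torch.device:
    """Return a usable device, falling back to CPU when CUDA is
    unavailable or the requested GPU cannot allocate memory."""
    if preferred.startswith("cuda") and torch.cuda.is_available():
        try:
            # Tiny allocation to verify the GPU actually responds.
            torch.empty(1, device=preferred)
            return torch.device(preferred)
        except RuntimeError:
            pass  # e.g. ECC / uncorrectable memory error on this GPU
    return torch.device("cpu")

print(pick_device())
```

If GPU 0 is the faulty one, you can also hide it from PyTorch entirely with `CUDA_VISIBLE_DEVICES=1,2,3 python train.py` (the remaining GPUs are then renumbered from 0), which avoids the hang without any code changes.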