Model.to(device) hangs

SCUTVK · November 7, 2022, 4:07am

Our school’s Slurm server has V100 GPUs with CUDA 10.1. I don’t have root access, so I cannot upgrade the NVIDIA driver. Code that runs successfully on my PC (Tesla P4 with CUDA 11.6) does not run on the server. I have tried several torch and CUDA versions (<=10.1), but it always hangs when moving the model to the device:

self.bert_regression_by_word_document.to(device=self.args['device'])

ptrblck · November 7, 2022, 5:54am

Based on your output 3/4 GPUs seem to work while GPU0 reports an uncorrectable ECC error, so you might want to check the RAM of this device.
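To check for this kind of fault without reading the full `nvidia-smi` table by eye, the uncorrected ECC counters can be queried per GPU. The snippet below is a minimal sketch, assuming `nvidia-smi` is on the PATH and supports the `ecc.errors.uncorrected.volatile.total` query field. The helper names and the `SAMPLE` text are hypothetical illustrations, not output from the original poster's server.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; the last one is the uncorrected ECC counter.
QUERY = "index,name,ecc.errors.uncorrected.volatile.total"


def parse_ecc_report(csv_text):
    """Map GPU index -> uncorrected ECC error count (None if not reported)."""
    report = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        index = row[0].strip()
        count = row[-1].strip()
        # GPUs without ECC support report a non-numeric value such as "[N/A]".
        report[int(index)] = int(count) if count.isdigit() else None
    return report


def query_ecc_report():
    """Run nvidia-smi and parse its CSV output (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_ecc_report(result.stdout)


def faulty_gpus(report):
    """Return the indices of GPUs with at least one uncorrected ECC error."""
    return sorted(index for index, count in report.items() if count)


# Hypothetical sample in the shape of `--format=csv,noheader` output.
SAMPLE = """0, Tesla V100-SXM2-32GB, 1
1, Tesla V100-SXM2-32GB, 0
2, Tesla V100-SXM2-32GB, 0
3, Tesla V100-SXM2-32GB, [N/A]
"""
```

On a GPU node, `faulty_gpus(query_ecc_report())` would list the devices worth reporting to the cluster administrators. A non-empty result does not prove the hang comes from that GPU, but it is a cheap first check before reinstalling environments.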

SCUTVK · November 7, 2022, 6:14am

Oh! I can’t believe I didn’t notice the ECC error on this GPU. I spent days troubleshooting environment issues because of it. Thanks very much.

ptrblck · November 7, 2022, 6:15am

Just to confirm my understanding: are you able to use the other GPUs in this PyTorch environment and it only fails on GPU0 (the one showing the ECC error)?

SCUTVK · November 7, 2022, 6:16am

No, I cannot choose which GPU to use. It is assigned by the Slurm system.
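Since Slurm makes the choice, it can help to log which node and GPU a job actually received, so that a hang can be tied to specific hardware. This is a rough sketch, assuming the cluster exports the usual `SLURM_JOB_ID`, `SLURMD_NODENAME`, and `CUDA_VISIBLE_DEVICES` variables. The function name is hypothetical, and depending on how the cluster constrains devices, the indices in `CUDA_VISIBLE_DEVICES` may not match the physical GPU numbering shown by `nvidia-smi`.

```python
import os


def describe_gpu_allocation(env=None):
    """Summarize the Slurm job, node, and GPUs visible to this process."""
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES")
    return {
        "job_id": env.get("SLURM_JOB_ID"),
        "node": env.get("SLURMD_NODENAME"),
        # Empty list when the variable is unset or empty.
        "visible_gpus": visible.split(",") if visible else [],
    }


# Print the allocation at the start of a job, before model.to(device) runs.
print(describe_gpu_allocation())
```

If the log shows that hangs always occur on the same node, that node name can be passed to the administrators, or, where the cluster permits it, excluded at submission time with `sbatch --exclude=<nodename>`.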

SCUTVK · November 7, 2022, 6:19am

I think you are right, because I ran my code successfully in this same environment a few weeks ago.