GPU hangs when calling model.cuda()

youali · October 15, 2019, 11:16am

Hi,

I have this strange problem where the GPU hangs the minute model.cuda() is called.

GPU: V100 32GB
Nvidia Driver: 418.87.00
CUDA version: 10.1
PyTorch version: tested with 1.3.0 and 1.2.0.

It happens within the training script, and even in a Python console where I only create a torchvision model and try to send it to the GPU; the node then hangs (the resource manager is Slurm, if that’s relevant). I don’t have this problem with other GPUs (notably the K80).

Thanks.

spanev · October 15, 2019, 2:04pm

Hi @youali,

I would be tempted to think that it is related to this issue, but you say that it also fails with 1.2.0.

Does it happen only with the torchvision model you are trying to send to the GPU or even with a toy network (let’s say, several Conv2d and BNs)?
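
In case it helps, here is a minimal sketch of such a toy-network check. It runs the snippet in a child process with a timeout, so a hang does not take the console down with it. The layer sizes, the helper name run_with_timeout, and the 120-second default are my own assumptions, not anything from your setup. The first CUDA call can be slow, so adjust the timeout as needed.

```python
import subprocess
import sys

# Toy network: a few Conv2d and BatchNorm2d layers (sizes are arbitrary).
# Kept as a string so it runs in a child process that can be killed on a hang.
TOY_SNIPPET = """
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
)
print("CUDA available:", torch.cuda.is_available())
model.cuda()
torch.cuda.synchronize()
print("model moved to GPU")
"""


def run_with_timeout(snippet, timeout_s=120.0):
    """Run `snippet` in a fresh interpreter and return (status, output).

    status is "ok", "error" (non-zero exit code) or "hang" (timeout hit).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return "hang", ""
    status = "ok" if proc.returncode == 0 else "error"
    return status, proc.stdout + proc.stderr
```

Calling run_with_timeout(TOY_SNIPPET) on the V100 node and printing the result should tell you whether the toy network hangs too. Note that if the process is stuck inside the driver, the kill may not go through, so this may not protect the node in every case.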

alex.veuthey · October 15, 2019, 2:14pm

Apparently, there’s also an issue with 1.2 here: .cuda() Problem - I can't transfer objects to the GPU.

Could you try reinstalling from conda, as suggested here?

Bruno_Oliveira · October 15, 2019, 8:17pm

Apparently, there’s also an issue with 1.2 here: .cuda() Problem - I can’t transfer objects to the GPU.

Indeed, I had this problem while running PyTorch 1.2. It seems it was related to a Conda PyTorch install (version doesn’t seem to matter) and CUDA 10.1. This was fixed and all you need to do is reinstall it from Conda.

youali · October 17, 2019, 9:52am

Thanks for the answers. I was able to solve it by reinstalling PyTorch on the node where the script will run (one with a V100 card). If I install it with CUDA support on a node without a V100 card, the script hangs on the V100 (but works fine on other, older cards). After reinstalling on the V100 node, it works fine across the GPUs.