I have this strange problem where the GPU hangs the minute
model.cuda() is called.
- GPU: V100 32GB
- Nvidia Driver: 418.87.00
- CUDA version: 10.1
- PyTorch version: tested with 1.3.0 and 1.2.0.
It happens within the training script, and even in a Python console where I only create a torchvision model and try to send it to the GPU: the node then hangs (the resource manager is Slurm, if that’s relevant). I don’t have this problem with other GPUs (notably the K80).
I would be tempted to think that it is related to this issue, but you say that it also fails with 1.2.0.
Does it happen only with the torchvision model you are trying to send to the GPU or even with a toy network (let’s say, several Conv2d and BNs)?
Apparently, there’s also an issue with 1.2 here: .cuda() Problem - I can't transfer objects to the GPU.
Could you try reinstalling from conda, as suggested here?
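To make the toy-network check concrete, here is a minimal sketch. It assumes PyTorch is installed in the same Python environment. Because the reported symptom is a hang, the check runs in a child process with a timeout, so your console does not lock up with it. The helper name `run_with_timeout`, the layer sizes, and the 120-second limit are arbitrary choices of mine, not anything from the original report. Note that if the child process gets stuck in a state where it cannot be killed, the helper itself may still block.

```python
import subprocess
import sys

# Toy network: a few Conv2d and BatchNorm2d layers, moved to the GPU.
# This string is executed in a separate Python process.
TOY_SNIPPET = """
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
)
model.cuda()  # the call that reportedly hangs
out = model(torch.randn(2, 3, 32, 32).cuda())
torch.cuda.synchronize()
print("ok", tuple(out.shape))
"""


def run_with_timeout(snippet, timeout_s=120):
    """Run a Python snippet in a child process.

    Returns (status, output), where status is "ok", "error" or "timeout".
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", snippet],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child did not finish in time; subprocess.run kills it for us.
        return "timeout", ""
    status = "ok" if proc.returncode == 0 else "error"
    return status, proc.stdout


if __name__ == "__main__":
    status, output = run_with_timeout(TOY_SNIPPET)
    print(status)
    print(output)
```

If this reports "timeout" for the toy network as well, the problem is probably not specific to the torchvision model.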
Apparently, there’s also an issue with 1.2 here: .cuda() Problem - I can’t transfer objects to the GPU.
Indeed, I had this problem while running PyTorch 1.2. It seemed to be related to a Conda install of PyTorch (the version doesn’t seem to matter) combined with CUDA 10.1. This has since been fixed, so all you need to do is reinstall PyTorch from Conda.
Thanks for the answers. I was able to solve it by reinstalling PyTorch on the same node where the script will be run (one with a V100 card). If I install it with CUDA support but not on a node with a V100 card, the script hangs (though it works fine on other, older cards). After reinstalling on the V100 node, it works fine across all the GPUs.
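For anyone comparing installs across nodes, a small sketch like the following prints what the installed PyTorch reports about itself and the visible GPUs. The function name `describe_environment` is just illustrative, and this only collects information: it does not by itself explain why the install node mattered here.

```python
def describe_environment():
    """Collect the PyTorch version, its CUDA build, and the visible GPUs."""
    info = {
        "torch": None,
        "cuda_build": None,
        "cuda_available": False,
        "devices": [],
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return info

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        for i in range(torch.cuda.device_count()):
            name = torch.cuda.get_device_name(i)
            capability = torch.cuda.get_device_capability(i)  # (major, minor)
            info["devices"].append((name, capability))
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(key, "=", value)
```

Running it on both the node where PyTorch was installed and the V100 node makes it easier to spot differences between the two environments.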