I have this strange problem where the GPU hangs the minute model.cuda() is called.
GPU: V100 32GB
Nvidia Driver: 418.87.00
CUDA version: 10.1
PyTorch version: tested with 1.3.0 and 1.2.0.
It happens within the training script, and also in a plain Python console where I only create a torchvision model and try to send it to the GPU: the node then hangs (the resource manager is Slurm, if that's relevant). I don't have this problem with other GPUs (notably the K80).
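For anyone trying to reproduce this without locking up an interactive console, here is a minimal sketch that runs the same steps in a child process with a timeout. The helper name `run_with_timeout`, the constant `REPRO_SNIPPET`, and the choice of `resnet18` are my own assumptions for illustration; the report above does not name a specific model. If the hang is inside the driver, the child process may not be killable, so this may not help in every case.

```python
import subprocess
import sys

# The reproduction described above: create a torchvision model and move it to the GPU.
# resnet18 is an arbitrary choice (assumption); any torchvision model should do.
REPRO_SNIPPET = """
import torch, torchvision
model = torchvision.models.resnet18()
model.cuda()
torch.cuda.synchronize()
print("model moved to", next(model.parameters()).device)
"""


def run_with_timeout(code, timeout_s):
    """Run `code` in a fresh Python process and return 'ok', 'error', or 'hang'."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child did not finish in time; subprocess.run has already tried to kill it.
        return "hang"
    return "ok" if result.returncode == 0 else "error"
```

Calling `run_with_timeout(REPRO_SNIPPET, 120)` on the V100 node and on a K80 node should make the difference between the two visible; the 120-second limit is a guess and may need raising if the first CUDA call is slow on your system.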
Indeed, I had the same problem while running PyTorch 1.2. It seemed to be related to the Conda install of PyTorch (the version doesn't seem to matter) combined with CUDA 10.1. It has since been fixed, so all you need to do is reinstall PyTorch from Conda.
Thanks for the answers. I was able to solve it by reinstalling PyTorch on the same node where the script will run (one with a V100 card). If I install it with CUDA support but from a node without a V100 card, the script hangs on the V100 (though it works fine on other, older cards). After reinstalling on the V100 node, it works fine across all the GPUs.
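In case it helps others compare a working and a non-working environment, here is a small sketch that prints what the installed PyTorch build reports about itself and the first GPU. The function name `describe_torch_build` is hypothetical; the calls it uses (`torch.version.cuda`, `torch.cuda.is_available`, `torch.cuda.get_device_name`, `torch.cuda.get_device_capability`) are standard PyTorch. Note that the device queries touch the CUDA runtime, so on an affected install they might stall as well.

```python
def describe_torch_build():
    """Collect basic facts about the installed PyTorch build and the first GPU."""
    info = {"torch_installed": False}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment; nothing more to report.
        return info

    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    # CUDA version the binary was built against; None for a CPU-only build.
    info["built_for_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
        # A V100 reports compute capability (7, 0).
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

Running this in the environment installed from the V100 node and in one installed elsewhere, then comparing the two outputs, is one way to check whether the builds actually differ.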