CUDA_VISIBLE_DEVICES is not really a solution for changing the default device, because it relies on setting the environment variable before the script runs. Perhaps the torch.load docs should mention that it ignores torch.cuda.set_device and that tensors are loaded onto the same device they were saved from.
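A minimal sketch of that behavior (device indices and the file name are just placeholders): without a map_location, the tensor comes back on the device it was saved from, no matter what the current device is.

```python
import torch

# Save a tensor that lives on cuda:3.
torch.save(torch.randn(2, 2, device="cuda:3"), "snapshot.pt")

# Changing the current device does not affect where torch.load puts it.
torch.cuda.set_device(0)
t = torch.load("snapshot.pt")
print(t.device)  # cuda:3, not cuda:0
```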
I had a distributed training job (4 nodes, 8 GPUs each) and for the life of me could not figure out why I could run DistributedDataParallel with 4 processes (each using 8 GPUs) but not with 32 processes each calling torch.cuda.set_device(gpus[0]) — until I realized that torch.load (of a previous snapshot) ignores torch.cuda.set_device!
Thanks to the map_location solution from here and here, I now load to CPU first, and it works just fine!
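Here is a rough sketch of that workaround, assuming the checkpoint is a plain state_dict and with the model, file name, and GPU index as placeholders: load onto the CPU first, then move the model to the GPU this process was assigned.

```python
import torch
import torch.nn as nn

local_gpu = 0  # e.g. gpus[0] for this process/rank
device = torch.device(f"cuda:{local_gpu}")

model = nn.Linear(10, 10)  # placeholder model
state = torch.load("snapshot.pt", map_location="cpu")  # ignore the saved device
model.load_state_dict(state)
model.to(device)

# Passing map_location=device instead would skip the CPU hop and map the
# saved tensors directly onto this process's GPU.
```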