CUDA_VISIBLE_DEVICES is not really a solution for changing the default device, because it relies on setting the environment variable before the script runs. Perhaps the
torch.load docs should mention that it ignores
torch.cuda.set_device and loads tensors onto the same device they were saved from.
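For reference, here is a minimal sketch of how map_location sidesteps this (the filename is just a placeholder):

```python
import torch

# Tensors saved from, say, cuda:3 come back on cuda:3 by default,
# regardless of any torch.cuda.set_device() call in this process.
# map_location remaps them at load time instead:

# remap everything to the CPU
state = torch.load("snapshot.pt", map_location="cpu")

# or remap straight to this process's current device
state = torch.load(
    "snapshot.pt",
    map_location=lambda storage, loc: storage.cuda(torch.cuda.current_device()),
)
```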
I had a distributed training job (4 nodes, 8 GPUs each) and for the life of me could not figure out why I could run
DistributedDataParallel with 4 processes (each driving 8 GPUs) but not with 32 processes each calling
torch.cuda.set_device(gpus). Until I realized that
torch.load (of a previous snapshot) ignores
torch.cuda.set_device. Following the map_location solution from here and here, I now load to CPU first, and it works just fine!
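In case it helps anyone, this is roughly what my fix looks like in a one-process-per-GPU DDP worker. The path, the "model" key, and the stand-in model are placeholders for whatever your checkpoint contains, and it assumes init_process_group has already been called:

```python
import torch
import torch.nn as nn

def load_snapshot(path, local_rank):
    # Pin this process to its GPU. On its own this does NOT
    # change where torch.load places the saved tensors.
    torch.cuda.set_device(local_rank)

    # Load to CPU first, so tensors never land on the GPU index
    # they were saved from (which belonged to another process).
    snapshot = torch.load(path, map_location="cpu")

    model = nn.Linear(10, 10)  # stand-in for the real model
    model.load_state_dict(snapshot["model"])
    model.cuda(local_rank)

    # Wrap only after the weights sit on this process's own device.
    return nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```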