torch.cuda.set_device is ignored by torch.load

CUDA_VISIBLE_DEVICES is not really a solution for changing the default device, because it relies on setting the environment variable before the script runs. Perhaps the torch.load docs should mention that torch.cuda.set_device is ignored by it and that tensors are loaded back onto the same device they were saved from.
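For illustration, a minimal sketch of the behavior (assuming a machine with at least two GPUs; 'snapshot.pt' is a hypothetical checkpoint path, not from the original report):

import torch

# Save a tensor that lives on cuda:1.
torch.save(torch.zeros(4, device='cuda:1'), 'snapshot.pt')

# Later (e.g. in another process), pick cuda:0 as the default device...
torch.cuda.set_device(0)

# ...but torch.load still restores the tensor to the device it was
# saved from, not the current default device.
t = torch.load('snapshot.pt')
print(t.device)  # cuda:1, not cuda:0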

I had a distributed training setup (4 nodes, 8 GPUs each) and for the life of me could not figure out why I could run DistributedDataParallel with 4 processes (each driving 8 GPUs) but not with 32 processes, each calling torch.cuda.set_device(gpus[0]). Then I realized that torch.load (restoring a previous snapshot) ignores torch.cuda.set_device!
Thanks to the map_location solution from here and here, I now load to CPU first, and it works just fine!
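A sketch of that workaround, assuming each process has a hypothetical per-process GPU list gpus (as in the setup above) and that the checkpoint is a flat dict; the names are illustrative, not from the original report:

import torch

gpus = [0]  # hypothetical per-process GPU assignment, as described above

torch.cuda.set_device(gpus[0])

# Loading to CPU first makes torch.load ignore the saved device tags...
snapshot = torch.load('snapshot.pt', map_location='cpu')

# ...after which tensors can be moved to this process's GPU explicitly.
snapshot = {k: v.cuda(gpus[0]) if torch.is_tensor(v) else v
            for k, v in snapshot.items()}

Note that map_location also accepts a callable, or a device string such as 'cuda:0' directly, which avoids the extra hop through CPU memory.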