(Apologies if there is an existing similar topic. The search doesn’t seem to be working properly. The topics I could find via searching on Google did not seem to answer my question.)
I am parallelizing across multiple GPUs and want to use non-contiguous device IDs (e.g., 0 and 2), since another process is running on device 1. I can set CUDA_VISIBLE_DEVICES=0,2 and wrap my model in nn.DataParallel (torch.cuda.device_count() correctly returns 2). However, the DataParallel constructor by default uses list(range(torch.cuda.device_count())) as the device IDs (see https://pytorch.org/docs/stable/_modules/torch/nn/parallel/data_parallel.html), i.e., [0, 1]. This means it will never try to allocate on device 2 (and consequently runs out of memory).
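Here is a minimal sketch of my setup (the Linear model is just a placeholder for my actual model):

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"  # must be set before CUDA is initialized

import torch
import torch.nn as nn

model = nn.Linear(10, 10).cuda()
print(torch.cuda.device_count())  # 2: only devices 0 and 2 are visible

# With no device_ids argument, DataParallel defaults to
# list(range(torch.cuda.device_count())), i.e. [0, 1].
parallel_model = nn.DataParallel(model)
```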
If I pass the DataParallel constructor the device list [0, 2], it throws an error because device ID 2 == device_count() (this check is in cuda/__init__.py:292). So it won't let me assign to any GPU ID >= the device count; if I want to use all available GPUs, their IDs must be contiguous.
Is there any way for me to force DataParallel to use device 2?
GPU IDs always run from 0 to the number of GPUs - 1.
When you set CUDA_VISIBLE_DEVICES=0,2, the IDs are remapped to 0 and 1: using 0 in your program will use physical GPU 0, and using 1 will use physical GPU 2.
Note that you could also set CUDA_VISIBLE_DEVICES=2,0, in which case 0 would map to physical GPU 2 and 1 would map to physical GPU 0.
So your code already works and will use the devices specified by the environment variable.
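A quick sketch to verify the mapping (illustrative; the environment variable must be set before any CUDA call):

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"  # set before CUDA is initialized

import torch
import torch.nn as nn

# Logical IDs inside the process are 0 and 1; they map to physical GPUs 0 and 2.
for logical_id in range(torch.cuda.device_count()):
    print(logical_id, torch.cuda.get_device_name(logical_id))

# DataParallel's default device_ids, [0, 1], therefore already covers
# physical GPUs 0 and 2; no explicit device_ids argument is needed.
model = nn.DataParallel(nn.Linear(10, 10).cuda())
```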
Ok, thanks! Then I guess the problem is that it's not mapping to the 2nd GPU at all. When I didn't specify device IDs (so, by default, DataParallel uses both 0 and 1, i.e., physical devices 0 and 2), it threw an out-of-memory exception once GPU 0's memory was exhausted, rather than also using GPU 2 (I was monitoring with nvidia-smi).
I will look into this further and see if I can figure out why it’s not mapping to the 2nd GPU.
To update: my model (a Module) was wrapped in another class with functions like get_loss, which called the forward function. I didn't realize DataParallel requires the outputs of forward to be tensors (so they can be gathered across GPUs), so I had to write a higher-level wrapper that handles the model class both when it's parallelized and when it's not. That solved the problem.
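For anyone who hits the same thing, here is a minimal sketch of that pattern (class and method names are illustrative, not my actual code): forward returns only tensors so DataParallel can gather them, and the loss is computed in the wrapper, outside of forward.

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(10, 2)

    def forward(self, x):
        # Return tensors only; DataParallel gathers these across GPUs.
        return self.net(x)

class ModelWrapper:
    """Illustrative wrapper; works the same with or without DataParallel."""

    def __init__(self, model, parallel=False):
        self.model = nn.DataParallel(model) if parallel else model
        self.criterion = nn.CrossEntropyLoss()

    def get_loss(self, x, target):
        # Loss is computed on the gathered outputs, outside forward().
        logits = self.model(x)
        return self.criterion(logits, target)
```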