DataParallel with non-contiguous GPU IDs

(Apologies if there is an existing similar topic. The search doesn’t seem to be working properly. The topics I could find via searching on Google did not seem to answer my question.)

I am parallelizing across multiple GPUs, and I want to parallelize across non-contiguous device IDs (e.g., 0 and 2), as I am running another process on device 1. I can set CUDA_VISIBLE_DEVICES=0,2 and wrap my model in nn.DataParallel (torch.cuda.device_count() correctly returns 2). However, the DataParallel constructor by default assigns device IDs list(range(torch.cuda.device_count())) (see https://pytorch.org/docs/stable/_modules/torch/nn/parallel/data_parallel.html), which means it will try to use IDs [0, 1]. It will therefore never allocate to device 2 (and consequently runs out of memory).

If I pass the DataParallel constructor the device list [0, 2], it throws an error because device ID 2 == device_count() (this is in cuda/__init__.py:292). So it won't let me assign to any GPU ID >= the device count, meaning that if I want to use all available GPUs, their IDs must be contiguous.
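
For reference, here is roughly my setup (the model here is just a placeholder for my actual one):

```python
import torch
import torch.nn as nn

# With CUDA_VISIBLE_DEVICES=0,2 set in the environment:
print(torch.cuda.device_count())  # 2

model = nn.Linear(10, 10)  # placeholder for my actual model

# Default: device_ids = list(range(torch.cuda.device_count())) == [0, 1]
parallel_model = nn.DataParallel(model.cuda())

# Passing the physical IDs raises an error, since 2 >= device_count():
# parallel_model = nn.DataParallel(model.cuda(), device_ids=[0, 2])
```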

Is there any way for me to force DataParallel to use device 2?

Thanks!

Hi,

GPU IDs are always 0 to (number of GPUs - 1).
When you set CUDA_VISIBLE_DEVICES=0,2, the IDs are remapped to 0 and 1: using device 0 in your program will use physical GPU 0, and using device 1 will use physical GPU 2.
Note that you could also set CUDA_VISIBLE_DEVICES=2,0, in which case 0 would map to physical GPU 2 and 1 would map to physical GPU 0.

So your code already works and will use the devices specified by the environment variable :slight_smile:
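
In other words, something like this should work (a minimal sketch; the model is a placeholder, and CUDA_VISIBLE_DEVICES has to be set before CUDA is initialized):

```python
import os
# Set before importing torch / initializing CUDA so the remapping applies.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"

import torch
import torch.nn as nn

model = nn.Linear(10, 10).cuda()  # placeholder model, on logical device 0

# Logical ID 0 -> physical GPU 0, logical ID 1 -> physical GPU 2.
parallel_model = nn.DataParallel(model, device_ids=[0, 1])
```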

Ok, thanks! Then I guess the problem is that it's not mapping to the second GPU at all. When I didn't specify device IDs (so, by default, DataParallel uses both 0 and 1, i.e., physical GPUs 0 and 2), it threw an out-of-memory exception once GPU 0 reached its memory limit, rather than also using GPU 2 (I was monitoring with nvidia-smi).

I will look into this further and see if I can figure out why it’s not mapping to the 2nd GPU.

To update: my model (a Module) was wrapped in another class with functions like get_loss which called the forward function. I didn't realize DataParallel requires the outputs of forward to be tensors, so I wrote a higher-level wrapper that can handle the model class both when it is being parallelized and when it is not. That solved the problem.
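
For anyone hitting the same issue, here is a minimal sketch of the kind of wrapper I ended up with (the class name and the loss are just placeholders for my actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LossWrapper(nn.Module):
    """Hypothetical wrapper: its forward() returns only tensors,
    so nn.DataParallel can scatter the inputs and gather the losses."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, inputs, targets):
        outputs = self.model(inputs)              # runs on each replica's GPU
        loss = F.cross_entropy(outputs, targets)  # placeholder loss
        return loss.unsqueeze(0)                  # per-replica tensor to gather

# Usage sketch:
# wrapped = nn.DataParallel(LossWrapper(model).cuda())
# loss = wrapped(inputs, targets).mean()
# loss.backward()
```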