Can I set unequal load on 2 GPUs using nn.DataParallel?

Hello all,
I am doing distributed training with model = nn.DataParallel(model, device_ids=[0, 1]). Is there an option to specify what fraction of the memory load each GPU should take, or something along those lines? The two GPUs I have may not be equally free, so when I run distributed training, one of them might OOM.
Thanks,
Megh

I don’t believe there is any such option. In DataParallel (and DistributedDataParallel), every GPU will have a replica of the model for local training, and every GPU will see input batches of the same size. Thus the amount of memory used by each rank (each GPU) in distributed training is approximately the same. If one of your GPUs is OOMing, you can try to:

  • reduce the batch size (though this reduces the batch size on every rank, including the GPU that has enough memory)
  • use an optimizer that stores less per-parameter state (e.g. SGD instead of Adam); see the sketch below
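
For reference, here is a minimal sketch of both mitigations together. The model, dataset, batch size, and learning rate are made up purely for illustration:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical model and data, just to show the two mitigations in context.
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
model = nn.DataParallel(model, device_ids=[0, 1])

dataset = TensorDataset(torch.randn(4096, 512), torch.randint(0, 10, (4096,)))

# 1) A smaller batch size lowers peak memory, but on every GPU, not just the tight one.
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# 2) Plain SGD keeps no per-parameter running averages, unlike Adam,
#    so the optimizer state takes far less GPU memory.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
```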

Thank you @osalpekar. But if the batch size stays the same on each of the GPUs, where does the distribution of the training actually happen? In other words, which part of the memory load that would sit on a single GPU gets split across the multiple GPUs? Please let me know if I am missing something.

Each DataLoader pairs with one DDP instance. So if you set a batch size of 64 in your data loader, each replica trains on its own size-64 batch. When training on 2 GPUs, that means you effectively process 64*2 = 128 samples per iteration. Each replica runs the forward pass independently on its own batch; in the backward pass the replicas exchange the gradients they computed and average them, so each replica ends up with the gradients it would have had if it had trained on the full 128-sample batch itself. Finally, each replica performs the optimizer step using the averaged gradients, so all replicas have identical model weights at the end of every iteration.
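
A minimal 2-GPU sketch of that flow, in case it helps. The model, data, and master address/port are placeholders, and the per-rank batch size of 64 matches the example above:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def run(rank, world_size):
    # One process per GPU; rendezvous settings here are illustrative.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(512, 10).cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])

    dataset = TensorDataset(torch.randn(8192, 512), torch.randint(0, 10, (8192,)))
    # DistributedSampler gives each rank a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    # batch_size=64 is *per rank*: with 2 GPUs, one iteration covers 128 samples.
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    for inputs, targets in loader:
        inputs, targets = inputs.cuda(rank), targets.cuda(rank)
        optimizer.zero_grad()
        loss = criterion(ddp_model(inputs), targets)
        loss.backward()   # gradients are all-reduced (averaged) across ranks here
        optimizer.step()  # every rank applies the same averaged gradients

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    torch.multiprocessing.spawn(run, args=(world_size,), nprocs=world_size)
```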

In essence, the speed-up comes from the fact that with n GPUs you can train on n times as much data per iteration. For the same reason, the per-GPU memory load in distributed training is essentially the same as when training on a single GPU (there is a small overhead for synchronizing the gradients, but it is unlikely to influence which model or hyperparameters you can use).
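
To make the averaging argument concrete, here is a tiny self-contained check; the linear model and squared-error loss are arbitrary choices for illustration:

```python
import torch

# Toy check: the gradient of the mean loss over a 128-sample batch equals the
# average of the two 64-sample gradients that the replicas compute separately.
torch.manual_seed(0)
w = torch.randn(512, 10, requires_grad=True)
x, y = torch.randn(128, 512), torch.randn(128, 10)

def grad_of_mean_loss(xb, yb):
    loss = ((xb @ w - yb) ** 2).mean()
    g, = torch.autograd.grad(loss, w)
    return g

full = grad_of_mean_loss(x, y)               # what one GPU with the full 128-sample batch would see
rank0 = grad_of_mean_loss(x[:64], y[:64])    # replica 0's local gradient
rank1 = grad_of_mean_loss(x[64:], y[64:])    # replica 1's local gradient
print(torch.allclose(full, (rank0 + rank1) / 2, atol=1e-6))  # True
```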


Thanks for the detailed explanation! That makes sense.