- If I set batch_size=4 and train with nn.DataParallel or nn.DistributedDataParallel on 8 GPUs, then what will be the batch-size and mini_batch_size: 4, 8, or 32?
The batch_size variable is usually a per-process concept. Since DataParallel is single-process multi-thread, setting batch_size=4 makes 4 the real batch size; the per-thread batch size will be 4 / num_of_devices. However, since these threads accumulate their grads into the same param.grad field, the per-thread batch size shouldn't make any difference.
DistributedDataParallel (DDP) is multi-process training, so if you set batch_size=4 for each process, the real batch size will be 4 * world_size. One caveat is that DDP uses AllReduce to compute the average (instead of the sum) of gradients across processes.
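A quick arithmetic sketch of the DDP case (the world size of 8 and the toy per-process gradients are made up), showing both the effective batch size and the averaging behavior:

```python
per_process_batch, world_size = 4, 8
effective_batch = per_process_batch * world_size
print(effective_batch)  # 32

# AllReduce-sum followed by division by world_size (what DDP does)
# yields the mean of the per-process gradients, not their sum
per_process_grads = [float(rank) for rank in range(world_size)]
ddp_grad = sum(per_process_grads) / world_size
print(ddp_grad)  # 3.5
```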
- Can I use a batch_size lower than the number of GPUs, e.g. batch_size=4 on 8 GPUs (will it lead to an error, will only 4 GPUs be used, or will batch_size be increased to 8 or 32)?
It should work, but it will not fully utilize all devices. With batch_size=4, IIUC, it can use at most 4 GPUs, since scattering along the batch dimension produces at most 4 chunks.
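To see why only 4 GPUs would be used, here is a CPU-only sketch of DataParallel's scatter with batch_size=4 over 8 devices (the tensor sizes are made up):

```python
import torch

batch_size, num_devices = 4, 8
inputs = torch.randn(batch_size, 10)
# torch.chunk cannot produce more chunks than rows, so only 4 chunks
# (of size 1) come out and the remaining 4 replicas receive no input
chunks = torch.chunk(inputs, num_devices, dim=0)
print(len(chunks), tuple(chunks[0].shape))  # 4 (1, 10)
```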
- I tried to train EfficientNet-L2 with each of nn.DataParallel and nn.DistributedDataParallel, but with nn.DataParallel I can use a batch_size 2x higher than with nn.DistributedDataParallel without CUDA out-of-memory errors. Does nn.DistributedDataParallel use 2x more GPU memory than nn.DataParallel?
DDP allocates dedicated CUDA buffers as communication buckets, so it will use more CUDA memory than DP, but the overhead should not be 2X the DP footprint: the total comm bucket size is roughly the same as the model size.
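A rough way to estimate that bucket overhead is to sum the model's parameter bytes; a sketch with a made-up toy model standing in for the real network:

```python
import torch
from torch import nn

model = nn.Linear(1000, 1000)  # toy stand-in for the real model
# DDP's gradient buckets hold one flattened copy of all gradients,
# so the extra memory is roughly one model's worth of parameters
bucket_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"{bucket_bytes / 2**20:.2f} MiB")  # ~3.82 MiB for fp32
```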