- If I set batch_size=4 and train with nn.DataParallel or nn.DistributedDataParallel on 8 GPUs, then what will be the batch-size and mini_batch_size: 4, 8, or 32?
The batch_size variable is usually a per-process concept. Since DataParallel is single-process multi-thread, setting batch_size=4 makes 4 the real batch size; the per-thread batch size will be 4 / num_of_devices. However, since these threads accumulate their grads into the same param.grad field, the per-thread batch size shouldn't make any difference.
DistributedDataParallel (DDP) is multi-process training, so if you set batch_size=4 for each process, the real batch size will be 4 * world_size. One caveat is that DDP uses AllReduce to compute the average (instead of the sum) of gradients across processes.
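A quick arithmetic sketch of the DDP case (the world size of 8 and the toy per-process gradients are made up), showing both the effective batch size and the averaging behavior:

```python
per_process_batch, world_size = 4, 8
effective_batch = per_process_batch * world_size
print(effective_batch)  # 32

# AllReduce-sum followed by division by world_size (what DDP does)
# yields the mean of the per-process gradients, not their sum
per_process_grads = [float(rank) for rank in range(world_size)]
ddp_grad = sum(per_process_grads) / world_size
print(ddp_grad)  # 3.5
```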
- Can I use a batch_size lower than the number of GPUs, e.g. batch_size=4 on 8 GPUs (will it lead to an error, will only 4 GPUs be used, or will batch_size be increased to 8 or 32)?
It should work, but it will not fully utilize all devices. With batch_size=4, IIUC, it can use at most 4 GPUs, since scattering along the batch dimension produces at most 4 chunks.
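To see why only 4 GPUs would be used, here is a CPU-only sketch of DataParallel's scatter with batch_size=4 over 8 devices (the tensor sizes are made up):

```python
import torch

batch_size, num_devices = 4, 8
inputs = torch.randn(batch_size, 10)
# torch.chunk cannot produce more chunks than rows, so only 4 chunks
# (of size 1) come out and the remaining 4 replicas receive no input
chunks = torch.chunk(inputs, num_devices, dim=0)
print(len(chunks), tuple(chunks[0].shape))  # 4 (1, 10)
```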
- I tried to train EfficientNet-L2 with each of nn.DataParallel and nn.DistributedDataParallel, but with nn.DataParallel I can use a batch_size 2x higher than with nn.DistributedDataParallel without CUDA out-of-memory errors. Does nn.DistributedDataParallel use 2x more GPU memory than nn.DataParallel?
DDP allocates dedicated CUDA buffers as communication buckets, so it will use more CUDA memory than DP, but the overhead should not be 2X the DP footprint: the total comm bucket size is roughly the same as the model size.
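A rough way to estimate that bucket overhead is to sum the model's parameter bytes; a sketch with a made-up toy model standing in for the real network:

```python
import torch
from torch import nn

model = nn.Linear(1000, 1000)  # toy stand-in for the real model
# DDP's gradient buckets hold one flattened copy of all gradients,
# so the extra memory is roughly one model's worth of parameters
bucket_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"{bucket_bytes / 2**20:.2f} MiB")  # ~3.82 MiB for fp32
```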