PyTorch nn.DataParallel

Hi,

I have access to a GPU cluster and can use up to 4 GPU cards with 20GB each, or a single GPU with 80GB.

I am wondering if training on multiple GPUs using nn.DataParallel is the same as increasing the batch size?

As far as I understand, nn.DistributedDataParallel splits a mini-batch into several smaller mini-batches for multi-GPU training. E.g. if my batch size is 16 and I train on 4 GPUs, each GPU would get a mini-batch containing 4 samples.

Now I am somewhat unsure how the computation is done. Is the loss, e.g., computed for each mini-batch of 4 samples and then averaged over the GPUs?
Then this would not be the same as computing the loss on a single GPU with batch size 16.

Or is the loss over the 4 GPUs computed in the same way as it would be on a single GPU?
Then what is preferable? Using 4 GPUs with a batch size of 4 each, or a single GPU with a batch size of 16?

Thanks in advance,

kind regards,

Mi

It is recommended to use nn.DistributedDataParallel (DDP) over nn.DataParallel.

In DDP, if you have K GPUs, then DDP will create K processes and copy the model to each of the K GPUs. Each process trains on its own batch (the per-GPU batch size is limited by your system RAM and the RAM of each individual GPU). The gradients from the K GPUs are then gathered and averaged, the model replica on each GPU is updated with the averaged gradients, and training continues.
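For reference, here is roughly what that looks like in code. This is only a minimal sketch with a toy model, random data, a single node, the NCCL backend and a hardcoded localhost/port rendezvous; the details will depend on your cluster setup:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # One process per GPU; each process holds its own model replica.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = nn.Linear(10, 1).to(rank)            # placeholder model
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for _ in range(10):
        x = torch.randn(4, 10, device=rank)      # per-GPU batch of 4 (dummy data)
        y = torch.randn(4, 1, device=rank)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        loss.backward()    # gradients are all-reduced (averaged) across processes here
        optimizer.step()   # every replica applies the same averaged gradients

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()        # K GPUs -> K processes
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```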

You determine the batch_size per GPU. Ideally, all the GPUs will be of the same type, i.e. all of them are RTX 3070s or all of them are RTX 3060s.
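Concretely, the batch_size you pass to each process's DataLoader is per GPU. Combined with a DistributedSampler, each rank then sees a different shard of the data. A small sketch (dummy dataset; assumes the process group from the snippet above has already been initialized):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))  # dummy data
sampler = DistributedSampler(dataset)        # gives each rank a disjoint shard
loader = DataLoader(dataset, batch_size=4, sampler=sampler)
# With 4 GPUs, each step processes 4 GPUs * 4 samples = an effective global batch of 16.
```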

Just for pure simplicity, my personal preference would be to select the single card with 80GB RAM. DDP is multiprocess, not multithreaded. You need a CPU with enough cores so that a lack of cores (remember, one process per GPU) does not become a bottleneck; context switching is costly.

Code written for single-process training (1 GPU) will work on most ML machines (*most, because you still need to satisfy the required libraries and software dependencies).

If your future machines don’t have multiple GPUs, then code written using DDP will need to be modified.