DataParallel() takes more time

If I use DataParallel() with a transformer model, the time for each epoch increases as we increase the number of GPUs.

Example: time for one epoch
1 GPU = 42 sec - 44 sec
4 GPUs = 175 sec - 180 sec

Why does performance decrease as we increase the number of GPUs? Is this a drawback of DataParallel()?


nn.DataParallel creates model replicas in each forward pass and thus needs to broadcast a lot of parameters. We generally recommend using DistributedDataParallel with a single process per device as the fastest approach (also on a single node with multiple GPUs).
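For anyone who wants to try the one-process-per-device setup, here is a minimal sketch. The tiny nn.Linear model, the random batch, and the helper names train_one_step and launch are placeholders I made up for illustration, not code from this thread. It falls back to the gloo backend on CPU so it can be tried without GPUs.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def train_one_step(rank: int, world_size: int, init_file: str) -> float:
    """Run a single training step in the process that owns device `rank`."""
    use_cuda = torch.cuda.is_available()
    backend = "nccl" if use_cuda else "gloo"  # gloo lets the sketch run on CPU
    dist.init_process_group(
        backend,
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
    )
    device = torch.device(f"cuda:{rank}") if use_cuda else torch.device("cpu")
    if use_cuda:
        torch.cuda.set_device(device)

    # Hypothetical stand-in model; replace with the real transformer.
    model = nn.Linear(16, 4).to(device)
    # Parameters are synchronized once here, not in every forward pass.
    ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    # Placeholder batch; a real script would read this rank's shard of the data.
    inputs = torch.randn(8, 16, device=device)
    targets = torch.randn(8, 4, device=device)

    loss = nn.functional.mse_loss(ddp_model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()  # gradients are all-reduced across processes here
    optimizer.step()

    dist.destroy_process_group()
    return loss.item()


def launch(world_size: int) -> None:
    """Start one process per device; mp.spawn passes the rank as first argument."""
    init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
    mp.spawn(train_one_step, args=(world_size, init_file), nprocs=world_size)
```

On a 4-GPU node you would call launch(4) from under an `if __name__ == "__main__":` guard. In a real training loop each process should also see a different slice of the dataset, for example via torch.utils.data.distributed.DistributedSampler.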

@gnadaf were you able to actually run the model by just adding nn.DataParallel(model)?
I can run my model just fine w/o DataParallel, but w/ DataParallel the training doesn’t proceed (kind of freezes).

Yes, I am able to run my code. We just need to add nn.DataParallel(model).
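For reference, the one-line change being discussed looks roughly like the sketch below. The nn.Linear model and the helper name wrap_for_multi_gpu are made up for illustration, and the wrapper is only applied when more than one GPU is visible, so the same script also runs on a single GPU or on CPU.

```python
import torch
import torch.nn as nn


def wrap_for_multi_gpu(model: nn.Module) -> nn.Module:
    """Move the model to the GPU (if any) and wrap it when several GPUs are visible."""
    if torch.cuda.is_available():
        model = model.cuda()  # DataParallel expects the parameters on the first device
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)  # splits each input batch across the GPUs
    return model


# Hypothetical stand-in model; replace with the real transformer.
model = wrap_for_multi_gpu(nn.Linear(16, 4))
device = next(model.parameters()).device

batch = torch.randn(8, 16, device=device)
output = model(batch)  # outputs are gathered back onto the first device
```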

Does that mean nn.DataParallel() has a drawback?
Are there no advantages to using nn.DataParallel?

nn.DataParallel certainly has advantages and it should speed up your training in some cases (try with a simple CNN + FC model).
However, as ptrblck mentioned, the major disadvantage of nn.DataParallel is that it creates model replicas in each forward pass and thus needs to broadcast a lot of parameters.
Since most transformer models are huge (w/ millions of parameters), the advantage of increased speed (due to data parallelism) is overshadowed by the reduction in speed due to broadcasting.
Hence, DistributedDataParallel is the recommended way.
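To get a feel for the broadcasting cost, here is a rough back-of-the-envelope estimate. The function name and the 100M-parameter figure are assumptions for illustration only. It counts just the parameter bytes received by the non-source GPUs per forward pass and ignores buffers, output gathering, and how the broadcast is actually implemented.

```python
def broadcast_bytes_per_forward(num_params: int, num_gpus: int, bytes_per_param: int = 4) -> int:
    """Parameter bytes copied to the other GPUs for one forward pass.

    Assumes the source GPU holds the parameters and every other GPU
    receives a full copy (float32 by default, i.e. 4 bytes per parameter).
    """
    return num_params * bytes_per_param * (num_gpus - 1)


# Hypothetical 100M-parameter model on 4 GPUs.
per_forward = broadcast_bytes_per_forward(100_000_000, 4)
print(f"{per_forward / 1e9:.1f} GB copied per forward pass")
```

With these assumed numbers that is about 1.2 GB moved on every iteration before any useful computation starts, while a small CNN with a few million parameters would move far less.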