nn.DataParallel creates model replicas in each forward pass and thus needs to broadcast a lot of parameters. We generally recommend using DistributedDataParallel with a single process per device as the fastest approach (also on a single node with multiple GPUs).
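For reference, here is a minimal sketch of the single-process-per-device setup on one node (the model, data, and port are just placeholders):

```python
# Minimal single-node DDP sketch: one process per GPU.
# The model, dummy data, and MASTER_PORT are placeholders.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(10, 10).to(rank)         # placeholder model
    ddp_model = DDP(model, device_ids=[rank])  # parameters are broadcast once, at construction

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    inputs = torch.randn(20, 10, device=rank)  # each process would get its own data shard
    labels = torch.randn(20, 10, device=rank)

    optimizer.zero_grad()
    loss = nn.functional.mse_loss(ddp_model(inputs), labels)
    loss.backward()                            # gradients are all-reduced across processes
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```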
@gnadaf were you able to actually run the model by just adding nn.DataParallel(model)?
I can run my model just fine w/o DataParallel, but w/ DataParallel the training doesn’t proceed (kind of freezes).
nn.DataParallel certainly has advantages, and it should speed up your training in some cases (try it with a simple CNN + FC model).
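For illustration, a rough sketch of that approach (the CNN + FC model here is just a placeholder):

```python
# Minimal nn.DataParallel sketch; the CNN + FC model is a placeholder.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3),    # simple CNN
    nn.Flatten(),
    nn.Linear(16 * 30 * 30, 10),        # FC head for 32x32 inputs
)
model = nn.DataParallel(model).cuda()   # replicated onto each visible GPU in every forward pass

x = torch.randn(64, 3, 32, 32).cuda()   # the batch is scattered across GPUs
out = model(x)                          # outputs are gathered back on the default device
```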
However, as ptrblck mentioned, the major disadvantage of nn.DataParallel is that it creates model replicas in each forward pass and thus needs to broadcast a lot of parameters.
Since most transformer models are huge (w/ millions of parameters), the speedup from data parallelism is overshadowed by the slowdown from broadcasting those parameters on every forward pass.
Hence, DistributedDataParallel is the recommended way.