nn.DataParallel creates model replicas in each forward pass and thus needs to broadcast a lot of parameters. We generally recommend using DistributedDataParallel with a single process per device as the fastest approach (also on a single node with multiple GPUs).
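For reference, here is a minimal sketch of the single-process-per-device setup on one node (the model, data, and port are just placeholders):

```python
# Minimal single-node DDP sketch: one process per GPU.
# The model, dummy data, and MASTER_PORT are placeholders.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(10, 10).to(rank)         # placeholder model
    ddp_model = DDP(model, device_ids=[rank])  # parameters are broadcast once, at construction

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    inputs = torch.randn(20, 10, device=rank)  # each process would get its own data shard
    labels = torch.randn(20, 10, device=rank)

    optimizer.zero_grad()
    loss = nn.functional.mse_loss(ddp_model(inputs), labels)
    loss.backward()                            # gradients are all-reduced across processes
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```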
@gnadaf were you able to actually run the model by just adding nn.DataParallel(model)?
I can run my model just fine w/o DataParallel, but w/ DataParallel the training doesn’t proceed (kind of freezes).
nn.DataParallel certainly has advantages, and it should speed up your training in some cases (try it with a simple CNN + FC model).
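For illustration, a rough sketch of that approach (the CNN + FC model here is just a placeholder):

```python
# Minimal nn.DataParallel sketch; the CNN + FC model is a placeholder.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3),    # simple CNN
    nn.Flatten(),
    nn.Linear(16 * 30 * 30, 10),        # FC head for 32x32 inputs
)
model = nn.DataParallel(model).cuda()   # replicated onto each visible GPU in every forward pass

x = torch.randn(64, 3, 32, 32).cuda()   # the batch is scattered across GPUs
out = model(x)                          # outputs are gathered back on the default device
```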
However, as ptrblck mentioned, the major disadvantage of nn.DataParallel is that it creates model replicas in each forward pass and thus needs to broadcast a lot of parameters.
Since most transformer models are huge (w/ millions of parameters), the speedup from data parallelism is overshadowed by the slowdown from broadcasting those parameters on every forward pass.
Hence, DistributedDataParallel is the recommended way.