What is the difference between DataParallel and DistributedDataParallel?

I am going through this imagenet example: https://github.com/pytorch/examples/blob/master/imagenet/main.py

In line 88, the DistributedDataParallel module is used. When I searched for it in the docs, I couldn't find anything. Could you point me to the documentation for this module, if any exists?

Otherwise, I would like to know what the difference is between the DataParallel and DistributedDataParallel modules.


DataParallel is for performing training on multiple GPUs within a single machine.
DistributedDataParallel is useful when you want to use multiple machines.
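
Roughly, the difference shows up in how you wrap the model. Below is a minimal sketch (not from the linked example); the toy Linear model is a placeholder, and the DDP part assumes a launcher has set the usual RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT and LOCAL_RANK environment variables:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# --- DataParallel: single process, single machine -------------------------
# One process replicates the module onto every visible GPU, scatters each
# input batch across them, and gathers the outputs back onto the default GPU.
dp_model = torch.nn.DataParallel(torch.nn.Linear(10, 10).cuda())

# --- DistributedDataParallel: one process per GPU, one or more machines ---
# Every participating process joins a process group and wraps its own replica;
# gradients are synchronized with an all-reduce during backward().
# Assumes the launcher has exported RANK, WORLD_SIZE, MASTER_ADDR,
# MASTER_PORT and LOCAL_RANK.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", init_method="env://")
ddp_model = DistributedDataParallel(
    torch.nn.Linear(10, 10).cuda(local_rank), device_ids=[local_rank]
)
```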


Sorry for resurrecting this old thread. The answer above has caused some confusion for folks I've talked to.

DistributedDataParallel can very much be advantageous performance-wise even for single-node, multi-GPU runs. When run in a one-GPU-per-process configuration, DistributedDataParallel can be beneficial because CPU-based overheads are spread across multiple processes.

Performance gains will be especially prominent in networks with many small layers/operations. I primarily recommend that folks use one-GPU-per-process DistributedDataParallel over DataParallel, even for single-node cases, if they want to scale past 2 GPUs.
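
To make that concrete, here is a minimal single-node sketch of the one-GPU-per-process setup, using the standard torch.distributed / torch.multiprocessing APIs; the loopback address, port, and toy Linear model are placeholders:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel

def worker(rank, world_size):
    # One process per GPU: each worker binds to its own device, so data loading,
    # loss computation, and Python-side dispatch overhead run in parallel
    # instead of being serialized in a single process as with DataParallel.
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # placeholder rendezvous address
    os.environ["MASTER_PORT"] = "29500"      # placeholder port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 10).cuda(rank)  # stand-in for a real network
    ddp_model = DistributedDataParallel(model, device_ids=[rank])

    # ... build a DistributedSampler-based DataLoader and run the usual
    # forward / backward / optimizer step loop here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```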


Can you please elaborate on "When run in a one-GPU-per-process configuration, DistributedDataParallel can be beneficial because CPU-based overheads are spread across multiple processes"? Thanks!


Totally agree with you!
"I primarily recommend that folks use one-GPU-per-process DistributedDataParallel over DataParallel, even for single-node cases, if they want to scale past 2 GPUs."