Also, on line 88, the module DistributedDataParallel is used. I searched the docs for it but couldn't find anything. Could you point me to the documentation, if any exists for this module?
Otherwise, I would like to know what the difference is between the DataParallel and DistributedDataParallel modules.
Sorry for resurrecting this old thread. The answer above caused some confusion among some folks I’ve talked to.
Distributed Data Parallel can very much be advantageous perf-wise for single-node multi-GPU runs. When run in a 1 gpu / process configuration Distributed Data Parallel can be beneficial as CPU based overheads are now spread across multiple processes.
Perf gains will be especially prominent in networks that have many small layers/operations. I primarily recommend to folks that they use single gpu / process Distributed Data Parallel over Data Parallel even for single node cases if they want to scale past 2 GPUs.
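For anyone wondering what the 1 gpu / process setup looks like in practice, here is a minimal sketch, not a definitive recipe. The names `run_worker` and `launch`, the port 29500, the toy `nn.Linear` model, and the random batch are all my own placeholders. It falls back to the `gloo` backend on CPU when NCCL or enough GPUs are not available, so it can be tried on a machine without GPUs.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run_worker(rank: int, world_size: int, port: int = 29500) -> float:
    """Run one training step in a process that owns at most one GPU."""
    # Rendezvous settings for the default env:// init method.
    # The address and port are illustrative assumptions.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", str(port))

    use_cuda = (
        torch.cuda.is_available()
        and dist.is_nccl_available()
        and torch.cuda.device_count() >= world_size
    )
    backend = "nccl" if use_cuda else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    try:
        device = torch.device(f"cuda:{rank}") if use_cuda else torch.device("cpu")
        if use_cuda:
            torch.cuda.set_device(device)

        # Toy model (assumption); each process holds its own replica.
        model = nn.Linear(10, 1).to(device)
        ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        # Random batch (assumption); real code would shard a dataset per rank.
        inputs = torch.randn(8, 10, device=device)
        targets = torch.randn(8, 1, device=device)

        loss = nn.functional.mse_loss(ddp_model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()  # gradients are averaged across processes here
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()


def launch(world_size: int) -> None:
    """Start one process per GPU (or a single CPU process if there are none)."""
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)
```

To start it, call `launch(max(torch.cuda.device_count(), 1))` from under an `if __name__ == "__main__":` guard. Because each process runs its own Python interpreter and drives only one device, the per-step CPU work (kernel launches, Python overhead) is not funneled through a single process the way it is with DataParallel.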
Can you please elaborate on “When run in a 1 gpu / process configuration Distributed Data Parallel can be beneficial as CPU based overheads are now spread across multiple processes”? Thanks!
Totally agree with you!
“I primarily recommend to folks that they use single gpu / process Distributed Data Parallel over Data Parallel even for single node cases if they want to scale past 2 GPUs.”