Across multiple nodes

nn.DataParallel requires that all the GPUs be on the same node.
nn.DistributedDataParallel - GPUs can be distributed across multiple nodes.
What is meant by "nodes" here?


A "node" in this context is a physical machine or VM. nn.DataParallel replicates the model and distributes the data across the GPUs attached to a single machine; all operations happen within a single system process, so there is no multi-process synchronization. nn.DistributedDataParallel can synchronize models across multiple machines: each GPU is typically driven by its own process, and gradients are synchronized across processes during backprop (which keeps the weights in sync), including over the network between machines. The former is a single-line conversion; the latter requires you to start the processes yourself, although torch.distributed.launch is a quick way to do this.
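As a rough illustration (a minimal sketch, not from this thread: the toy `nn.Linear` model is a placeholder, and the `LOCAL_RANK` environment variable assumes the script is started via torchrun or torch.distributed.launch with --use_env), the difference in setup looks like this:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# nn.DataParallel: single process, single node. One line wraps the model;
# inputs are scattered across the node's GPUs and outputs gathered on GPU 0.
dp_model = nn.DataParallel(nn.Linear(10, 10).cuda())

# nn.DistributedDataParallel: one process per GPU, possibly on different
# machines. Each process initializes the process group and wraps its own
# replica; gradients are all-reduced across processes during backward().
def ddp_worker():
    # The launcher (e.g. torchrun) sets LOCAL_RANK, plus MASTER_ADDR,
    # MASTER_PORT, RANK, and WORLD_SIZE needed by init_process_group.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    model = nn.Linear(10, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    return ddp_model
```

You would then launch one process per GPU on each node, e.g. `torchrun --nproc_per_node=4 train.py`, whereas the nn.DataParallel version runs as an ordinary single script.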
