nn.DataParallel requires that all the GPUs be on the same node.
nn.DistributedDataParallel allows GPUs to be distributed across multiple nodes.
What is meant by "nodes" here?
A node in this context is a physical machine or VM. nn.DataParallel
performs model replication and data distribution across the GPUs connected to a single machine. All operations happen within a single system process, so there is no multi-process synchronization. nn.DistributedDataParallel
can synchronize models across multiple machines. Each GPU typically runs in its own process, and gradients are synchronized during backprop; this synchronization can also happen over the network across multiple machines. Converting to the former is a one-line change. The latter requires you to start up the processes yourself, although torch.distributed.launch
is a quick way to do this.
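To make the difference concrete, here is a minimal sketch of both conversions. The model, hidden sizes, and the `local_rank` variable are illustrative placeholders; the DDP lines assume the environment variables that torch.distributed.launch (or torchrun) sets, and require CUDA hardware, so they are shown commented out.

```python
import torch.nn as nn

# A toy model used only for illustration.
model = nn.Linear(10, 2)

# nn.DataParallel: a one-line change. Single process, single machine;
# input batches are split across the locally visible GPUs.
dp_model = nn.DataParallel(model)

# nn.DistributedDataParallel: one process per GPU, launched externally
# (e.g. via torch.distributed.launch). Requires initializing the process
# group first; gradients are all-reduced across processes during backprop.
#
# import torch.distributed as dist
# dist.init_process_group(backend="nccl")  # reads env vars set by the launcher
# ddp_model = nn.parallel.DistributedDataParallel(
#     model.cuda(local_rank), device_ids=[local_rank]
# )
```

The DataParallel wrapper is a drop-in replacement for the model in your existing training loop; the DDP version additionally needs a per-process launcher and a distributed-aware data sampler.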