I’m new to distributed training.
When I train with DistributedDataParallel, do I get the functionality of DataParallel? That is, can I assume that on a single node with more than one GPU, all GPUs on that node will be utilized?
Yep, DistributedDataParallel (DDP) can utilize multiple GPUs on the same node, but it works differently from DataParallel (DP): DDP uses multiple processes, one process per GPU, while DP is single-process and multi-threaded.
See this page for the comparison between the two: https://pytorch.org/tutorials/beginner/dist_overview.html#data-parallel-training
and this to get started with DDP: https://pytorch.org/docs/stable/notes/ddp.html
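To make the "one process per GPU" point concrete, here is a minimal DDP sketch. It runs as a single process with `world_size=1` on the `"gloo"` backend so it works even on CPU; on a real multi-GPU node you would instead launch one process per GPU (e.g. with `torchrun --nproc_per_node=NUM_GPUS`) and typically use the `"nccl"` backend. The model and tensor sizes are toy assumptions for illustration.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process group for illustration; with torchrun, rank/world_size
# come from environment variables and there is one process per GPU.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
)

model = torch.nn.Linear(10, 1)  # toy model
ddp_model = DDP(model)          # gradients are all-reduced across ranks
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

loss = ddp_model(torch.randn(8, 10)).sum()
loss.backward()                 # gradient synchronization happens here
optimizer.step()

dist.destroy_process_group()
```

Each process owns its GPU and its shard of the data; DDP only synchronizes gradients during `backward()`, which is a big part of why it usually outperforms DP even on a single node.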
Great, thanks for the answer and references