PyTorch DistributedDataParallel Help

I am currently using nn.DataParallel to parallelize my model, but I am planning to switch to a setup that uses nn.DistributedDataParallel with multiple GPUs on a single machine. Is there a tutorial or some other information to help with this? I am a little confused about how to set up torch.distributed.init_process_group for a single job on a single machine.


Please check out this note and this tutorial.
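For the single-machine case specifically, a minimal sketch looks roughly like the following. The MASTER_ADDR/MASTER_PORT values, the toy nn.Linear model, and the nccl-with-gloo-fallback choice are all example assumptions here, not requirements; adapt them to your environment:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # Single machine: all processes rendezvous via localhost.
    # The address/port values are example choices, not fixed requirements.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    use_cuda = torch.cuda.is_available()
    backend = "nccl" if use_cuda else "gloo"  # nccl for GPUs; gloo as CPU fallback
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    device = torch.device(f"cuda:{rank}") if use_cuda else torch.device("cpu")
    model = nn.Linear(10, 10).to(device)  # toy model; one device per process
    ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)

    out = ddp_model(torch.randn(8, 10, device=device))
    out.sum().backward()  # gradients are synchronized across processes here

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = max(torch.cuda.device_count(), 1)
    # Launch one process per GPU; each process calls run(rank, world_size).
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```

The key difference from nn.DataParallel is that each GPU gets its own process, so init_process_group must be called once per process with that process's rank.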

BTW, for questions related to distributed training, please add a “distributed” tag, so that the team monitoring this tag can get back to you promptly.

Thanks, I’ll look into those and try that.