PyTorch DistributedDataParallel Help

I am currently using nn.DataParallel to parallelize my model, but I am planning to switch to a setup that uses nn.DistributedDataParallel with multiple GPUs on a single machine. Is there a tutorial or some other information to help with this? I am a little confused about how to set up torch.distributed.init_process_group for a single job on a single machine.


Please check out this note and this tutorial.
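For the single-machine case specifically, a minimal sketch looks roughly like the following. The MASTER_ADDR/MASTER_PORT values, the toy nn.Linear model, and the nccl-with-gloo-fallback choice are all example assumptions here, not requirements; adapt them to your environment:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # Single machine: all processes rendezvous via localhost.
    # The address/port values are example choices, not fixed requirements.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    use_cuda = torch.cuda.is_available()
    backend = "nccl" if use_cuda else "gloo"  # nccl for GPUs; gloo as CPU fallback
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    device = torch.device(f"cuda:{rank}") if use_cuda else torch.device("cpu")
    model = nn.Linear(10, 10).to(device)  # toy model; one device per process
    ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)

    out = ddp_model(torch.randn(8, 10, device=device))
    out.sum().backward()  # gradients are synchronized across processes here

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = max(torch.cuda.device_count(), 1)
    # Launch one process per GPU; each process calls run(rank, world_size).
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```

The key difference from nn.DataParallel is that each GPU gets its own process, so init_process_group must be called once per process with that process's rank.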

BTW, for questions related to distributed training, please add a “distributed” tag, so that the team monitoring this tag can get back to you promptly.

Thanks, I’ll look into those and try that.