PyTorch DistributedDataParallel Help

I am currently using nn.DataParallel to parallelize my model, but I am planning to switch to a setup that uses nn.DistributedDataParallel with multiple GPUs on a single machine. Is there a tutorial or some other information to help with this? I am a little confused about how to set up torch.distributed.init_process_group for a single job on a single machine.


Please check out this note and this tutorial.
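For the single-machine case specifically, a minimal sketch looks roughly like the following. The MASTER_ADDR/MASTER_PORT values, the toy nn.Linear model, and the nccl-with-gloo-fallback choice are all example assumptions here, not requirements; adapt them to your environment:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # Single machine: all processes rendezvous via localhost.
    # The address/port values are example choices, not fixed requirements.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    use_cuda = torch.cuda.is_available()
    backend = "nccl" if use_cuda else "gloo"  # nccl for GPUs; gloo as CPU fallback
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    device = torch.device(f"cuda:{rank}") if use_cuda else torch.device("cpu")
    model = nn.Linear(10, 10).to(device)  # toy model; one device per process
    ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)

    out = ddp_model(torch.randn(8, 10, device=device))
    out.sum().backward()  # gradients are synchronized across processes here

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = max(torch.cuda.device_count(), 1)
    # Launch one process per GPU; each process calls run(rank, world_size).
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```

The key difference from nn.DataParallel is that each GPU gets its own process, so init_process_group must be called once per process with that process's rank.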

BTW, for questions related to distributed training, please add a “distributed” tag, so that the team monitoring this tag can get back to you promptly.

Thanks, I’ll look into those and try that.