How to implement distributed model parallel using torch.distributed

xdwang0726 · September 17, 2020, 2:29pm

I am trying to implement model parallelism using torch.distributed (DistributedDataParallel), and I am wondering whether there is a tutorial for that (single node, multiple GPUs). I know nn.DataParallel is easier to use, but I use another package that only supports torch.distributed. Thanks!

mrshenli · September 17, 2020, 2:59pm

Yep, here is the tutorial: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#combine-ddp-with-model-parallelism

It should work if you place different layers of the model on different GPUs and pass the model to DDP. DDP should be able to detect that it is a multi-GPU model. One caveat: please make sure that no GPU is shared across processes.
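As a rough illustration of that setup, here is a minimal sketch. It is not the tutorial's exact code: the names ToyMpModel, devices_for_rank and run_step, the layer sizes, and the port 29500 are all illustrative assumptions, and the CPU/gloo fallback is only there so the sketch can be smoke-tested on a machine without enough GPUs. Each process takes its own pair of GPUs (rank 0 gets 0 and 1, rank 1 gets 2 and 3, and so on), so no GPU is shared across processes, and DDP is constructed without device_ids because the module spans more than one device.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyMpModel(nn.Module):
    """Hypothetical two-layer model with each layer on its own device."""

    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.net1 = nn.Linear(10, 10).to(dev0)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5).to(dev1)

    def forward(self, x):
        # Move activations to whichever device holds the next layer.
        x = self.relu(self.net1(x.to(self.dev0)))
        return self.net2(x.to(self.dev1))


def devices_for_rank(rank, world_size):
    # Give every process a disjoint pair of GPUs: rank r uses 2r and 2r + 1.
    # Assumption: fall back to CPU when there are fewer than 2 GPUs per process.
    if torch.cuda.device_count() >= 2 * world_size:
        return torch.device("cuda", 2 * rank), torch.device("cuda", 2 * rank + 1)
    return torch.device("cpu"), torch.device("cpu")


def run_step(rank, world_size):
    # Rendezvous settings are placeholders; adjust them for your environment.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")

    dev0, dev1 = devices_for_rank(rank, world_size)
    backend = "nccl" if dev0.type == "cuda" else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    try:
        model = ToyMpModel(dev0, dev1)
        # Do not pass device_ids for a module that spans multiple devices.
        ddp_model = DDP(model)

        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.001)
        optimizer.zero_grad()
        outputs = ddp_model(torch.randn(20, 10))  # output lives on dev1
        labels = torch.randn(20, 5).to(dev1)
        loss = nn.functional.mse_loss(outputs, labels)
        loss.backward()  # gradients are all-reduced across processes here
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()
```

On a node with four GPUs, two such processes could be started with torch.multiprocessing.spawn(run_step, args=(2,), nprocs=2), which passes the process index as the first argument (the rank).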

xdwang0726 · September 20, 2020, 11:29am

I followed the online tutorial, but I got the error message "RuntimeError: Model replicas must have an equal number of parameters." at the line model = torch.nn.parallel.DistributedDataParallel(model)
Any idea what might cause this issue? Thanks!

pritamdamania87 · September 23, 2020, 2:59am

@xdwang0726 Do you see this error when using the code in the tutorial? Or is it some custom code based on the tutorial? If it is the latter, could you provide a minimal repro of the issue?