I am trying to implement model parallelism with torch.distributed (DistributedDataParallel). Is there a tutorial for that (single node, multiple GPUs)? I know nn.DataParallel is easier to use, but I rely on another package that only supports torch.distributed. Thanks!
Yep, here is the tutorial: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#combine-ddp-with-model-parallelism
It should work if you place different layers of the model on different GPUs and pass the model to DDP. DDP should be able to detect that it is a multi-GPU model. One caveat: make sure that no GPU is shared across processes.
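As a rough sketch of that setup, closely following the linked tutorial: each process owns its own pair of devices, the model places one layer on each, and DDP wraps the model without device_ids. The names ToyMpModel and devices_for_rank, the port 12355, and the CPU fallback are illustrative assumptions, not something DDP requires.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def devices_for_rank(rank, use_cuda):
    """Give each process its own pair of devices so no GPU is shared."""
    if use_cuda:
        return torch.device(f"cuda:{2 * rank}"), torch.device(f"cuda:{2 * rank + 1}")
    # Assumption: CPU fallback so the sketch still runs without GPUs.
    return torch.device("cpu"), torch.device("cpu")


class ToyMpModel(nn.Module):
    """Hypothetical two-layer model with each layer on a different device."""

    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.net1 = nn.Linear(10, 10).to(dev0)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5).to(dev1)

    def forward(self, x):
        # Move activations to whichever device holds the next layer.
        x = self.relu(self.net1(x.to(self.dev0)))
        return self.net2(x.to(self.dev1))


def run(rank, world_size, use_cuda):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")  # arbitrary free port
    backend = "nccl" if use_cuda else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    dev0, dev1 = devices_for_rank(rank, use_cuda)
    # For a multi-device module, leave device_ids and output_device unset.
    ddp_model = DDP(ToyMpModel(dev0, dev1))

    # One training step on random data, just to exercise the wrapper.
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.001)
    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(dev1)
    nn.functional.mse_loss(outputs, labels).backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    use_cuda = torch.cuda.device_count() >= 2
    # Two GPUs per process; with no GPUs, spawn two CPU processes instead.
    world_size = torch.cuda.device_count() // 2 if use_cuda else 2
    mp.spawn(run, args=(world_size, use_cuda), nprocs=world_size, join=True)
```

With 4 GPUs this would spawn 2 processes, using GPUs 0-1 and 2-3 respectively, so no device is shared between processes.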
I followed the online tutorial, but I got this error message:
RuntimeError: Model replicas must have an equal number of parameters.
It is raised at this line:
model = torch.nn.parallel.DistributedDataParallel(model)
Any idea what might cause this issue? Thanks!
@xdwang0726 Do you see this error when using the code in the tutorial? Or is it some custom code based on the tutorial? If it is the latter, could you provide a minimal repro of the issue?
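For the repro, it may help to print a short parameter summary just before the DistributedDataParallel call in each process, since the error is about parameter counts differing between replicas. A minimal sketch; the helper name summarize_parameters is hypothetical, and this only reports what the model looks like, it does not diagnose the cause by itself.

```python
import torch.nn as nn


def summarize_parameters(model):
    """Return (tensor count, total element count, sorted device names)."""
    params = list(model.parameters())
    devices = sorted({str(p.device) for p in params})
    return len(params), sum(p.numel() for p in params), devices


if __name__ == "__main__":
    # Stand-in module; in the repro, pass the model that is handed to DDP.
    print(summarize_parameters(nn.Linear(10, 5)))
```

If the counts differ between processes, or the devices listed are not the ones you expected for that process, that would be worth including in the report.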