Multi-node (multiple GPUs per node) training with model parallelism

I want to train a large network using model parallelism on multiple machines (multiple GPUs per machine). For that I am following this article:

https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#combine-ddp-with-model-parallelism

This article doesn't set up any multi-machine cluster, so how will it train on multiple machines? I am also not able to understand the following terms in my scenario:

world size
rank
spawn
processes
process group

I have already installed NCCL on all nodes. How can I make this work?

This is a good place to start:

The example script and README show how to set up multi-node training for ImageNet. You may also want to try PyTorch Lightning, which has a simple API for multi-node training:

https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html
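To make the terms concrete, here is a minimal sketch (my own, not the linked example) of how rank, world size, spawn, and the process group fit together for plain multi-node DDP. It assumes 2 nodes with 2 GPUs each; the master address/port and the NODE_RANK environment variable are placeholders you would set yourself:

```python
# Minimal sketch, assuming 2 nodes with 2 GPUs each (4 processes total).
# The master address/port and NODE_RANK are placeholders, not from the linked example.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(local_rank, node_rank, gpus_per_node, world_size):
    # rank: the unique global id of this process across all machines.
    rank = node_rank * gpus_per_node + local_rank

    # process group: the communication layer (NCCL here) that connects
    # all world_size processes so they can exchange gradients.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://10.0.0.1:23456",  # hypothetical address of node 0
        rank=rank,
        world_size=world_size,
    )

    torch.cuda.set_device(local_rank)
    model = torch.nn.Linear(10, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    # ... training loop using ddp_model ...

    dist.destroy_process_group()

if __name__ == "__main__":
    gpus_per_node = 2                          # GPUs on this machine
    node_rank = int(os.environ["NODE_RANK"])   # 0 on the first machine, 1 on the second
    world_size = 2 * gpus_per_node             # world size: total processes across all machines

    # spawn: start one process per local GPU on this machine.
    mp.spawn(worker, args=(node_rank, gpus_per_node, world_size), nprocs=gpus_per_node)
```

You would run the same script on every machine, changing only NODE_RANK and pointing init_method at the first node.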


If you want to explore model parallelism in a distributed environment, you need to use the Distributed RPC framework.

The tutorial page of DDP + RPC can be found here:
https://pytorch.org/tutorials/advanced/rpc_ddp_tutorial.html#

This example is perfect. But will it work with model parallelism?

That example shows how to use DDP on multiple nodes, but model parallelism requires RPC in PyTorch.
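To give a rough idea of what that looks like, here is a sketch (my own, not taken from the tutorial) of cross-machine model parallelism with the RPC framework: the first half of a toy model runs locally on worker0 and the second half on worker1. The worker names, master address/port, and layer sizes are placeholders:

```python
# Hypothetical sketch of model parallelism across two machines with torch.distributed.rpc.
# Worker names, master address/port, and layer sizes are placeholders.
import os
import torch
import torch.distributed.rpc as rpc

class Shard(torch.nn.Module):
    """One piece of the model, living on whichever machine constructs it."""
    def __init__(self, in_features, out_features, device):
        super().__init__()
        self.device = device
        self.net = torch.nn.Linear(in_features, out_features).to(device)

    def forward(self, x):
        # Tensors cross the RPC boundary on CPU; move onto this shard's GPU and back.
        return self.net(x.to(self.device)).cpu()

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "10.0.0.1"  # hypothetical address of node 0
    os.environ["MASTER_PORT"] = "29500"
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)

    if rank == 0:
        # The first shard stays on this machine; the second is created on worker1.
        shard1 = Shard(32, 16, "cuda:0")
        shard2_rref = rpc.remote("worker1", Shard, args=(16, 8, "cuda:0"))

        x = torch.randn(4, 32)
        h = shard1(x)
        # Run the remote shard's forward on its owner and fetch the result.
        out = shard2_rref.rpc_sync().forward(h)
        print(out.shape)  # torch.Size([4, 8])

    rpc.shutdown()  # blocks until every worker is done

if __name__ == "__main__":
    # Launch this script once per machine, e.g. RANK=0 on the first node, RANK=1 on the second.
    run(int(os.environ["RANK"]), world_size=2)
```

For the backward pass and parameter updates you would wrap this in a distributed autograd context and use DistributedOptimizer, which the DDP + RPC tutorial linked above walks through.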

To conclude: single-machine model parallelism can be done as shown in the article I linked in my question; multi-node training without model parallelism (with DDP) is shown in the example listed by @conrad; and multi-node training with model parallelism can only be implemented using PyTorch RPC. Is that right, @wayi?

You are totally right!

RPC is the only way to support model parallelism in PyTorch distributed training. There may be some higher-level APIs in the future, but they will all use RPC under the hood.