Multi-node (multiple GPUs per node) training with model parallelism

I want to train a large network using model parallelism across multiple machines, each with multiple GPUs.
To do that, I am following this article

The article doesn't set up a multi-machine cluster, so how would it train on multiple machines? I also don't understand what the following terms mean in my scenario:

world size
process group

I have already installed NCCL on all nodes. How can I make this work?

This is a good place to start:

The example script and README show how to set up multi-node training for ImageNet. You may also want to try PyTorch Lightning, which has a simple API for multi-node training.
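
On the two terms from the question: the world size is the total number of processes taking part in the job across all machines, and a process group is a set of those processes that can communicate with each other (the default group contains all of them). Here is a minimal sketch of how rank and world size are typically derived, assuming the usual DDP layout of one process per GPU. The helper name `process_env`, the address `10.0.0.1`, and the port `29500` are placeholders I made up for illustration. With model parallelism, one process may drive several GPUs, in which case the world size counts processes, not GPUs.

```python
def process_env(node_rank, local_rank, num_nodes, gpus_per_node,
                master_addr, master_port=29500):
    """Build the environment variables for one training process.

    Hypothetical helper: assumes one process per GPU.
    """
    # Total number of processes in the job, across all machines.
    world_size = num_nodes * gpus_per_node
    # Globally unique id of this process, in the range [0, world_size).
    rank = node_rank * gpus_per_node + local_rank
    return {
        "MASTER_ADDR": master_addr,        # address of the rank-0 machine
        "MASTER_PORT": str(master_port),   # free port on the rank-0 machine
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
        "LOCAL_RANK": str(local_rank),     # GPU index within this machine
    }


if __name__ == "__main__":
    # Example: 2 machines with 4 GPUs each.
    for node in range(2):
        for gpu in range(4):
            env = process_env(node, gpu, num_nodes=2, gpus_per_node=4,
                              master_addr="10.0.0.1")
            print(f"node {node} gpu {gpu}: rank {env['RANK']} of {env['WORLD_SIZE']}")
```

Once these variables are set in each process, `torch.distributed.init_process_group(backend="nccl", init_method="env://")` reads `MASTER_ADDR`, `MASTER_PORT`, `RANK` and `WORLD_SIZE` and creates the default process group. A launcher such as `torchrun` sets them for you, so you normally don't compute them by hand.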


If you want to explore model parallelism in a distributed environment, you need to use the Distributed RPC framework.

The tutorial page of DDP + RPC can be found here:

This example is perfect. Will it work with model parallelism, though?

That example shows how to use DDP on multiple nodes, but model parallelism requires RPC in PyTorch.
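
To make that concrete, here is a rough sketch of splitting a model across two processes with `torch.distributed.rpc`. It is forward-only and runs on CPU tensors. The class names, layer sizes, and worker names (`worker0`, `worker1`) are made up for illustration and are not taken from the tutorial.

```python
import torch
import torch.nn as nn
import torch.distributed.rpc as rpc


class BackStage(nn.Module):
    """Second half of the model; constructed on the remote worker."""

    def __init__(self, hidden=32, out_features=10):
        super().__init__()
        self.fc = nn.Linear(hidden, out_features)

    def forward(self, x):
        return self.fc(x)


class SplitModel(nn.Module):
    """First half runs locally, second half runs on `remote_worker` via RPC."""

    def __init__(self, remote_worker, in_features=16, hidden=32):
        super().__init__()
        self.front = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU())
        # rpc.remote builds BackStage in the other process and returns an RRef to it.
        self.back_rref = rpc.remote(remote_worker, BackStage, args=(hidden,))

    def forward(self, x):
        h = self.front(x)
        # Run BackStage.forward on the RRef's owner and wait for the result.
        return self.back_rref.rpc_sync().forward(h)


def run(rank, world_size):
    # MASTER_ADDR and MASTER_PORT must be set in the environment of every process.
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        model = SplitModel("worker1")
        out = model(torch.randn(4, 16))
        print(out.shape)
    # Blocks until all workers are finished, so worker1 stays up to serve requests.
    rpc.shutdown()
```

You would call `run(0, 2)` on one machine and `run(1, 2)` on the other. For actual training you would also need distributed autograd (`torch.distributed.autograd`) and `torch.distributed.optim.DistributedOptimizer`, so that gradients and optimizer steps reach the remote parameters. I left those out to keep the sketch short.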

So, to conclude: single-machine model parallelism can be done as shown in the article I linked in my question; multi-node training without model parallelism (using DDP) is shown in the example linked by @conrad; and multi-node training with model parallelism can only be implemented using PyTorch RPC. Is that right, @wayi?

You are totally right!

RPC is the only way to support model parallelism in PyTorch distributed training. There may be some higher-level APIs in the future, but they will all use RPC under the hood.