I want to train a large network using model parallelism across multiple machines (multiple GPUs per machine),
and for that I am following this article
This article doesn’t set up a multi-machine cluster, so how will it train on multiple machines? Also, I am unable to understand the following terms in my scenario,
I have already installed NCCL on all nodes. How can I make it work?
This is a good place to start:
The example script and README show how to set up multi-node training for ImageNet. You may also want to try out PyTorch Lightning, which has a simple API for multi-node training:
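To make the idea concrete, here is a minimal sketch of multi-node DDP with the NCCL backend. The model, hyperparameters, and node counts are placeholder assumptions, not from the linked example; the key point is that a launcher such as `torchrun` sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` for every process on every node, and `init_process_group` ties them together over NCCL:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Toy stand-in model (assumption for illustration only).
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(10, 5)

    def forward(self, x):
        return self.net(x)

def main():
    # torchrun exports RANK, LOCAL_RANK and WORLD_SIZE on each node;
    # the "nccl" backend uses the NCCL you already installed.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(ToyModel().cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        opt.zero_grad()
        loss = model(torch.randn(32, 10, device=local_rank)).sum()
        loss.backward()  # DDP all-reduces gradients across every process here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__" and torch.cuda.is_available():
    main()
```

You would then run something like `torchrun --nnodes=2 --nproc_per_node=8 --rdzv_endpoint=<master_ip>:29400 train.py` on each machine (node counts and port are assumptions; adjust to your cluster). Note this replicates the whole model on every GPU, so it is data parallelism, not model parallelism.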
If you want to explore model parallelism in a distributed environment, you need to use the Distributed RPC framework.
The tutorial page of DDP + RPC can be found here:
This example is perfect. But will it work with model parallelism?
That example shows how to use DDP on multiple nodes, but model parallelism requires the RPC framework in PyTorch.
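To illustrate the difference, here is a hedged sketch of model parallelism with `torch.distributed.rpc`: one stage of the model lives on a remote worker as an `RRef`, and the driver calls into it. The worker name, port, and two-stage split are assumptions for illustration; the sketch uses `world_size=1` so the "remote" worker is the same process, but the same calls work across machines:

```python
import os
import torch
import torch.nn as nn
import torch.distributed.rpc as rpc

# Hypothetical second stage of a model, hosted on a (possibly remote) worker.
class Stage2(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(20, 5)

    def forward(self, x):
        return self.net(x)

def run():
    # Single-process sketch: with world_size=1 the "remote" worker is this
    # same process. On a real cluster each node calls init_rpc with its own
    # rank and a shared MASTER_ADDR/MASTER_PORT.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    rpc.init_rpc("worker0", rank=0, world_size=1)

    # Construct Stage2 on the remote worker; we only hold an RRef to it.
    stage2_rref = rpc.remote("worker0", Stage2)

    x = torch.randn(3, 20)
    hidden = nn.Linear(20, 20)(x)                  # stage 1 runs locally
    out = stage2_rref.rpc_sync().forward(hidden)   # stage 2 runs on the owner
    rpc.shutdown()
    return out.shape
```

Calling `run()` needs a free rendezvous port; on multiple machines you would give each process a distinct name and rank, and `rpc.remote` would place `Stage2` on a genuinely different node, which is exactly what DDP alone cannot do.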
In conclusion: single-machine model parallelism can be done as shown in the article I linked in my question; multi-node training without model parallelism (with DDP) is shown in the example listed by @conrad; and multi-node training with model parallelism can only be implemented using PyTorch RPC. Is that right, @wayi?
You are totally right!
RPC is the only way to support model parallelism in PyTorch distributed training. There may be some higher-level APIs in the future, but they will all use RPC under the hood.