I want to train a large network using model parallelism on multiple machines (multiple GPUs per machine),
for that I am following this article
This article doesn’t set up any multi machine cluster, so how will it train on multiple machines? Also I am not able to understand following terms in my scenario,
I have already installed NCCL in all nodes. How can I make it work?