I’m exploring how to train a model on a cluster of multiple GPU machines. Each machine has 8 GPU cards, so for example, a job running on 4 nodes would use 32 GPUs in total.
@Zrachel, are you using InfiniBand? If so, we have also developed an “nccl” backend, which is available in the current master branch and works reliably over InfiniBand.
If you are using Ethernet, I think either the “gloo” or the “nccl” backend will work.
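To make the backend choice concrete, here is a minimal sketch of initializing a `torch.distributed` process group and running one collective op. It is illustrative only: it runs as a single process (`world_size=1`) with the “gloo” backend so it works on any machine; on the 4-node × 8-GPU cluster above you would launch 32 processes (one per GPU), pass `backend="nccl"`, and let your launcher set the rank, world size, and master address. The address/port values below are placeholders.

```python
import os
import torch
import torch.distributed as dist

# Placeholder rendezvous settings; a real launcher (e.g. torchrun) sets these
# per process across all nodes.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")

# "gloo" for this CPU-only demo; use "nccl" on GPU machines, especially
# over InfiniBand, as discussed above.
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# One collective op to confirm the process group works: element-wise sum
# of this tensor across all ranks.
t = torch.ones(4)
dist.all_reduce(t, op=dist.ReduceOp.SUM)
# With world_size=1 the tensor is unchanged: four ones.

dist.destroy_process_group()
```

With more than one process, `all_reduce` would leave every rank holding the element-wise sum of all ranks’ tensors; the backend argument is the only thing that changes between an Ethernet (“gloo”) and an InfiniBand (“nccl”) deployment.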