The most reliable backend in distributed PyTorch to support GPU training

Hi @teng-li and @apaszke,

I’m looking into training a model on a cluster of multiple GPU machines, where each machine has 8 GPU cards. For example, a job on 4 nodes would use 32 GPU cards.

The backend I’m most familiar with is MPI. However, according to How to properly use distributed pytorch with infiniband support, the MPI implementation doesn’t support GPU training, and Gloo still has some problems.

Can you give me some advice on training with multiple cards across multiple machines? What is the most reliable backend?
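
For context, this is roughly the launch logic I have in mind, one process per GPU card (just a sketch; the environment variable names, master address, port, and the backend string are placeholders I made up, and the backend is exactly the part I’m unsure about):

```python
import os
import torch.distributed as dist

# 4 nodes x 8 GPU cards per node = 32 processes, one per card.
nodes = 4
gpus_per_node = 8
world_size = nodes * gpus_per_node             # 32

# NODE_RANK / LOCAL_RANK are placeholder names my launcher would set.
node_rank = int(os.environ["NODE_RANK"])       # 0..3
local_rank = int(os.environ["LOCAL_RANK"])     # 0..7
rank = node_rank * gpus_per_node + local_rank  # global rank 0..31

backend = "mpi"  # or "gloo" / "nccl" -- this is the choice I am asking about
dist.init_process_group(
    backend=backend,
    init_method="tcp://master-node:23456",     # placeholder master address/port
    world_size=world_size,
    rank=rank,
)
```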

@Zrachel, are you using InfiniBand? If so, we have also developed an “nccl” backend in the current master branch; it is available to use and works fairly reliably with InfiniBand.

If you are using Ethernet, I think either the “gloo” or the “nccl” backend will work.
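
For reference, initializing the “nccl” backend looks roughly like this (a minimal sketch, assuming one process per GPU as in your setup; the environment variable names, master address, port, and the tiny stand-in model are placeholders, not an official example):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Rank bookkeeping as in your post; RANK / LOCAL_RANK are placeholder env vars.
rank = int(os.environ["RANK"])              # global rank, 0..31
local_rank = int(os.environ["LOCAL_RANK"])  # card index on this node, 0..7

dist.init_process_group(
    backend="nccl",                          # "gloo" is the drop-in alternative
    init_method="tcp://master-node:23456",   # placeholder master address/port
    world_size=32,
    rank=rank,
)

# One process per GPU: pin this process to its card before building the model.
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for the real model
model = DistributedDataParallel(model, device_ids=[local_rank])

# From here, a normal training loop; DistributedDataParallel all-reduces
# gradients across all 32 processes during backward().
```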

Thank you. Are there any examples using nccl?

And there might be some bugs here.