NCCL backend and `init_method` with MPI

dashesy · June 8, 2018, 4:51pm

I am trying to understand why init_method is needed in the distributed package.

NCCL2 only needs a way to broadcast the master’s unique ID (from ncclGetUniqueId) among the nodes, and it suggests MPI_Bcast, which I have used and which works. So why is there no MPI:// for init_method? That seems like the more common choice.

All the other init methods (even file://) seem to rely on a master address and port, which is difficult if the ports are not open between nodes or if a node has more than one network interface (NIC). I have now found that file:// may hang if there are multiple adapters and some of them do not resolve (Docker, …).

Is there a way to have a custom init_method, so that I can use MPI (which is already set up)? I want to use MPI for init, not as the backend.
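To make the request concrete, here is a minimal sketch of a partial workaround, assuming mpi4py is installed and the job is started with mpirun. MPI is used only so that all ranks agree on the rank, the world size, and a tcp:// rendezvous URL, and NCCL stays the backend. The helper names (`find_free_port`, `build_tcp_init_method`, `init_nccl_via_mpi`) and the `master_host` parameter are hypothetical, not part of PyTorch. This does not remove the need for one reachable TCP port on rank 0, so it is not the MPI:// init method asked about above.

```python
import socket


def find_free_port(host: str = "") -> int:
    """Ask the OS for a currently unused TCP port on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the OS pick a free port
        return s.getsockname()[1]


def build_tcp_init_method(host: str, port: int) -> str:
    """Format a tcp:// rendezvous URL for torch.distributed."""
    return f"tcp://{host}:{port}"


def init_nccl_via_mpi(master_host: str) -> None:
    """Hypothetical helper: use MPI only to agree on rank, world size, and URL.

    `master_host` is an assumption: an address of rank 0 that every rank can
    reach. It is passed explicitly because hostname lookup may pick the wrong
    interface on machines with several NICs.
    """
    # Imported lazily so the helpers above work without MPI or PyTorch.
    from mpi4py import MPI
    import torch.distributed as dist

    comm = MPI.COMM_WORLD
    rank, world_size = comm.Get_rank(), comm.Get_size()

    # Rank 0 picks the rendezvous endpoint; the other ranks receive it over MPI.
    url = build_tcp_init_method(master_host, find_free_port()) if rank == 0 else None
    url = comm.bcast(url, root=0)

    dist.init_process_group(
        backend="nccl", init_method=url, rank=rank, world_size=world_size
    )
```

Each rank would call `init_nccl_via_mpi("<address of rank 0>")` once at startup. Note that the port found by `find_free_port` could be taken by another process before PyTorch binds it, so this is a sketch rather than a robust solution.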

I created a feature request here.