I am trying to understand why init_method is needed in the distributed package.
NCCL2 only needs a way to broadcast the ID that the master gets from ncclGetUniqueId to the other nodes, and it suggests MPI_Bcast for this, which I have used and which works. So why is there no MPI:// option for init_method? That seems like the more common choice.
All other init methods (even File) seem to rely on a master address and port, but that is difficult if the ports are not open between nodes, or if there is more than one network interface (NIC). I have also found that file:// may hang if there are multiple adapters and some of them do not resolve (docker, …).
Is there a way to have a custom init_method, so that I can use MPI (which is already set up)? I want to use MPI for init, not as the backend.
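To make the idea concrete, here is a rough sketch of the closest workaround I can think of with the existing API, assuming mpi4py is installed and the job is launched with mpirun. Rank 0 picks a host and a free port, MPI broadcasts them, and every rank builds an ordinary tcp:// init_method from the result. The helper names (pick_free_port, build_init_method, init_with_mpi) are made up for illustration and are not part of PyTorch. Note that this still goes through a TCP rendezvous, so it does not help when ports are closed between nodes; it only removes the need to configure the master address and port by hand.

```python
import socket


def pick_free_port():
    """Ask the OS for a currently unused TCP port on this host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


def build_init_method(rank, bcast, host=None):
    """Rank 0 chooses host:port; bcast(obj, root) shares it with every rank.

    `bcast` is any callable with MPI-style broadcast semantics (hypothetical
    parameter, used here so the helper does not depend on MPI directly).
    """
    endpoint = None
    if rank == 0:
        endpoint = (host or socket.gethostname(), pick_free_port())
    host, port = bcast(endpoint, 0)
    return f"tcp://{host}:{port}"


def init_with_mpi(backend="nccl"):
    """Initialise torch.distributed using MPI only for the rendezvous."""
    # Imported lazily so the helpers above work without MPI or PyTorch installed.
    from mpi4py import MPI
    import torch.distributed as dist

    comm = MPI.COMM_WORLD
    rank, world_size = comm.Get_rank(), comm.Get_size()
    init_method = build_init_method(
        rank, lambda obj, root: comm.bcast(obj, root=root)
    )
    dist.init_process_group(
        backend, init_method=init_method, rank=rank, world_size=world_size
    )
```

Each process would call init_with_mpi() once at startup, for example under `mpirun -np 4 python train.py`. With several NICs, the host passed on rank 0 would have to be an address the other nodes can actually reach, which is exactly the part a real MPI-based init_method would not need.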
I created a feature request here.