I am trying to understand why
init_method is needed in the distributed package.
NCCL2 only needs a way to broadcast the unique ID that the master gets from
ncclGetUniqueId to the other nodes, and it suggests
MPI_Bcast for that, which I have used and which works. So why is there no
MPI init_method? That seems like the more common choice.
All other init methods (even File) seem to rely on a master address and port, but that is difficult if the ports are not open between nodes, or if there is more than one network interface (NIC). I have now found that
file:// may hang if there are multiple adapters and some of them do not resolve (docker, …).
Is there a way to have a custom
init_method, so that I can use MPI (which is already set up)? I want to use
MPI for init, not as the backend.
I created a feature request here.
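
In the meantime, one possible stopgap is to let MPI broadcast the master's address and port and then fall back to the existing env:// init method, which reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment. This is only a minimal sketch under assumptions, not a real MPI init_method: the helper names (find_free_port, setup_env_rendezvous, init_with_mpi) are hypothetical, mpi4py is assumed to be installed, and the rendezvous still opens a TCP connection to the master, so it does not help when ports are blocked between nodes.

```python
import os
import socket
from typing import Any, Callable, Optional, Tuple


def find_free_port() -> int:
    """Ask the OS for a TCP port that is currently unused on this host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


def setup_env_rendezvous(
    rank: int,
    world_size: int,
    bcast: Callable[[Any], Any],
    master_addr: Optional[str] = None,
) -> Tuple[str, int]:
    """Hypothetical helper: share (address, port) from rank 0 and export
    the variables that the env:// init method reads.

    `bcast` is any function that returns rank 0's object on every rank,
    for example `lambda obj: comm.bcast(obj, root=0)` with mpi4py.
    """
    payload = None
    if rank == 0:
        # Assumption: the hostname resolves on the other nodes. On hosts
        # with several NICs, pass the IP of the desired interface instead.
        addr = master_addr if master_addr is not None else socket.gethostname()
        payload = (addr, find_free_port())
    addr, port = bcast(payload)
    os.environ["MASTER_ADDR"] = str(addr)
    os.environ["MASTER_PORT"] = str(port)
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    return addr, port


def init_with_mpi(backend: str = "nccl") -> None:
    """Hypothetical entry point: MPI is used for init only, not as backend."""
    # Imported lazily so the helpers above work without mpi4py or torch.
    from mpi4py import MPI
    import torch.distributed as dist

    comm = MPI.COMM_WORLD
    rank, world_size = comm.Get_rank(), comm.Get_size()
    setup_env_rendezvous(rank, world_size, lambda obj: comm.bcast(obj, root=0))
    dist.init_process_group(
        backend=backend, init_method="env://", rank=rank, world_size=world_size
    )
```

Each process would call init_with_mpi() once at startup when launched under mpirun. The port picked on rank 0 could in principle be taken by another process before the rendezvous binds it, so this is a convenience, not a robust replacement for a proper MPI-based init_method.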