What's the fastest setting for distributed training, and why?

I’ve seen that torch.distributed.init_process_group lets us set up process communication over TCP or through a shared file, depending on the argument passed as init_method. How should this be set for the best performance, and why?

Maybe we can discuss the following cases:

  • Single node, multiple processes (multiple GPUs): tcp://127.0.0.1:$port vs file:///tmp/somefile
  • Multiple nodes: tcp://$ip:$port vs file:///share/nfs/somefile


Unless you have a very specific reason, I would always suggest using TCPStore, which is much more reliable than FileStore. Having said that, our stores are mostly used for setting up your distributed training environment. Once all ranks have established communication via a collective communication library (CCL), the store plays very little role in the performance of your training. So it is much more important to specify the correct parameters for your underlying CCL (e.g. NCCL, Gloo) than to pick a particular type of store.
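To make this concrete, here is a minimal sketch of the two rendezvous options from the question. The helper names make_init_method and init_distributed, the default port 29500, and the NCCL_DEBUG value are illustrative assumptions, not part of PyTorch. The only PyTorch call used is torch.distributed.init_process_group.

```python
import os


def make_init_method(kind, host="127.0.0.1", port=29500, path="/tmp/somefile"):
    """Build an init_method URL (hypothetical helper, not a PyTorch API)."""
    if kind == "tcp":
        # TCP rendezvous: every rank connects to the store hosted at host:port.
        return f"tcp://{host}:{port}"
    if kind == "file":
        # File rendezvous: every rank must see the same path (e.g. over NFS).
        return f"file://{path}"
    raise ValueError(f"unknown rendezvous kind: {kind!r}")


def init_distributed(rank, world_size, init_method, backend="nccl"):
    """Create the default process group; requires PyTorch to be installed."""
    # Imported lazily so make_init_method works without PyTorch.
    import torch.distributed as dist

    # Backend tuning is done through the backend's own environment variables,
    # set before the group is created. The value here is only an example.
    if backend == "nccl":
        os.environ.setdefault("NCCL_DEBUG", "WARN")

    dist.init_process_group(
        backend=backend,
        init_method=init_method,
        rank=rank,
        world_size=world_size,
    )
    return dist
```

Each rank would call something like init_distributed(rank, world_size, make_init_method("tcp", host=master_ip)), where master_ip is a placeholder for the address of the node hosting the store. Switching "tcp" to "file" changes only the rendezvous step, so it should not noticeably affect training throughput once the process group is up.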
