Gloo backend default device

Hi PyTorch experts,

I am trying to use torch.distributed package for my distributed training. The backend I am using is gloo.

Based on this doc: https://pytorch.org/docs/stable/distributed.html, gloo supports all_reduce on both CPU and GPU, but there seems to be no explicit way to choose one over the other.

I am wondering, during training, does gloo perform all_reduce automatically based on the tensor’s device type? For example, if the tensors are on GPU, does it perform all_reduce on GPU, and if the tensors are on CPU, perform it on CPU?

Also, when all_reduce is performed on GPU, does gloo fall back to NCCL?

Thanks in advance!

I am wondering, during training, does gloo perform all_reduce automatically based on the tensor’s device type?

Yes, see https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroupGloo.cpp#L720. Essentially, we check the input’s device type, and run the appropriate operation based on that.
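To see this dispatch in action, here is a minimal sketch of a gloo all_reduce on CPU tensors (two processes on localhost; the address and port values are arbitrary choices for this example):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    # Rendezvous settings for a single-machine run; any free port works.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # CPU tensor: gloo sees the device type and runs the reduction on CPU.
    # Rank 0 contributes 1s, rank 1 contributes 2s.
    t = torch.ones(2) * (rank + 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    # With world_size=2 the elementwise sum is 1 + 2 = 3 on every rank.
    assert torch.allclose(t, torch.full((2,), 3.0))

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

If the tensors had been created with `device="cuda"` (on a CUDA-enabled gloo build), the same `dist.all_reduce` call would dispatch to the GPU path instead; no code change is needed beyond the tensor's device.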

Also, when all_reduce is performed on GPU, does gloo fall back to NCCL?

No, gloo does not fall back to NCCL. The gloo backend itself can be built with CUDA support and implements GPU operations natively (https://github.com/facebookincubator/gloo/blob/master/docs/cuda.md).

Thanks @rvarm1. This helps a lot!