Hi PyTorch experts,
I am trying to use the torch.distributed package for distributed training. The backend I am using is gloo.
Based on this doc: https://pytorch.org/docs/stable/distributed.html, gloo supports all_reduce on both CPU and GPU, but there doesn't seem to be a specific way to choose one over the other.
I am wondering: during training, does gloo perform all_reduce automatically based on the tensor's device? That is, if the tensors are on the GPU, is all_reduce performed on the GPU, and if they are on the CPU, is it performed on the CPU?
Also, when all_reduce is performed on GPU tensors, does gloo fall back to NCCL?
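For context, here is roughly what I mean (a minimal sketch; the rendezvous settings, world size, and device indexing are just placeholders for this example):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Placeholder rendezvous settings for this sketch
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Case 1: CPU tensor -- presumably gloo reduces this on the CPU
    cpu_tensor = torch.ones(4)
    dist.all_reduce(cpu_tensor)

    # Case 2: CUDA tensor -- is the reduction done on the GPU,
    # or does gloo fall back to some other path (e.g. NCCL)?
    if torch.cuda.is_available():
        device = f"cuda:{rank % torch.cuda.device_count()}"
        gpu_tensor = torch.ones(4, device=device)
        dist.all_reduce(gpu_tensor)

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```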
Thanks in advance!