Suppose I have a vector of type
torch.int32. During an all_reduce operation, do all 32 bits of each coordinate get transmitted, irrespective of the value at that coordinate?
More specifically, I am interested in how we achieve higher speeds when reducing sparse tensors. (By sparse tensors I mean tensors with a large number of zeros.)
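For context, here is a quick way to see the dense payload size involved (an illustrative example, not from any particular backend):

```python
import torch

# Each int32 element occupies 4 bytes regardless of its value,
# so a dense collective moves element_size() * numel() bytes per tensor.
v = torch.zeros(1000, dtype=torch.int32)
print(v.element_size() * v.numel())  # 4000
```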
We currently support
all_reduce on sparse tensors with the Gloo backend (for both CPU and CUDA tensors), but this is not yet supported with the NCCL backend.
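A minimal sketch of what this looks like from the user side, assuming a single-process group (world_size=1) purely for illustration — with the Gloo backend, all_reduce accepts sparse COO tensors directly:

```python
import tempfile
import torch
import torch.distributed as dist

# Single-process group for illustration; a real job would have multiple ranks.
init_file = tempfile.NamedTemporaryFile(delete=False).name
dist.init_process_group("gloo", init_method=f"file://{init_file}",
                        rank=0, world_size=1)

# A sparse COO tensor: only nonzero indices and values are stored.
t = torch.sparse_coo_tensor(
    indices=torch.tensor([[0, 3, 7]]),
    values=torch.tensor([1.0, 2.0, 3.0]),
    size=(10,),
)
dist.all_reduce(t)  # sums the sparse tensors across all ranks

print(t.to_dense())
dist.destroy_process_group()
```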
In TensorFlow, sparse tensors (tf.IndexedSlices) are all_reduced by
all_gathering them, followed by a local tensor reduction.
Does PyTorch do the same (with the Gloo backend), or does it do something different under the hood?
It’s pretty similar - we
all_gather the metadata, indices, and values, and then each node does a local sum of the sparse tensors. Here’s the implementation: https://github.com/pytorch/pytorch/blob/65bd38127a34d428915c88507878b4735edf005f/torch/lib/c10d/ProcessGroupGloo.cpp#L939
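A rough single-process illustration of that scheme — simulating the (indices, values) pairs each rank would receive from the all_gather, not the actual c10d code:

```python
import torch

size = (8,)

# Simulated all_gather result: each rank contributed one (indices, values) pair.
rank_slices = [
    (torch.tensor([[0, 2]]), torch.tensor([1.0, 1.0])),  # from rank 0
    (torch.tensor([[2, 5]]), torch.tensor([2.0, 3.0])),  # from rank 1
]

# Local reduction: each node sums all gathered sparse tensors.
total = torch.sparse_coo_tensor(
    torch.empty((1, 0), dtype=torch.long), torch.empty(0), size=size
)
for idx, val in rank_slices:
    total = total + torch.sparse_coo_tensor(idx, val, size=size)

# Coalesce merges duplicate indices (here, index 2 appears twice).
total = total.coalesce()
print(total.to_dense())  # tensor([1., 0., 3., 0., 0., 3., 0., 0.])
```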