All reducing tensors

Suppose I have a vector of type torch.int32. During an all_reduce operation, do all 32 bits of each coordinate get transmitted, irrespective of the value at that coordinate?

More specifically, I am interested in how we achieve higher speeds when reducing sparse tensors. (By sparse tensors I mean tensors with a large number of zeros.)
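To make the bandwidth question concrete: a dense all_reduce ships every element at its full dtype width regardless of value, while a COO-style sparse representation only ships the nonzero entries plus their indices. A small back-of-the-envelope sketch (the 1% density figure is illustrative, not from the thread):

```python
# Payload of a dense all_reduce is fixed by length and dtype,
# not by the values: every torch.int32 element costs 4 bytes,
# zeros included.
n = 1_000_000
dense_bytes = n * 4

# A COO-style sparse payload ships only the nonzeros: assume one
# int64 index (8 bytes) plus one int32 value (4 bytes) per entry.
nnz = 10_000  # hypothetical 1% density
sparse_bytes = nnz * (8 + 4)

print(dense_bytes, sparse_bytes)  # 4000000 vs 120000
```

So at 1% density the sparse payload is roughly 33x smaller, even after paying for the indices; the break-even point depends on the index width and the density.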

We currently support all_reduce on sparse tensors with the Gloo backend (for both CPU and CUDA tensors), but this is not yet supported with the NCCL backend.

TensorFlow all-reduces sparse tensors (tf.IndexedSlices) by all_gathering them, followed by a local tensor reduction.

Does PyTorch do the same (with the Gloo backend), or does it do something different under the hood?

It’s pretty similar - we all_gather the metadata, indices, and values, and then each node does a local sum of the sparse tensors. Here’s the implementation: https://github.com/pytorch/pytorch/blob/65bd38127a34d428915c88507878b4735edf005f/torch/lib/c10d/ProcessGroupGloo.cpp#L939
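The scheme described above can be sketched in plain Python, with no real process group: every rank gathers every other rank's (indices, values) pairs, then coalesces them locally by summing values that share an index. The helper names here are illustrative, not from the linked implementation:

```python
# Simulated sketch of "all_gather, then local sparse sum".

def all_gather_sim(per_rank_payloads):
    # Stand-in for all_gather: afterwards, every rank holds
    # the payloads contributed by all ranks.
    return [list(per_rank_payloads) for _ in per_rank_payloads]

def local_sparse_sum(payloads):
    # Coalesce: sum values that share an index, like summing
    # COO sparse tensors.
    acc = {}
    for indices, values in payloads:
        for i, v in zip(indices, values):
            acc[i] = acc.get(i, 0) + v
    return sorted(acc.items())

# Two ranks, each holding a sparse slice of the same vector.
rank0 = ([1, 4], [10, 20])
rank1 = ([4, 6], [5, 7])

gathered = all_gather_sim([rank0, rank1])
results = [local_sparse_sum(p) for p in gathered]
print(results[0])  # [(1, 10), (4, 25), (6, 7)] on every rank
```

Because the reduction happens locally after the gather, each rank ends up with the identical reduced sparse tensor, which is the all_reduce postcondition.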