Suppose I have a vector of type
torch.int32. During an all_reduce operation, do all 32 bits of each coordinate get transmitted, irrespective of the value at that coordinate?
More specifically, I am interested in how we achieve higher speeds when reducing sparse tensors. (By sparse tensors I mean tensors with a large number of zeros.)
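For context, here is a quick way to see the dense payload size involved (an illustrative example, not from any particular backend):

```python
import torch

# Each int32 element occupies 4 bytes regardless of its value,
# so a dense collective moves element_size() * numel() bytes per tensor.
v = torch.zeros(1000, dtype=torch.int32)
print(v.element_size() * v.numel())  # 4000
```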
We currently support
all_reduce on sparse tensors with the Gloo backend (for both CPU and CUDA tensors), but this is not yet supported with the NCCL backend.
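A minimal sketch of what this looks like from the user side, assuming a single-process group (world_size=1) purely for illustration — with the Gloo backend, all_reduce accepts sparse COO tensors directly:

```python
import tempfile
import torch
import torch.distributed as dist

# Single-process group for illustration; a real job would have multiple ranks.
init_file = tempfile.NamedTemporaryFile(delete=False).name
dist.init_process_group("gloo", init_method=f"file://{init_file}",
                        rank=0, world_size=1)

# A sparse COO tensor: only nonzero indices and values are stored.
t = torch.sparse_coo_tensor(
    indices=torch.tensor([[0, 3, 7]]),
    values=torch.tensor([1.0, 2.0, 3.0]),
    size=(10,),
)
dist.all_reduce(t)  # sums the sparse tensors across all ranks

print(t.to_dense())
dist.destroy_process_group()
```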
In TensorFlow, sparse tensors (tf.IndexedSlices) are all_reduced by
all_gathering them, followed by a local tensor reduction.
Does PyTorch do the same (with the Gloo backend), or does it do something different under the hood?
It’s pretty similar - we
all_gather the metadata, indices, and values, and then each node does a local sum of the sparse tensors. Here’s the implementation: https://github.com/pytorch/pytorch/blob/65bd38127a34d428915c88507878b4735edf005f/torch/lib/c10d/ProcessGroupGloo.cpp#L939
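A rough single-process illustration of that scheme — simulating the (indices, values) pairs each rank would receive from the all_gather, not the actual c10d code:

```python
import torch

size = (8,)

# Simulated all_gather result: each rank contributed one (indices, values) pair.
rank_slices = [
    (torch.tensor([[0, 2]]), torch.tensor([1.0, 1.0])),  # from rank 0
    (torch.tensor([[2, 5]]), torch.tensor([2.0, 3.0])),  # from rank 1
]

# Local reduction: each node sums all gathered sparse tensors.
total = torch.sparse_coo_tensor(
    torch.empty((1, 0), dtype=torch.long), torch.empty(0), size=size
)
for idx, val in rank_slices:
    total = total + torch.sparse_coo_tensor(idx, val, size=size)

# Coalesce merges duplicate indices (here, index 2 appears twice).
total = total.coalesce()
print(total.to_dense())  # tensor([1., 0., 3., 0., 0., 3., 0., 0.])
```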