Sparse AllReduce Performance on Large GPU Clusters

Sparse all-reduce has been implemented in [Add sparse tensor allreduce by pietern · Pull Request #22036 · pytorch/pytorch](https://github.com/pytorch/pytorch/pull/22036).
However, in our case, when the GPU cluster scales up to several hundred workers, even high sparsification ratios still produce significant communication overhead, which can end up worse than DenseAllReduce.
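
To illustrate why this can happen: the usual way to all-reduce sparse gradients is to all-gather each worker's (indices, values) pairs, so the data each worker receives grows linearly with the world size, whereas a ring-based dense allreduce moves roughly 2x the tensor size per worker regardless of how many workers participate. The sketch below is a minimal, hypothetical top-k version of that allgather scheme, not the implementation from PR #22036; `sparse_allreduce_via_allgather` and the fixed `k` are illustrative choices, and it assumes the default process group has already been initialized.

```python
# Minimal sketch (not the PR's implementation) of allgather-based sparse
# allreduce. Each worker sends k indices + k values but receives W * k of
# each, so per-worker communication grows linearly with world size W.
import torch
import torch.distributed as dist


def sparse_allreduce_via_allgather(grad: torch.Tensor, k: int) -> torch.Tensor:
    """All-reduce a sparsified gradient by all-gathering top-k indices/values."""
    world_size = dist.get_world_size()

    # Local sparsification: keep the k largest-magnitude entries (illustrative choice).
    flat = grad.flatten()
    _, idx = torch.topk(flat.abs(), k)
    val = flat[idx]

    # All-gather indices and values from every worker (fixed k keeps shapes equal).
    idx_list = [torch.empty_like(idx) for _ in range(world_size)]
    val_list = [torch.empty_like(val) for _ in range(world_size)]
    dist.all_gather(idx_list, idx)
    dist.all_gather(val_list, val)

    # Sum contributions locally; duplicate indices accumulate via index_add_.
    out = torch.zeros_like(flat)
    for i, v in zip(idx_list, val_list):
        out.index_add_(0, i, v)
    return out.view_as(grad)
```

Rough arithmetic: with N gradient elements, density d = k / N, and W workers, each worker receives on the order of W * d * N values (plus the same number of indices), while a ring dense allreduce stays near 2 * N per worker. So the allgather approach loses its advantage roughly once W * d exceeds 2; at several hundred workers that happens even at densities around 1%.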

Is it possible to optimize the communication volume of sparse all-reduce with a large number of workers?