The classic example is an NCCL communication kernel running in the background. In that case the matmul expects to use all of the SMs but cannot, because some of them are busy with communication. Its blocks may then be scheduled as two waves on the remaining SMs instead of one, and this “wave quantization” can double the matmul kernel's latency.
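A minimal sketch of the setup where this shows up, assuming a PyTorch NCCL process group is initialized elsewhere. `NCCL_MAX_NCHANNELS` is a real NCCL environment variable that caps how many channels (roughly one SM each) the communication kernel occupies; the value shown is only an illustrative starting point, not a known fix for this issue.

```python
import os
import torch
import torch.distributed as dist

# Must be set before the NCCL communicator is created (i.e. before
# dist.init_process_group); it caps how many channels, and hence SMs,
# the background communication kernel occupies. "8" is just a tuning knob.
os.environ.setdefault("NCCL_MAX_NCHANNELS", "8")

def overlapped_step(x: torch.Tensor, w: torch.Tensor,
                    bucket: torch.Tensor) -> torch.Tensor:
    # Launch the all-reduce asynchronously so it runs in the background.
    handle = dist.all_reduce(bucket, async_op=True)
    # While NCCL holds some SMs, this matmul may no longer fit in a single
    # wave on the remaining SMs and spills into a second wave.
    y = x @ w
    handle.wait()
    return y
```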
Hi, I wonder if there is a workaround for the issue described here [link] when overlapping communication and computation (TL;DR: when communication and computation overlap, the matmul becomes slower)?