Matmul slows down when overlapped with communication

The classic example is an NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. Its blocks can then no longer all run at once and get scheduled as two separate waves on the available SMs. This “wave quantization” can double the latency of the matmul kernel.
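To make the wave arithmetic concrete, here is a minimal Python sketch. The numbers are made up for illustration (132 SMs in total, 8 of them occupied by communication, a matmul that launches 132 blocks), and it assumes one matmul block per SM and that every wave takes roughly the same time.

```python
import math

def num_waves(num_blocks: int, available_sms: int, blocks_per_sm: int = 1) -> int:
    """Number of scheduling waves needed to run all blocks of a kernel."""
    return math.ceil(num_blocks / (available_sms * blocks_per_sm))

TOTAL_SMS = 132      # illustrative value, not taken from the thread
COMM_SMS = 8         # hypothetical number of SMs held by the communication kernel
MATMUL_BLOCKS = 132  # matmul tiled to fill the whole GPU exactly once

waves_alone = num_waves(MATMUL_BLOCKS, TOTAL_SMS)
waves_overlapped = num_waves(MATMUL_BLOCKS, TOTAL_SMS - COMM_SMS)

print(waves_alone)       # 1
print(waves_overlapped)  # 2
```

With only 124 SMs free, the last 8 blocks spill into a second, almost empty wave, which under these assumptions takes about as long as the first one.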

Hi, is there a workaround for the issue described here [link] when overlapping communication and computation? (TL;DR: when communication and computation overlap, the matmul becomes slower.)

Maybe you could dedicate a fixed set of SMs to communication and cap how many SMs the communication kernels are allowed to use, so that the matmul slows down less?
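One knob I'm aware of for capping communication SM usage is NCCL's `NCCL_MAX_NCHANNELS` environment variable: fewer channels means fewer CUDA blocks for the NCCL kernels. Below is a rough sketch, not a tested recipe. The helper names are hypothetical, the SM count is illustrative, and the "one SM per channel" accounting is an assumption. The variable has to be set before the NCCL communicator is created.

```python
import os

def cap_nccl_channels(max_channels: int) -> None:
    """Ask NCCL to use at most `max_channels` channels.

    Must run before NCCL is initialized, since NCCL reads the
    environment when the communicator is created.
    """
    os.environ["NCCL_MAX_NCHANNELS"] = str(max_channels)

def sms_left_for_matmul(total_sms: int, max_channels: int) -> int:
    # Assumption: each NCCL channel keeps roughly one SM busy
    # while a collective is in flight.
    return total_sms - max_channels

cap_nccl_channels(4)
compute_sms = sms_left_for_matmul(total_sms=132, max_channels=4)
print(compute_sms)  # 128
```

The idea would then be to size the matmul so its block count fits into `compute_sms` in a whole number of waves, at the cost of lower communication bandwidth from the reduced channel count.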