The classic example is an NCCL communication kernel running in the background. In that case the matmul expects to use all of the SMs but cannot, because some of them are busy with communication. Its blocks may then be scheduled as two waves on the remaining SMs instead of one, and this “wave quantization” can double the matmul kernel's latency.
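A minimal sketch of the setup where this shows up, assuming a PyTorch NCCL process group is initialized elsewhere. `NCCL_MAX_NCHANNELS` is a real NCCL environment variable that caps how many channels (roughly one SM each) the communication kernel occupies; the value shown is only an illustrative starting point, not a known fix for this issue.

```python
import os
import torch
import torch.distributed as dist

# Must be set before the NCCL communicator is created (i.e. before
# dist.init_process_group); it caps how many channels, and hence SMs,
# the background communication kernel occupies. "8" is just a tuning knob.
os.environ.setdefault("NCCL_MAX_NCHANNELS", "8")

def overlapped_step(x: torch.Tensor, w: torch.Tensor,
                    bucket: torch.Tensor) -> torch.Tensor:
    # Launch the all-reduce asynchronously so it runs in the background.
    handle = dist.all_reduce(bucket, async_op=True)
    # While NCCL holds some SMs, this matmul may no longer fit in a single
    # wave on the remaining SMs and spills into a second wave.
    y = x @ w
    handle.wait()
    return y
```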
Hi, I wonder if there is a workaround for the issue described here [link] when overlapping communication and computation (TL;DR: when communication and computation overlap, the matmul becomes slower)?