Slowdown when using multiple CUDA streams due to small grid size

I have a bunch of torch.mm calls that don't depend on each other, so I thought I'd split them across two CUDA streams to potentially speed them up, since not all of them are big enough to saturate the GPU on their own.

I’m running them with the following code (mms is a list of partial calls to torch.mm):

import torch

def parallel_run(mms, device):
    orig = torch.cuda.current_stream(device)
    s1 = torch.cuda.Stream(device)
    s2 = torch.cuda.Stream(device)

    # Make sure both side streams see all work already queued on the original stream.
    s1.wait_stream(orig)
    s2.wait_stream(orig)

    # Launch the first half of the matmuls on s1 and the second half on s2.
    with torch.cuda.stream(s1):
        for mm in mms[:len(mms) // 2]:
            mm()
    with torch.cuda.stream(s2):
        for mm in mms[len(mms) // 2:]:
            mm()

    # Block the original stream until both side streams have finished.
    orig.wait_stream(s1)
    orig.wait_stream(s2)
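
For context, the mms list is built roughly like this (the shapes below are just placeholders, not my real ones):

import functools
import torch

device = torch.device("cuda:0")
shapes = [(256, 512, 256), (4096, 4096, 4096), (1024, 2048, 512)]

# Each entry is a no-argument callable that launches one matmul when invoked.
mms = [
    functools.partial(torch.mm,
                      torch.randn(m, k, device=device),
                      torch.randn(k, n, device=device))
    for m, k, n in shapes
]

parallel_run(mms, device)
torch.cuda.synchronize(device)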

Everything works correctly, and I can see in Nsight that the two halves run in parallel, but the second set of matmuls runs much slower than the first. Looking at Nsight, the grid sizes of the second set of kernel launches are all clamped at 54 (half of 108, the number of SMs on the GPU). This leaves a long tail of slow kernels that don't utilize the full GPU even after the first set has finished, negating all the performance gains.

The grid size reaches up to 324 both when everything runs in a single stream and for the kernels launched on the first stream (s1).

I’m running PyTorch 1.13.1 with CUDA 12 on an A100.

Is there any way to avoid this? Thanks!

I’ve found a decently good workaround that still allows me to parallelize over multiple streams: allocate only the smaller ops to the secondary stream as they wouldn’t have taken the full grid anyway.
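
In code, the split looks something like this (the per-op sizes and the threshold are things I pass in and tune by hand, so treat them as placeholders):

def parallel_run_sized(mms, sizes, device, threshold=2048 * 2048):
    # Variant of parallel_run: keep the large matmuls on the current stream
    # and push only the small ones to a secondary stream.
    orig = torch.cuda.current_stream(device)
    side = torch.cuda.Stream(device)
    side.wait_stream(orig)

    small = [mm for mm, s in zip(mms, sizes) if s < threshold]
    large = [mm for mm, s in zip(mms, sizes) if s >= threshold]

    # The small ops wouldn't fill the GPU anyway, so the clamped grid size
    # on the secondary stream costs little; the large ops keep the full grid.
    with torch.cuda.stream(side):
        for mm in small:
            mm()
    for mm in large:
        mm()

    orig.wait_stream(side)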

Still would like a way to disable this behavior in torch though!