CUDA stream not working

beann · July 4, 2022, 8:13am

I am trying to use CUDA streams in PyTorch with a multi-branch network. However, it seems the branches are still executed sequentially. The code snippet looks like this:

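import torch

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()

def forward(x):
    with torch.cuda.stream(s1):  # enqueue f1 on side stream s1
        a = f1(x)
    with torch.cuda.stream(s2):  # enqueue f2 on side stream s2
        b = f2(x)
    return a, b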

Does anyone have experience with this?

ptrblck · July 4, 2022, 7:20pm

If not enough compute resources are available, the kernels will be launched sequentially, each one starting once enough SMs are free.
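For reference, here is a minimal sketch of the two-branch pattern from the question with explicit stream synchronization added. The helper name `run_branches`, the batch size, and the CPU fallback are assumptions made for illustration, not part of the original post. Even with correct synchronization, overlap is not guaranteed, for the resource reason given above.

```python
import torch
import torch.nn as nn

def run_branches(f1, f2, x):
    """Run f1 and f2 on separate CUDA streams (sequentially on CPU)."""
    if not x.is_cuda:
        return f1(x), f2(x)
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    current = torch.cuda.current_stream()
    # The side streams must not read x before the current stream has produced it.
    s1.wait_stream(current)
    s2.wait_stream(current)
    with torch.cuda.stream(s1):
        a = f1(x)
    with torch.cuda.stream(s2):
        b = f2(x)
    # The current stream waits for both branches before a and b are consumed.
    current.wait_stream(s1)
    current.wait_stream(s2)
    # Tell the caching allocator that a and b are now used on the current stream.
    a.record_stream(current)
    b.record_stream(current)
    return a, b

device = "cuda" if torch.cuda.is_available() else "cpu"
f1 = nn.Sequential(nn.Linear(1024, 10), nn.Linear(10, 10)).to(device)
f2 = nn.Sequential(nn.Linear(1024, 10), nn.Linear(10, 10)).to(device)
x = torch.randn(8, 1024, device=device)
with torch.no_grad():
    a, b = run_branches(f1, f2, x)
```

Whether the two branches actually overlap is best checked with a profiler rather than inferred from wall-clock time.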

beann · July 6, 2022, 2:13am

Thank you for your answer. I’ve tried some really small networks on a 2080 Ti, but it is still not working.
E.g.,
f1 = nn.Sequential(nn.Linear(1024, 10), nn.Linear(10, 10))
f2 = nn.Sequential(nn.Linear(1024, 10), nn.Linear(10, 10))

Or is there any way to figure out what size the GPU can support?

ptrblck · July 6, 2022, 3:38am

It depends on the kernel implementation and the overall occupancy. I.e. each SM has limited resources in shared memory, threads, and registers. If one of these resources is exhausted, there won’t be any way to fit another block onto the SM.
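To make the resource argument concrete, the following is a simplified sketch of how many blocks of one kernel fit on a single SM. The class `SMLimits`, the function `blocks_per_sm`, and the default limits are illustrative assumptions. Real limits depend on the GPU architecture, and real hardware also applies allocation granularities that are ignored here.

```python
from dataclasses import dataclass

@dataclass
class SMLimits:
    # Illustrative per-SM limits (assumed values, not queried from a device).
    max_threads: int = 1024
    max_blocks: int = 16
    registers: int = 65536
    shared_mem_bytes: int = 65536

def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                  limits=SMLimits()):
    """Return how many blocks fit on one SM; the scarcest resource decides."""
    by_threads = limits.max_threads // threads_per_block
    by_regs = limits.registers // (regs_per_thread * threads_per_block)
    # A kernel that uses no shared memory is not limited by it.
    if smem_per_block > 0:
        by_smem = limits.shared_mem_bytes // smem_per_block
    else:
        by_smem = limits.max_blocks
    return min(limits.max_blocks, by_threads, by_regs, by_smem)

# Limited by threads: 1024 // 256 = 4 blocks.
thread_bound = blocks_per_sm(256, 32, 0)
# Limited by registers: 65536 // (128 * 256) = 2 blocks.
register_bound = blocks_per_sm(256, 128, 0)
# Limited by shared memory: 65536 // 49152 = 1 block.
smem_bound = blocks_per_sm(128, 32, 48 * 1024)
print(thread_bound, register_bound, smem_bound)
```

Once a kernel from one stream has filled the SMs up to this limit, blocks from a kernel on another stream have to wait for resources to free up.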
The GTC 2022 - How CUDA Programming Works talk by Stephen Jones, CUDA Architect, NVIDIA, might be interesting to you. Especially the section starting at 23:00.
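As a starting point for finding out what a particular GPU offers, PyTorch exposes some basic device properties. This is a minimal sketch: the helper name `describe_gpu` is an assumption for illustration, and it reports only the SM count and total memory, not per-kernel occupancy.

```python
import torch

def describe_gpu(index=0):
    """Return basic properties of a CUDA device, or None if CUDA is unavailable."""
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(index)
    return {
        "name": props.name,
        "compute_capability": (props.major, props.minor),
        "sm_count": props.multi_processor_count,
        "total_memory_gib": props.total_memory / 1024**3,
    }

info = describe_gpu()
print(info)
```

Per-kernel register and shared-memory usage is not available from this call; a profiler such as Nsight Compute reports it.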