CUDA stream not working

I am trying to use CUDA streams in PyTorch with a multi-branch network. However, the branches still seem to be executed sequentially. The code snippet looks like this:

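import torch

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()  # one side stream per branch

def forward(x):
    with torch.cuda.stream(s1):
        a = f1(x)  # first branch, queued on s1
    with torch.cuda.stream(s2):
        b = f2(x)  # second branch, queued on s2
    return a, b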

Does anyone have experience with this?

If not enough compute resources are available, the kernels will run sequentially: the next kernel is launched only once enough SMs (streaming multiprocessors) are free.

Thank you for your answer. I’ve tried some really small networks on a 2080 Ti, but it is still not working.
E.g.,
f1 = nn.Sequential(nn.Linear(1024, 10), nn.Linear(10, 10))
f2 = nn.Sequential(nn.Linear(1024, 10), nn.Linear(10, 10))

Or is there a way to figure out how large a workload the GPU can run concurrently?
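One way to check whether the two branches overlap is to time them with and without side streams. The sketch below is not from the thread: the helper names make_branch, forward_streams and time_ms, the batch size of 64, and the warmup and iteration counts are all assumptions chosen for illustration. It also adds the stream synchronization that the short snippet above leaves out.

```python
import torch
import torch.nn as nn


def make_branch():
    # Same tiny branch as f1/f2 in the post above.
    return nn.Sequential(nn.Linear(1024, 10), nn.Linear(10, 10))


def forward_streams(f1, f2, x, s1, s2):
    # Side streams wait for work already queued on the current stream
    # (for example the kernels that produced x) before reading x.
    cur = torch.cuda.current_stream()
    s1.wait_stream(cur)
    s2.wait_stream(cur)
    with torch.cuda.stream(s1):
        a = f1(x)
    with torch.cuda.stream(s2):
        b = f2(x)
    # The current stream waits for both branches before a and b are used.
    # If a and b are consumed later on the current stream, calling
    # a.record_stream(cur) and b.record_stream(cur) may also be needed.
    cur.wait_stream(s1)
    cur.wait_stream(s2)
    return a, b


def time_ms(fn, iters=100, warmup=10):
    # Average time per call in milliseconds, measured with CUDA events.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters


if torch.cuda.is_available():
    device = torch.device("cuda")
    f1, f2 = make_branch().to(device), make_branch().to(device)
    x = torch.randn(64, 1024, device=device)  # assumed batch size
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    with torch.no_grad():
        seq = time_ms(lambda: (f1(x), f2(x)))
        par = time_ms(lambda: forward_streams(f1, f2, x, s1, s2))
    print(f"sequential: {seq:.3f} ms, two streams: {par:.3f} ms")
```

If the two timings come out about the same for layers this small, a likely explanation is that each kernel finishes in less time than the CPU needs to launch the next one, so there is nothing left to overlap. A timeline profiler such as Nsight Systems can show whether kernels on the two streams actually run at the same time.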

It depends on the kernel implementation and the overall occupancy: each SM has a limited amount of shared memory, threads, and registers. If any one of these resources is exhausted, no further block can be placed on that SM.
The GTC 2022 talk "How CUDA Programming Works" by Stephen Jones (CUDA Architect, NVIDIA) might be interesting to you, especially the section starting at 23:00.
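The resource argument above can be made concrete with a little arithmetic. The following is a rough sketch, not a replacement for a real occupancy calculator: the SMLimits class and the blocks_per_sm function are hypothetical names, the per-SM limits are assumed illustrative values (check the specification for your GPU), and hardware allocation granularity is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    # Assumed, illustrative per-SM limits; real values depend on the GPU.
    max_threads: int = 1024
    max_blocks: int = 16
    registers: int = 65536
    shared_mem_bytes: int = 65536


def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                  sm=SMLimits()):
    # Each resource gives its own upper bound on resident blocks;
    # the smallest bound wins.
    by_threads = sm.max_threads // threads_per_block
    by_regs = sm.registers // (regs_per_thread * threads_per_block)
    if smem_per_block:
        by_smem = sm.shared_mem_bytes // smem_per_block
    else:
        by_smem = sm.max_blocks  # no shared memory used, so no limit from it
    return min(sm.max_blocks, by_threads, by_regs, by_smem)


print(blocks_per_sm(256, 32, 0))          # limited by threads: 4
print(blocks_per_sm(256, 128, 0))         # limited by registers: 2
print(blocks_per_sm(256, 32, 48 * 1024))  # limited by shared memory: 1
```

Once one kernel's blocks use up any of these budgets on every SM, blocks from a kernel on another stream have to wait, even if the other resources are mostly idle.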