CUDA stream not working

Whether kernels launched on different streams actually run concurrently depends on the kernel implementation and the overall occupancy. Each SM has a limited amount of shared memory, threads, and registers. If the blocks already resident on an SM exhaust any one of these resources, no further block can be placed on that SM until resources are freed.
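To make the resource argument concrete, here is a rough back-of-the-envelope sketch in plain Python. The `SMLimits` defaults are illustrative assumptions and not the values of any particular GPU (check your own device properties), `blocks_per_sm` is a hypothetical helper, and the model ignores allocation granularity, so treat the result as an estimate only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    # Illustrative per-SM limits (assumed values, not tied to a specific GPU).
    max_threads: int = 2048
    max_registers: int = 65536
    max_shared_mem: int = 100 * 1024  # bytes
    max_blocks: int = 32


def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                  limits=SMLimits()):
    """Estimate how many blocks of one kernel fit on a single SM.

    Simplified model: each resource is divided evenly and the scarcest
    one decides. Real hardware also rounds allocations up to fixed
    granularities, which this sketch ignores.
    """
    by_threads = limits.max_threads // threads_per_block
    by_regs = limits.max_registers // (regs_per_thread * threads_per_block)
    # A kernel that uses no shared memory is not limited by it.
    if smem_per_block > 0:
        by_smem = limits.max_shared_mem // smem_per_block
    else:
        by_smem = limits.max_blocks
    return min(by_threads, by_regs, by_smem, limits.max_blocks)


# Small blocks, few registers, no shared memory: several blocks fit.
light = blocks_per_sm(threads_per_block=256, regs_per_thread=32,
                      smem_per_block=0)

# Large blocks with many registers: registers run out after one block.
heavy = blocks_per_sm(threads_per_block=1024, regs_per_thread=64,
                      smem_per_block=48 * 1024)

print("light kernel blocks per SM:", light)
print("heavy kernel blocks per SM:", heavy)
```

Under these assumed limits, the "heavy" configuration leaves room for only a single block per SM. If such a kernel launches enough blocks to cover every SM, a kernel submitted on a second stream has nowhere to run until those blocks finish, so the two kernels execute back to back even though they are on different streams.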
The GTC 2022 talk "How CUDA Programming Works" by Stephen Jones, CUDA Architect at NVIDIA, might be interesting to you, especially the section starting at 23:00.