Kernels launched on different CUDA streams are serialized

Hi, I just ran into the same problem. Have you found a solution yet? Also, when I use torch.autograd.profiler.emit_nvtx(), nvvp shows only the default stream and none of the other streams.