CUDA Streams make code slower

Hi

I’m using CUDA streams to run two operations concurrently on the GPU. However, my code with multiple streams runs slower than the version that uses only the single (default) stream. Here’s my code for both cases; the streams function runs about twice as slow as the normal one.

import torch

def streams(X, dX, W1):
    s = torch.cuda.Stream()  # Create a new stream.
    with torch.cuda.stream(s):
        # This matmul is launched on the side stream...
        dZ1 = torch.matmul(dX, W1.T)
    # ...while this one is launched on the default stream.
    Z1 = torch.matmul(X, W1.T)
    return Z1, dZ1

def normal(X, dX, W1):
    # Both matmuls run one after the other on the default stream.
    Z1 = torch.matmul(X, W1.T)
    dZ1 = torch.matmul(dX, W1.T)
    return Z1, dZ1

In the example code above, I want to compute Z1 and dZ1 concurrently since they’re independent of each other. I can’t understand why this slowdown is happening.
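For reference, a minimal way to time the two functions would be something like the sketch below; the tensor shapes and iteration counts are placeholders, not my exact benchmark:

import time
import torch

# Placeholder sizes, chosen only for illustration.
X = torch.randn(4096, 4096, device="cuda")
dX = torch.randn(4096, 4096, device="cuda")
W1 = torch.randn(4096, 4096, device="cuda")

def time_fn(fn, iters=100):
    # Warm-up so one-time CUDA/cuBLAS initialization isn't measured.
    for _ in range(10):
        fn(X, dX, W1)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(X, dX, W1)
    torch.cuda.synchronize()  # Wait for all queued kernels to finish.
    return (time.perf_counter() - start) / iters

print("normal :", time_fn(normal))
print("streams:", time_fn(streams))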

Could you create a profile of this workload (e.g. via the PyTorch profiler or Nsight Systems) and share it here, please?
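Something along these lines should be enough to get a first trace with the PyTorch profiler (a minimal sketch; the tensor shapes below are placeholders, so adjust them to your workload):

import torch
from torch.profiler import profile, ProfilerActivity

X = torch.randn(4096, 4096, device="cuda")   # placeholder shapes
dX = torch.randn(4096, 4096, device="cuda")
W1 = torch.randn(4096, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        streams(X, dX, W1)
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# prof.export_chrome_trace("streams_trace.json")  # viewable in chrome://tracing or Perfetto

The table and the exported trace should show whether the two matmuls actually overlap on the GPU or whether the per-call stream setup just adds launch overhead.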