Hi
I’m using CUDA streams to run two operations concurrently on a GPU. However, my code with multiple streams runs slower than the version that uses the single (default) stream. Here’s my code for both cases; the `streams` function runs about twice as slow as `normal`:
```python
import torch

def streams(X, dX, W1):
    s = torch.cuda.Stream()  # Create a new stream.
    with torch.cuda.stream(s):
        dZ1 = torch.matmul(dX, W1.T)
        Z1 = torch.matmul(X, W1.T)
    return Z1, dZ1

def normal(X, dX, W1):
    Z1 = torch.matmul(X, W1.T)
    dZ1 = torch.matmul(dX, W1.T)
    return Z1, dZ1
```
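For reference, here is the kind of two-stream variant I would have expected to express the concurrency, based on my reading of the `torch.cuda` stream API: one matmul per stream, with explicit `wait_stream` synchronization back to the default stream before the results are used. (This `streams_concurrent` function is my own sketch, not part of the code I benchmarked.)

```python
import torch

def streams_concurrent(X, dX, W1):
    # One stream per independent matmul, so the two kernels can overlap.
    s1 = torch.cuda.Stream()
    s2 = torch.cuda.Stream()
    # Both side streams wait for any work already queued on the default
    # stream (e.g. the H2D copies that produced X, dX, W1).
    s1.wait_stream(torch.cuda.current_stream())
    s2.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s1):
        Z1 = torch.matmul(X, W1.T)
    with torch.cuda.stream(s2):
        dZ1 = torch.matmul(dX, W1.T)
    # Re-join: the default stream must wait for both side streams
    # before anything downstream consumes Z1 or dZ1.
    torch.cuda.current_stream().wait_stream(s1)
    torch.cuda.current_stream().wait_stream(s2)
    return Z1, dZ1
```

Even with this structure, I understand the kernels only actually overlap if each matmul leaves enough free SMs for the other, so large matmuls may still serialize.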
In the example above, I want to compute `Z1` and `dZ1` concurrently, since they’re independent. Why is the streamed version slower instead?