Run two operations concurrently on one GPU

Hi all,

I am trying to run two operations concurrently on a single GPU, but they end up executing sequentially.

I am using the sample code shown below, but my GPU profiler shows that the torch.mm() calls on A and B are executed sequentially on the GPU (see the picture attached at the bottom). Is this due to the kernel size? Does anyone have a clue about what the reason might be? BTW: I am using a Volta GPU.

Thanks in advance.

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Two CUDA streams so the two matmuls can (in principle) run concurrently
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

A = torch.rand(10000, 10000, device=device)
B = torch.rand(10000, 10000, device=device)
torch.cuda.synchronize()  # make sure allocation/initialization has finished

for i in range(10):
    # launch each matmul on its own stream
    with torch.cuda.stream(s1):
        C = torch.mm(A, A)
    with torch.cuda.stream(s2):
        D = torch.mm(B, B)
torch.cuda.synchronize()  # wait for both streams to finish

You might already be using the complete GPU resources with a single matmul of this size, which would disallow overlapping kernel execution.
Try to reduce the workload and check whether you then see overlapping execution, e.g. as in the sketch below.
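As a quick check, here is a minimal sketch of the same loop with much smaller matrices (the 500x500 size is an arbitrary assumption for illustration), so a single matmul should no longer occupy all SMs and the two kernels have a chance to overlap in the profiler:

import torch

device = torch.device("cuda")
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

# Smaller matrices (size chosen arbitrarily) so one matmul does not
# saturate the whole device by itself
A = torch.rand(500, 500, device=device)
B = torch.rand(500, 500, device=device)
torch.cuda.synchronize()

for _ in range(10):
    with torch.cuda.stream(s1):
        C = torch.mm(A, A)
    with torch.cuda.stream(s2):
        D = torch.mm(B, B)
torch.cuda.synchronize()

If the kernels overlap with the reduced size but not with 10000x10000, the sequential timeline you see is simply due to each large matmul already using all available compute resources, not due to the streams being set up incorrectly.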