Run two operations concurrently on one GPU

Hi all,

I am trying to run two operations concurrently on a single GPU. However, the operations are done sequentially.

I use the sample code shown below. However, my GPU profiler shows that the operations on A and B are computed sequentially on the GPU (see the attached picture at the bottom). Is this due to the kernel size? I wonder if anyone has a clue about what the reason might be. BTW: I am using a Volta GPU.

Thanks in advance.

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()
A = torch.rand(10000, 10000, device=device)
B = torch.rand(10000, 10000, device=device)
for i in range(10):
    C = torch.mm(A, A)
    D = torch.mm(B, B)

A single kernel might already be using all of the GPU's resources, which would prevent overlapping kernel execution.
Try reducing the workload and check whether you then see overlapping execution.
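
To try the suggestion above, here is a minimal sketch, not a definitive fix. It assumes each matmul is issued inside a `torch.cuda.stream(...)` context, since kernels launched on the same stream always run in order. The helper name `run_on_two_streams` and the size `n=256` are made up for illustration. Whether the profiler actually shows overlap will depend on the GPU and on how small the kernels are.

```python
import torch


def run_on_two_streams(n=256, iters=10):
    """Launch two small matmuls on separate CUDA streams (hypothetical helper).

    n=256 is an arbitrary "small workload" chosen for illustration; tune it
    while watching the profiler.
    """
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    A = torch.rand(n, n, device=device)
    B = torch.rand(n, n, device=device)

    if not use_cuda:
        # No GPU available: nothing to overlap, just compute the products.
        return torch.mm(A, A), torch.mm(B, B)

    s1 = torch.cuda.Stream()
    s2 = torch.cuda.Stream()
    # Make sure A and B are fully initialized before the side streams read them.
    torch.cuda.synchronize()

    for _ in range(iters):
        with torch.cuda.stream(s1):
            C = torch.mm(A, A)  # queued on stream s1
        with torch.cuda.stream(s2):
            D = torch.mm(B, B)  # queued on stream s2

    # Wait for both streams to finish before the results are used.
    torch.cuda.synchronize()
    return C, D


C, D = run_on_two_streams()
```

If the two kernels still appear back to back in the profiler at this size, increasing `n` step by step should show at which point a single matmul starts to occupy the whole device.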