Hi all,
I am trying to run two operations concurrently on a single GPU using two CUDA streams, with the sample code shown below. However, my GPU profiler shows that the torch.mm() calls on A and B are executed sequentially on the GPU (see the attached picture at the bottom). Is this due to the kernel size, i.e., a single 10000x10000 matmul already occupying the whole GPU? I wonder if anyone has a clue about what the reason might be. BTW: I am using a Volta GPU.
Thanks in advance.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()
A = torch.rand(10000, 10000, device=device)
B = torch.rand(10000, 10000, device=device)
torch.cuda.synchronize()
for i in range(10):
    with torch.cuda.stream(s1):
        C = torch.mm(A, A)
    with torch.cuda.stream(s2):
        D = torch.mm(B, B)
torch.cuda.synchronize()
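To test the kernel-size theory, one variant would be to repeat the same two-stream pattern with much smaller matrices, so that a single matmul cannot occupy all SMs by itself and overlap becomes at least possible. This is just a sketch of that idea (the 128x128 size is an arbitrary choice, and whether the kernels actually overlap would still need to be confirmed in the profiler):

```python
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    s1 = torch.cuda.Stream()
    s2 = torch.cuda.Stream()
    # Small matrices: one matmul launch should leave SMs free for the other.
    A = torch.rand(128, 128, device=device)
    B = torch.rand(128, 128, device=device)
    torch.cuda.synchronize()
    for _ in range(10):
        with torch.cuda.stream(s1):
            C = torch.mm(A, A)
        with torch.cuda.stream(s2):
            D = torch.mm(B, B)
    torch.cuda.synchronize()
    print("finished both streams")
else:
    # CUDA streams require a GPU; nothing to test on CPU.
    print("CUDA not available")
```

My understanding is that with 10000x10000 inputs each matmul kernel launches enough thread blocks to fill the device, so the second stream's kernel has no free SMs to run on and gets serialized even though the streams themselves are independent.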