Hi,
I am trying to overlap host-to-device (H2D) memory transfers with computation using multiple CUDA streams in PyTorch, but I observe no overlap in practice. The following shows my code and the Nsight Systems profiling results.
Could someone help explain why the communication and computation do not overlap in this case? Thanks!
import torch

# Large host tensor (torch.randn already allocates on the CPU, so the
# .to('cpu', ...) call here is effectively a no-op)
tensor_cpu1 = torch.randn(100000, 10000).to('cpu', non_blocking=True)

A = torch.randn(10000, 10000, device='cuda:0')
B = torch.randn(10000, 10000, device='cuda:0')

stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

# Enqueue eight H2D copies on stream1 ...
with torch.cuda.stream(stream1):
    for _ in range(8):
        b = tensor_cpu1.to('cuda:0', non_blocking=True)

# ... and eight matmuls on stream2, expecting them to overlap with the copies
with torch.cuda.stream(stream2):
    for _ in range(8):
        C = torch.matmul(A, B)

torch.cuda.synchronize()
