No overlap between communication and computation across CUDA streams in PyTorch

Hi,

I am trying to overlap host-to-device (H2D) memory transfers with computation using multiple CUDA streams in PyTorch, but I observe no overlap in practice. The following shows my code and the Nsight Systems profiling results.

Could someone help explain why the communication and computation do not overlap in this case? Thanks!

import torch

tensor_cpu1 = torch.randn(100000, 10000)  # created in pageable (non-pinned) host memory

A = torch.randn(10000, 10000, device='cuda:0')
B = torch.randn(10000, 10000, device='cuda:0')

stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

# Enqueue eight H2D copies on stream1
with torch.cuda.stream(stream1):
    for _ in range(8):
        b = tensor_cpu1.to('cuda:0', non_blocking=True)

# Enqueue eight matmuls on stream2
with torch.cuda.stream(stream2):
    for _ in range(8):
        C = torch.matmul(A, B)

torch.cuda.synchronize()

The CPU tensor is not in pinned (page-locked) host memory. Copies from pageable memory cannot be truly asynchronous: `non_blocking=True` is effectively ignored, and each `.to('cuda:0')` call blocks the host thread while the driver stages the transfer. As a result, all eight copies finish before the host ever reaches the matmul loop, so nothing can overlap. Allocate the source tensor with `tensor_cpu1 = torch.randn(100000, 10000, pin_memory=True)` instead, so the copies are enqueued asynchronously on `stream1` and the host can immediately launch the matmuls on `stream2`.
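A minimal sketch of the corrected pattern, assuming a CUDA device is available (sizes are reduced here for illustration, and the stream names are my own):

```python
import torch

# Allocate the host tensor in pinned (page-locked) memory so that
# non_blocking=True H2D copies can actually be asynchronous.
# pin_memory requires CUDA, so fall back to pageable memory without it.
use_cuda = torch.cuda.is_available()
tensor_cpu1 = torch.randn(4096, 4096, pin_memory=use_cuda)

if use_cuda:
    A = torch.randn(4096, 4096, device='cuda:0')
    B = torch.randn(4096, 4096, device='cuda:0')

    copy_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()

    # With a pinned source, these copies are enqueued without blocking
    # the host thread...
    with torch.cuda.stream(copy_stream):
        for _ in range(8):
            b = tensor_cpu1.to('cuda:0', non_blocking=True)

    # ...so the matmuls are launched right away on the other stream,
    # and the DMA engine can overlap the transfers with the compute.
    with torch.cuda.stream(compute_stream):
        for _ in range(8):
            C = torch.matmul(A, B)

    torch.cuda.synchronize()
```

One caveat: if the compute later consumes the copied tensor, you also need an event or `stream.wait_stream(...)` dependency between the two streams; here the copies and matmuls are intentionally independent, matching the original repro.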