When I use the following code to overlap H2D (host-to-device) and D2H (device-to-host) copies on an offload stream with compute in other streams, profiling shows that item()
blocks the CPU thread until the last D2H copy has completed. This seems unexpected, because cudaStreamSynchronize
should only synchronize the current stream and shouldn't affect other streams.
I’d like to understand:
- What causes this synchronization behavior?
- How can I avoid this unintended synchronization?
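For context, this is a minimal sketch of the kind of check I have in mind for whether synchronizing only the current stream also waits on offload_stream (the helper name check_cross_stream_wait is just for this sketch, it is not part of the test below):

import time
import torch

def check_cross_stream_wait():
    offload_stream = torch.cuda.Stream()
    gpu = torch.ones((16384, 16384), device='cuda')
    cpu = torch.empty_like(gpu, device='cpu', pin_memory=True)

    torch.cuda.synchronize()  # start from an idle device

    # Queue a batch of D2H copies on the offload stream only.
    with torch.cuda.stream(offload_stream):
        for _ in range(20):
            cpu.copy_(gpu, non_blocking=True)

    # Synchronize only the current stream and time it from the host side.
    t0 = time.perf_counter()
    torch.cuda.current_stream().synchronize()
    t1 = time.perf_counter()
    print(f"current_stream().synchronize() returned after {(t1 - t0) * 1e3:.2f} ms "
          f"while offload_stream was still busy")

    offload_stream.synchronize()  # drain the offload stream before returning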
Here’s my test code:
import torch

offload_stream = torch.cuda.Stream()

def test_stream():
    tensor_gpu = torch.ones((16384, 16384), device='cuda')
    tensor_cpu = torch.empty_like(tensor_gpu, device='cpu', pin_memory=True)
    seqlens = torch.ones((1, 10240), device='cuda')
    a = torch.ones((10240, 10240), device='cuda')
    seqlens = torch.matmul(seqlens, a)

    # H2D/D2H copies queued on the offload stream
    for _ in range(20):
        with torch.cuda.stream(offload_stream):
            tensor_cpu.copy_(tensor_gpu, non_blocking=True)
            tensor_gpu.copy_(tensor_cpu, non_blocking=True)
            tensor_gpu.record_stream(offload_stream)

    # compute on the current stream, followed by a blocking scalar read
    for _ in range(20):
        seqlens = torch.matmul(seqlens, a)
    seqlens[0][0].item()

    torch.cuda.current_stream().wait_stream(offload_stream)
Here is the profiler trace:
The test was run on an NVIDIA H20 GPU.
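To make the second question concrete, this is roughly the pattern I have in mind for avoiding the implicit synchronization: replacing the seqlens[0][0].item() line inside test_stream() with an explicit non-blocking copy plus an event. The names result_cpu and done are made up for this sketch, and I have not verified that it actually avoids the stall.

# Sketch only: replace the implicit sync inside item() with an explicit
# non-blocking D2H copy and an event recorded on the current stream.
result_cpu = torch.empty((), dtype=seqlens.dtype, pin_memory=True)
done = torch.cuda.Event()

result_cpu.copy_(seqlens[0][0], non_blocking=True)  # queued on the current stream
done.record()                                       # recorded on the current stream

done.synchronize()           # host waits for this event rather than the whole stream API call
value = result_cpu.item()    # safe to read once the event has completed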
Thank you for your insights!