Hello, I want to test whether using to() with non_blocking=True actually achieves computation-communication overlap. To check this, I time a host-to-device tensor transfer via to() alongside a matrix multiplication of two tensors that already live on the GPU:
import torch
# Define the matrix size and operations
matrix_size = 10000
a = torch.randn((matrix_size, matrix_size), device='cuda:0')
b = torch.randn((matrix_size, matrix_size), device='cuda:0')
# Create timing events
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
c = torch.randn((5, matrix_size, matrix_size), device='cpu').pin_memory()  # pinned (page-locked) host memory, required for truly async H2D copies
# Test with non_blocking=False
start_event.record()
c_gpu = c.to('cuda:0', non_blocking=False)  # synchronous copy: the host blocks until the transfer finishes
_ = torch.matmul(a, b) # Perform a large matrix operation
end_event.record()
torch.cuda.synchronize()
time_non_blocking_false = start_event.elapsed_time(end_event)
print(f"Time with non_blocking=False: {time_non_blocking_false:.3f} ms")
c = torch.randn((5, matrix_size, matrix_size), device='cpu').pin_memory()  # fresh pinned host tensor for the second test
# Test with non_blocking=True
start_event.record()
c_gpu = c.to('cuda:0', non_blocking=True)  # asynchronous copy: enqueued without blocking the host
_ = torch.matmul(a, b) # Perform a large matrix operation
end_event.record()
torch.cuda.synchronize()
time_non_blocking_true = start_event.elapsed_time(end_event)
print(f"Time with non_blocking=True: {time_non_blocking_true:.3f} ms")
But the output is:
Time with non_blocking=False: 187.034 ms
Time with non_blocking=True: 186.822 ms
The non_blocking=True version is essentially no faster. I expected the copy to overlap with the matmul and reduce the total time. What's going on?
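For reference, here is what I imagined the overlap would look like if the copy were scheduled on its own stream. This is a minimal sketch based on my own understanding, assuming pinned host memory; copy_stream and c_gpu are names I made up, not anything from the docs:

import torch

matrix_size = 10000
a = torch.randn((matrix_size, matrix_size), device='cuda:0')
b = torch.randn((matrix_size, matrix_size), device='cuda:0')
c = torch.randn((5, matrix_size, matrix_size), device='cpu').pin_memory()

copy_stream = torch.cuda.Stream()  # side stream dedicated to the H2D copy

with torch.cuda.stream(copy_stream):
    # The copy is enqueued on copy_stream instead of the default stream
    c_gpu = c.to('cuda:0', non_blocking=True)

_ = torch.matmul(a, b)  # matmul runs on the default stream, ideally concurrently with the copy

# Make the default stream wait for the copy before anything uses c_gpu
torch.cuda.current_stream().wait_stream(copy_stream)
torch.cuda.synchronize()

Is an explicit side stream like this actually required for overlap, or should non_blocking=True on the default stream already let the transfer run concurrently with the matmul?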