Hello, I want to test whether using to() with non_blocking=True actually achieves computation-communication overlap. To check this, I time a host-to-device tensor transfer via to() alongside a matrix multiplication of two tensors that already live on the GPU:
import torch
# Define the matrix size and operations
matrix_size = 10000
a = torch.randn((matrix_size, matrix_size), device='cuda:0')
b = torch.randn((matrix_size, matrix_size), device='cuda:0')
# Create timing events
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
c = torch.randn((5, matrix_size, matrix_size), device='cpu').pin_memory()  # pinned (page-locked) host memory, required for truly async H2D copies
# Test with non_blocking=False
start_event.record()
c_gpu = c.to('cuda:0', non_blocking=False)  # synchronous copy: the host blocks until the transfer finishes
_ = torch.matmul(a, b) # Perform a large matrix operation
end_event.record()
torch.cuda.synchronize()
time_non_blocking_false = start_event.elapsed_time(end_event)
print(f"Time with non_blocking=False: {time_non_blocking_false:.3f} ms")
c = torch.randn((5, matrix_size, matrix_size), device='cpu').pin_memory()  # fresh pinned host tensor for the second test
# Test with non_blocking=True
start_event.record()
c_gpu = c.to('cuda:0', non_blocking=True)  # asynchronous copy: enqueued without blocking the host
_ = torch.matmul(a, b) # Perform a large matrix operation
end_event.record()
torch.cuda.synchronize()
time_non_blocking_true = start_event.elapsed_time(end_event)
print(f"Time with non_blocking=True: {time_non_blocking_true:.3f} ms")
But the output is:
Time with non_blocking=False: 187.034 ms
Time with non_blocking=True: 186.822 ms
The non_blocking=True version is essentially no faster. I expected the copy to overlap with the matmul and reduce the total time. What's going on?
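For reference, here is what I imagined the overlap would look like if the copy were scheduled on its own stream. This is a minimal sketch based on my own understanding, assuming pinned host memory; copy_stream and c_gpu are names I made up, not anything from the docs:

import torch

matrix_size = 10000
a = torch.randn((matrix_size, matrix_size), device='cuda:0')
b = torch.randn((matrix_size, matrix_size), device='cuda:0')
c = torch.randn((5, matrix_size, matrix_size), device='cpu').pin_memory()

copy_stream = torch.cuda.Stream()  # side stream dedicated to the H2D copy

with torch.cuda.stream(copy_stream):
    # The copy is enqueued on copy_stream instead of the default stream
    c_gpu = c.to('cuda:0', non_blocking=True)

_ = torch.matmul(a, b)  # matmul runs on the default stream, ideally concurrently with the copy

# Make the default stream wait for the copy before anything uses c_gpu
torch.cuda.current_stream().wait_stream(copy_stream)
torch.cuda.synchronize()

Is an explicit side stream like this actually required for overlap, or should non_blocking=True on the default stream already let the transfer run concurrently with the matmul?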