Non-blocking device-to-host transfer

There is a related forum discussion on this for reference. Here is the TL;DR:

  1. `gpu_tensor.to(device="cpu", non_blocking=True)` asynchronously copies a tensor from the GPU into newly allocated pinned CPU memory.
  2. As you point out, `pinned_cpu_tensor.copy_(gpu_tensor, non_blocking=True)` asynchronously copies a tensor from the GPU into an existing pinned CPU tensor.
  3. Once the asynchronous copy has been launched, you must call `torch.cuda.Stream.synchronize()` (or `torch.cuda.synchronize()`) to ensure the copy has finished before reading the tensor's contents; otherwise you may observe stale or partially written data.
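The three steps above can be sketched as follows. This is a minimal example of the second pattern (preallocated pinned buffer plus `copy_`); the helper name `async_gpu_to_cpu` is just illustrative, and the CUDA parts are guarded so the script also runs on a CPU-only machine:

```python
import torch

def async_gpu_to_cpu(gpu_tensor: torch.Tensor) -> torch.Tensor:
    # Step 2: launch an async copy into a preallocated pinned CPU buffer.
    pinned = torch.empty_like(gpu_tensor, device="cpu", pin_memory=True)
    pinned.copy_(gpu_tensor, non_blocking=True)
    # Step 3: the copy is only enqueued on the stream at this point;
    # synchronize before reading `pinned`, or its contents may be stale.
    torch.cuda.synchronize()
    return pinned

if torch.cuda.is_available():
    x = torch.randn(4, 4, device="cuda")

    # Step 1 variant: .to() allocates the pinned destination for you.
    y1 = x.to(device="cpu", non_blocking=True)
    torch.cuda.synchronize()  # still required before use

    # Step 2 variant: copy into a buffer you control (reusable across calls).
    y2 = async_gpu_to_cpu(x)
    assert torch.equal(y1, y2)
```

The second variant is the one you usually want in a loop, since the pinned buffer can be allocated once and reused, avoiding a pinned-memory allocation per transfer.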