Non-blocking device to host transfer

On the CUDA semantics page, there is information on non-blocking host-to-device data transfer. Is it possible to do non-blocking device-to-host data transfer?

You can perform non-blocking data transfers by passing the non_blocking=True argument to e.g. tensor = tensor.to(). This NVIDIA blog post gives you some information on what's going on under the hood.
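For the host-to-device direction, a minimal sketch of what this looks like (the shapes here are arbitrary; note that the copy only truly overlaps with host work when the source CPU tensor is in pinned memory):

```python
import torch

# Source tensor in pinned (page-locked) host memory; with a pageable
# source, the "non-blocking" copy is effectively synchronous.
cpu_tensor = torch.randn(1024, 1024, pin_memory=True)

# Enqueue the H2D copy on the current CUDA stream and return immediately.
gpu_tensor = cpu_tensor.to("cuda", non_blocking=True)

# The host is free to do other work here. Kernels launched on the same
# stream are ordered after the copy, so this use is safe without an
# explicit synchronization.
result = gpu_tensor * 2
```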

In case I misunderstood your question, could you clarify it a bit?

Sorry @ptrblck for not following up on this. It dropped down my backlog for a while. I think I've finally figured out how to do GPU->CPU transfers asynchronously. Calling tensor.to() doesn't seem to let you specify the output buffer or request that the output be placed in pinned memory, so tensor.to() seemed to always be synchronous for GPU->CPU transfers. However, I think I was able to get an async transfer by creating a pinned-memory buffer on the CPU and using pinned_cpu_tensor.copy_(gpu_tensor, non_blocking=True). Can you confirm that this is the correct way to achieve asynchronous GPU->CPU data transfer?
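For reference, here is roughly what that approach looks like end to end (a sketch with arbitrary shapes; the torch.cuda.synchronize() at the end is needed before reading the buffer on the host):

```python
import torch

gpu_tensor = torch.randn(1024, 1024, device="cuda")

# Pre-allocate a pinned (page-locked) CPU buffer with matching shape/dtype.
pinned_cpu_tensor = torch.empty(
    gpu_tensor.shape, dtype=gpu_tensor.dtype, pin_memory=True
)

# Enqueue the D2H copy on the current stream; the call returns immediately.
pinned_cpu_tensor.copy_(gpu_tensor, non_blocking=True)

# ... overlap other host-side work here ...

# Wait for the copy to finish before reading the buffer on the CPU.
torch.cuda.synchronize()
print(pinned_cpu_tensor.sum())
```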


I’m concerned about this as well. Can anyone confirm?

Here is a related discussion for your reference:

Here is the TL;DR (a sketch combining the three points follows the list):

  1. gpu_tensor.to(device="cpu", non_blocking=True) will asynchronously copy a tensor from GPU to CPU pinned memory
  2. As you point out, pinned_cpu_tensor.copy_(gpu_tensor, non_blocking=True) will asynchronously copy a tensor from GPU to an existing CPU pinned memory tensor
  3. Once the asynchronous copy has been launched, you need to call torch.cuda.Stream.synchronize() (on the stream used for the copy) or torch.cuda.synchronize() to ensure the copy has finished before using the tensor's contents
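Putting the three points together, a sketch using a dedicated copy stream (the wait_stream call is there so the copy cannot start before gpu_tensor has actually been produced on the default stream):

```python
import torch

gpu_tensor = torch.randn(1024, 1024, device="cuda")  # produced on the default stream
pinned_buf = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype, pin_memory=True)

copy_stream = torch.cuda.Stream()
# Make the copy stream wait for the work that produced gpu_tensor.
copy_stream.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(copy_stream):
    # Point 2: launch the asynchronous D2H copy into pinned memory.
    pinned_buf.copy_(gpu_tensor, non_blocking=True)

# ... do independent work on the default stream / host here ...

# Point 3: block until the copy stream has drained before touching the buffer.
copy_stream.synchronize()
print(pinned_buf.mean())
```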