Non-blocking device to host transfer

On the CUDA semantics page, there is information on non-blocking host-to-device data transfer. Is it possible to do non-blocking device-to-host data transfer?

You can perform non-blocking data transfers by passing the non_blocking=True argument to e.g. tensor = tensor.to(). This NVIDIA blog post gives you some information on what's going on under the hood.
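For the host-to-device direction, a minimal sketch of what this looks like (the shapes here are arbitrary; note that the copy only truly overlaps with host work when the source CPU tensor is in pinned memory):

```python
import torch

# Source tensor in pinned (page-locked) host memory; with a pageable
# source, the "non-blocking" copy is effectively synchronous.
cpu_tensor = torch.randn(1024, 1024, pin_memory=True)

# Enqueue the H2D copy on the current CUDA stream and return immediately.
gpu_tensor = cpu_tensor.to("cuda", non_blocking=True)

# The host is free to do other work here. Kernels launched on the same
# stream are ordered after the copy, so this use is safe without an
# explicit synchronization.
result = gpu_tensor * 2
```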

In case I misunderstood your question, could you clarify it a bit?

Sorry @ptrblck for not following up on this. It dropped down my backlog for a while. I think I've finally figured out how to do GPU->CPU transfers asynchronously. Calling tensor.to() doesn't seem to let you specify the output buffer or request that the output be placed in pinned memory, so tensor.to() seemed to always be synchronous for GPU->CPU transfers. However, I think I was able to get an async transfer by creating a pinned-memory buffer on the CPU and using pinned_cpu_tensor.copy_(gpu_tensor, non_blocking=True). Can you confirm that this is the correct way to achieve asynchronous GPU->CPU data transfer?
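For reference, here is roughly what that approach looks like end to end (a sketch with arbitrary shapes; the torch.cuda.synchronize() at the end is needed before reading the buffer on the host):

```python
import torch

gpu_tensor = torch.randn(1024, 1024, device="cuda")

# Pre-allocate a pinned (page-locked) CPU buffer with matching shape/dtype.
pinned_cpu_tensor = torch.empty(
    gpu_tensor.shape, dtype=gpu_tensor.dtype, pin_memory=True
)

# Enqueue the D2H copy on the current stream; the call returns immediately.
pinned_cpu_tensor.copy_(gpu_tensor, non_blocking=True)

# ... overlap other host-side work here ...

# Wait for the copy to finish before reading the buffer on the CPU.
torch.cuda.synchronize()
print(pinned_cpu_tensor.sum())
```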


I’m concerned about this as well. Can anyone confirm?

Here is a related discussion for your reference:

Here is the TL;DR (a sketch combining the three points follows the list):

  1. gpu_tensor.to(device="cpu", non_blocking=True) will asynchronously copy a tensor from GPU to CPU pinned memory
  2. As you point out, pinned_cpu_tensor.copy_(gpu_tensor, non_blocking=True) will asynchronously copy a tensor from GPU to an existing CPU pinned memory tensor
  3. Once the asynchronous copy has been launched, you need to call torch.cuda.Stream.synchronize() (on the stream used for the copy) or torch.cuda.synchronize() to ensure the copy has finished before using the tensor's contents
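Putting the three points together, a sketch using a dedicated copy stream (the wait_stream call is there so the copy cannot start before gpu_tensor has actually been produced on the default stream):

```python
import torch

gpu_tensor = torch.randn(1024, 1024, device="cuda")  # produced on the default stream
pinned_buf = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype, pin_memory=True)

copy_stream = torch.cuda.Stream()
# Make the copy stream wait for the work that produced gpu_tensor.
copy_stream.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(copy_stream):
    # Point 2: launch the asynchronous D2H copy into pinned memory.
    pinned_buf.copy_(gpu_tensor, non_blocking=True)

# ... do independent work on the default stream / host here ...

# Point 3: block until the copy stream has drained before touching the buffer.
copy_stream.synchronize()
print(pinned_buf.mean())
```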