The cuda-semantics page mentions non-blocking host-to-device data transfers. Is it actually possible to do a non-blocking host-to-device transfer?
You can use non-blocking data transfers via the `non_blocking=True` argument in e.g. `tensor = tensor.to()`. This NVIDIA blog post gives some information on what's going on under the hood.
In case I misunderstood your question, could you clarify it a bit?
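For example, something like this should work (a minimal sketch; note that the source CPU tensor has to be in pinned memory for the copy to actually run asynchronously):

```python
import torch

# The source must live in pinned (page-locked) memory; otherwise the
# non_blocking flag has no effect and the copy runs synchronously.
cpu_tensor = torch.randn(1024, 1024).pin_memory()

# Returns immediately; the copy is queued asynchronously on the current stream.
gpu_tensor = cpu_tensor.to("cuda", non_blocking=True)

# Ops on the same stream are ordered after the copy, so using
# gpu_tensor in subsequent GPU work is safe.
out = gpu_tensor * 2
```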
Sorry @ptrblck for not following up on this; it dropped down my backlog for a while. I think I've finally figured out how to do a GPU->CPU transfer asynchronously. Calling `tensor.to()` doesn't seem to let you specify the output buffer or request that the output be placed in pinned memory, so using `tensor.to()` for GPU->CPU transfer is always synchronous. However, I think I was able to get an async transfer by creating a pinned-memory buffer on the CPU and using `pinned_cpu_tensor.copy_(gpu_tensor, non_blocking=True)`. Can you confirm this is the correct way to achieve asynchronous GPU->CPU data transfer?
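Concretely, the pattern I ended up with looks roughly like this (shapes are just for illustration):

```python
import torch

gpu_tensor = torch.randn(1024, 1024, device="cuda")

# Pre-allocate a pinned (page-locked) CPU buffer with matching shape/dtype.
pinned_cpu_tensor = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype,
                                pin_memory=True)

# Launch the device-to-host copy; this call returns before the copy finishes.
pinned_cpu_tensor.copy_(gpu_tensor, non_blocking=True)

# ... other CPU work can overlap with the transfer here ...

# Block until the queued GPU work (including the copy) has finished
# before reading the buffer on the host.
torch.cuda.synchronize()
print(pinned_cpu_tensor.sum())
```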
I’m concerned about this as well. Can anyone confirm?
Here is a related discussion FYR:
Here is the TLDR:
- `gpu_tensor.to(device="cpu", non_blocking=True)` will asynchronously copy a tensor from the GPU to CPU pinned memory
- As you point out, `pinned_cpu_tensor.copy_(gpu_tensor, non_blocking=True)` will asynchronously copy a tensor from the GPU to an existing CPU pinned-memory tensor
- Once the asynchronous copy has been launched, you need to use `torch.cuda.Stream.synchronize()` or `torch.cuda.synchronize()` to ensure the copy has finished before using its content (see the sketch after this list)
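A rough sketch of the stream-scoped variant (the stream and tensor names are just for illustration):

```python
import torch

gpu_tensor = torch.randn(1024, 1024, device="cuda")
pinned_cpu_tensor = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype,
                                pin_memory=True)

copy_stream = torch.cuda.Stream()
# Make sure the copy waits for whatever produced gpu_tensor
# on the current (default) stream.
copy_stream.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(copy_stream):
    # Enqueued on copy_stream; the call returns immediately.
    pinned_cpu_tensor.copy_(gpu_tensor, non_blocking=True)

# ... overlap other default-stream GPU work or CPU work here ...

# Wait only for copy_stream to drain, instead of the whole device.
copy_stream.synchronize()
print(pinned_cpu_tensor.mean())
```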