Non-blocking device to host transfer

On the CUDA semantics page, there is information on non-blocking host-to-device data transfer. Is it also possible to do a non-blocking device-to-host transfer?

You can perform non-blocking data transfers by passing the non_blocking=True argument, e.g. tensor = tensor.to(device, non_blocking=True). This NVIDIA blog post gives some information about what's going on under the hood.
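For reference, here is a minimal sketch of the host-to-device case (tensor names and shapes are just illustrative). Note that the copy can only truly overlap with other GPU work if the source tensor lives in pinned (page-locked) host memory:

```python
import torch

# Illustrative setup: allocate the source tensor in pinned memory so the
# async copy can actually overlap with GPU work.
cpu_tensor = torch.randn(1024, 1024, pin_memory=True)
device = torch.device("cuda")

# Returns immediately; the copy is queued on the current CUDA stream.
gpu_tensor = cpu_tensor.to(device, non_blocking=True)

# ... other GPU work can be queued here while the copy is in flight ...

# Synchronize before relying on gpu_tensor being fully populated.
torch.cuda.synchronize()
```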

In case I misunderstood your question, could you clarify it a bit?

Sorry @ptrblck for not following up on this; it dropped down my backlog for a while. I think I've finally figured out how to do GPU->CPU transfers asynchronously. tensor.to() doesn't seem to let you specify the output buffer or request that the output be placed in pinned memory, so using tensor.to() for a GPU->CPU transfer is always synchronous. However, I was able to get an asynchronous transfer by pre-allocating a pinned CPU buffer and calling pinned_cpu_tensor.copy_(gpu_tensor, non_blocking=True). Can you confirm this is the correct way to achieve asynchronous GPU->CPU data transfer?
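For completeness, here is a minimal, runnable sketch of the pattern described above (shapes and names are just for illustration). The key points are that the destination buffer must be pinned, and the host must not read from it until the copy has actually finished:

```python
import torch

device = torch.device("cuda")
gpu_tensor = torch.randn(1024, 1024, device=device)

# Pre-allocate a pinned (page-locked) CPU buffer matching the GPU tensor.
pinned_cpu_tensor = torch.empty(
    gpu_tensor.shape, dtype=gpu_tensor.dtype, pin_memory=True
)

# Queue the device-to-host copy on the current CUDA stream; this call
# returns immediately instead of blocking the host.
pinned_cpu_tensor.copy_(gpu_tensor, non_blocking=True)

# ... the CPU can do unrelated work here while the copy is in flight ...

# The buffer is only safe to read after the stream has finished the copy.
torch.cuda.current_stream().synchronize()
print(pinned_cpu_tensor.sum())
```

Instead of synchronizing the whole stream, you could also record a torch.cuda.Event after the copy and query or wait on it, which lets you overlap more work before reading the buffer.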


I’m concerned about this as well. Can anyone confirm?