Non-blocking device to host transfer

sinafz · June 7, 2019, 5:48pm

Sorry @ptrblck for not following up on this. It dropped down my backlog for a while. I think I’ve finally figured out how to do GPU->CPU transfers asynchronously. As far as I can tell, tensor.to() doesn’t let you specify the output buffer or request that the output be placed in pinned memory, so using tensor.to() for a GPU->CPU transfer is always synchronous. However, I think I was able to get an async transfer by creating a pinned-memory buffer on the CPU and calling pinned_cpu_tensor.copy_(gpu_tensor, non_blocking=True). Can you confirm this is the correct way to achieve asynchronous GPU->CPU data transfer?
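
For reference, here is a minimal sketch of the pattern described above, not a confirmed recommendation. The helper name `start_copy_to_host` is hypothetical, and the CUDA event used to wait for the copy is my own addition: with non_blocking=True the call can return before the data has arrived, so the host has to synchronize before reading the buffer. The sketch falls back to a plain copy when the input is not on a GPU.

```python
import torch


def start_copy_to_host(src: torch.Tensor):
    """Start copying `src` to a CPU buffer (hypothetical helper).

    Returns (dst, done). `done` is a CUDA event to wait on before reading
    `dst`, or None if `src` was already on the CPU and no transfer was needed.
    """
    if not src.is_cuda:
        # No device-to-host transfer required; return an independent copy.
        return src.clone(), None

    # Page-locked (pinned) host buffer with the same shape and dtype.
    dst = torch.empty(src.shape, dtype=src.dtype, device="cpu", pin_memory=True)

    # Enqueue the copy on the current CUDA stream. This may return before
    # the data has actually been written into `dst`.
    dst.copy_(src, non_blocking=True)

    # Record an event right after the copy so the caller can wait for
    # this transfer alone rather than for the whole device.
    done = torch.cuda.Event()
    done.record()
    return dst, done


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.arange(6, dtype=torch.float32, device=device).reshape(2, 3)

host, done = start_copy_to_host(x)
# Other host-side work could overlap with the transfer here.
if done is not None:
    done.synchronize()  # `host` is only safe to read after this returns
```

Reading `host` before `done.synchronize()` (or an equivalent such as torch.cuda.synchronize()) risks seeing stale or partially written data.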