Non-blocking device-to-host transfer

There is a related forum discussion on this for reference. Here is the TL;DR:

  1. `gpu_tensor.to(device="cpu", non_blocking=True)` asynchronously copies a tensor from the GPU into newly allocated pinned CPU memory.
  2. As you point out, `pinned_cpu_tensor.copy_(gpu_tensor, non_blocking=True)` asynchronously copies a tensor from the GPU into an existing pinned CPU tensor.
  3. Once the asynchronous copy has been launched, you must call `torch.cuda.Stream.synchronize()` (or `torch.cuda.synchronize()`) to ensure the copy has finished before reading the tensor's contents; otherwise you may observe stale or partially written data.
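The three steps above can be sketched as follows. This is a minimal example of the second pattern (preallocated pinned buffer plus `copy_`); the helper name `async_gpu_to_cpu` is just illustrative, and the CUDA parts are guarded so the script also runs on a CPU-only machine:

```python
import torch

def async_gpu_to_cpu(gpu_tensor: torch.Tensor) -> torch.Tensor:
    # Step 2: launch an async copy into a preallocated pinned CPU buffer.
    pinned = torch.empty_like(gpu_tensor, device="cpu", pin_memory=True)
    pinned.copy_(gpu_tensor, non_blocking=True)
    # Step 3: the copy is only enqueued on the stream at this point;
    # synchronize before reading `pinned`, or its contents may be stale.
    torch.cuda.synchronize()
    return pinned

if torch.cuda.is_available():
    x = torch.randn(4, 4, device="cuda")

    # Step 1 variant: .to() allocates the pinned destination for you.
    y1 = x.to(device="cpu", non_blocking=True)
    torch.cuda.synchronize()  # still required before use

    # Step 2 variant: copy into a buffer you control (reusable across calls).
    y2 = async_gpu_to_cpu(x)
    assert torch.equal(y1, y2)
```

The second variant is the one you usually want in a loop, since the pinned buffer can be allocated once and reused, avoiding a pinned-memory allocation per transfer.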