How is explicit `pin_memory` different from just calling `.to` and letting CUDA handle it?

My question is: which of these two is faster, and why?

  1. `t.to("cuda")`
  2. `t.pin_memory().to("cuda")`

I have this question because:

This CUDA blog post is referenced a lot in related discussions, and it says:

> the CUDA driver must first allocate a temporary page-locked, or “pinned”, host array, copy the host data to the pinned array, and then transfer the data from the pinned array to device memory

This reads to me as though CUDA will perform the pinning itself if the data is not already pinned. I understand that pinned memory is what lets us pass the async flag `non_blocking=True`. However, if instead of `.to("cuda", non_blocking=True)` we use the plain, blocking `.to("cuda")`, why would explicitly calling `torch.Tensor.pin_memory()` first make the data transfer faster than not calling it at all?
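
To be explicit about the async case I mean, here is the pattern as I understand it (the shape and the host-side work are just placeholders I picked):

```python
import torch

t = torch.randn(1024, 1024).pin_memory()  # explicitly pinned source tensor

# With a pinned source the copy can be enqueued without blocking the host ...
d = t.to("cuda", non_blocking=True)

# ... so unrelated CPU work can overlap with the DMA transfer.
cpu_side = sum(range(100_000))  # placeholder host-side work

# Synchronize (or launch a kernel that consumes `d`) before reading the result.
torch.cuda.synchronize()
print(d.sum().item(), cpu_side)
```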

Using pinned memory allows you to transfer the data asynchronously. This can be beneficial if you are running other workloads between the transfer and the actual usage of the data. However, if the very next call needs to consume the data, you wouldn’t expect to see any benefit from the async copy.
You can also let the DataLoader pin the memory (via its `pin_memory=True` argument) and move the batches to the device inside the training loop, which is a common use case.
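
A rough sketch of that pattern (the dataset, model, and batch size are just placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# placeholder dataset and model, just to show the structure
dataset = TensorDataset(torch.randn(512, 3, 32, 32), torch.randint(0, 10, (512,)))
loader = DataLoader(dataset, batch_size=64, pin_memory=True, num_workers=2)
model = torch.nn.Linear(3 * 32 * 32, 10).cuda()

for data, target in loader:
    # batches already live in pinned memory, so the copies can be queued
    # asynchronously and overlap with host-side work until they are consumed
    data = data.to("cuda", non_blocking=True)
    target = target.to("cuda", non_blocking=True)
    out = model(data.flatten(1))
    loss = torch.nn.functional.cross_entropy(out, target)
```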

Thank you for your prompt reply, Piotr!

Is it then correct to conclude that these two commands are equal in speed

  1. t.to("cuda", non_blocking=False)
  2. t.pin_memory().to("cuda", non_blocking=False)

and that `pin_memory` should always be paired with `non_blocking=True`?

You should profile the code, but the second one might be a bit faster (while still not async) as the staging buffer copy would be avoided.
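
E.g. a quick check with `torch.utils.benchmark` (the tensor shape is arbitrary and the numbers will depend on your system):

```python
import torch
from torch.utils import benchmark

t = torch.randn(64, 3, 224, 224)  # pageable CPU tensor

# Compare the blocking copy from pageable memory against pinning explicitly first.
for stmt in ["t.to('cuda')", "t.pin_memory().to('cuda')"]:
    timer = benchmark.Timer(stmt=stmt, globals={"t": t})
    print(timer.timeit(100))
```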