Non-blocking memory transfer to GPU

Hi,

I have a few questions about non-blocking memory transfer from CPU to GPU.

  1. CUDA operations are asynchronous, so as far as I understand, when I run x.square() (for an x that is on the GPU) I actually just enqueue the square kernel to be run at a later time. Why, then, can’t all memory transfers from CPU to GPU be asynchronous? Couldn’t code like the following first enqueue a “move to GPU memory” operation followed by an “execute the square kernel” operation?
x = x.to('cuda', non_blocking=True)
x = x.square()
  2. Why can non-blocking transfers only happen for memory-pinned (page-locked) tensors?
  3. What happens if I run the following?
x = x.to('cuda', non_blocking=True)
...
x = x.to('cuda')

Does the answer differ if the second call happens after the data of x has already been transferred? Specifically, if I use a non-blocking transfer and somewhere later in the code a blocking transfer occurs, do I risk losing the benefits?

Thanks,
Yiftach

You can make data transfers asynchronous if you pin the corresponding host memory. Since allocating a lot of pinned memory might cause issues, it’s not the default, but it can be enabled in the DataLoader.
These docs explain the sync/async behavior in more detail.
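
As a rough sketch (the dataset, sizes, and batch size here are made up, not from your code), the usual pattern looks like this:

import torch
from torch.utils.data import DataLoader, TensorDataset

# pin_memory=True makes the DataLoader collate each batch into
# page-locked host buffers, so the copies below can be truly non-blocking.
dataset = TensorDataset(torch.randn(1000, 32), torch.randn(1000, 1))
loader = DataLoader(dataset, batch_size=64, pin_memory=True)

device = torch.device('cuda')
for inputs, targets in loader:
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # Kernels queued on the same stream wait for these copies to finish,
    # so using inputs/targets here is safe without a manual sync.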

Thanks @ptrblck, I’m aware of pinning memory in the DataLoader. The text in the link has me confused:

  1. For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.
  2. For transfers from pinned host memory to device memory, the function is synchronous with respect to the host.

This sounds like the opposite of what the PyTorch docs say (pinned can be async; the default, non-pinned, is sync).
I’m aware the misunderstanding is most likely on my side, but I would still love a more in-depth explanation (or a reference to a more detailed resource). I’d also still love an answer to each of the questions above, if possible.
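
To make my confusion concrete, here is a small experiment I would run (my own sketch, not from the docs; sizes are arbitrary and a CUDA device is assumed) to see how long the host is actually blocked in each case:

import time
import torch

size = (64, 1024, 1024)  # ~256 MB of float32
pageable = torch.randn(size)            # ordinary pageable host memory
pinned = torch.randn(size).pin_memory() # page-locked host memory

torch.cuda.synchronize()
t0 = time.perf_counter()
pageable.to('cuda', non_blocking=True)  # pageable source: the call blocks the host
print(f"pageable: host blocked for {time.perf_counter() - t0:.4f}s")

torch.cuda.synchronize()
t0 = time.perf_counter()
pinned.to('cuda', non_blocking=True)    # pinned source: returns before the DMA finishes
print(f"pinned:   host blocked for {time.perf_counter() - t0:.4f}s")
torch.cuda.synchronize()                # wait for the in-flight copy before exiting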

Can you elaborate on the issues that can be caused by allocating a lot of pinned memory, or give a link to a resource? In my project, where I’m hunting OOM issues (Out of Memory after 2 hours - audio - PyTorch Forums), I’ve actually set pin_memory to True and I’m interested in whether this might have caused the problems.

Pinned memory is page-locked, i.e. the OS cannot swap it out. Allocating a lot of it therefore shrinks the pool of physical memory available to the rest of the system, which can lead to slowdowns or host-side out-of-memory errors.
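
For reference, a minimal sketch (my own, assuming a CUDA device is available) of pinning a tensor manually, outside of the DataLoader:

import torch

x = torch.randn(1024, 1024)   # ordinary pageable host memory
x_pinned = x.pin_memory()     # copy into page-locked host memory
print(x_pinned.is_pinned())   # True

# Only now can the host-to-device copy actually be asynchronous:
y = x_pinned.to('cuda', non_blocking=True)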
