Simplest case:
```python
import torch
import torch.cuda.nvtx as nvtx
from contextlib import contextmanager

@contextmanager
def stream_wrapper(stream):
    # Enter the stream context before yielding; just yielding the
    # context-manager object would leave the previous stream active.
    with torch.cuda.stream(stream):
        yield

non_blocking = True
s1 = torch.cuda.current_stream()
cuda0 = torch.device('cuda:0')
a = torch.tensor([1., 2.]).pin_memory().to(device=cuda0)

for _ in range(5):
    nvtx.range_push('copy')
    with stream_wrapper(s1):
        a = a.to(device=torch.device('cpu'), non_blocking=non_blocking)
    s1.synchronize()
    with stream_wrapper(s1):
        a = a.to(device=cuda0, non_blocking=non_blocking)
    s1.synchronize()
    nvtx.range_pop()
```
If `non_blocking` is set to `False`, `a.to()` will only launch `cudaMemcpy`, just as it does on pageable memory. If I understand the two correctly, this is strange: the only difference between a `pin_memory()`-ed tensor and a normal one should be the virtual-memory settings on the CPU side, and `non_blocking` should only select which copy API is used, `cudaMemcpy` or `cudaMemcpyAsync`.
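The pinned/pageable difference can also be observed without a profiler. The following is a rough timing sketch (my own, assuming a CUDA-capable build; the tensor size and names are arbitrary): a `non_blocking=True` host-to-device copy from pinned memory should return almost immediately on the CPU side, while the same copy from pageable memory stays busy until the transfer is staged.

```python
import time
import torch

def time_h2d_copy(src: torch.Tensor) -> float:
    """Measure how long a non_blocking host-to-device copy keeps the CPU busy."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    src.to(device='cuda', non_blocking=True)
    elapsed = time.perf_counter() - start  # CPU-side launch time only
    torch.cuda.synchronize()               # wait for the copy to actually finish
    return elapsed

if torch.cuda.is_available():
    pageable = torch.empty(64 * 1024 * 1024)          # 256 MiB of ordinary pageable memory
    pinned = torch.empty_like(pageable).pin_memory()  # same size, page-locked
    assert pinned.is_pinned() and not pageable.is_pinned()
    # Expectation: the pinned copy launches asynchronously (tiny CPU time),
    # while the pageable copy is effectively synchronous despite non_blocking=True.
    print(f"pageable: {time_h2d_copy(pageable):.4f}s  pinned: {time_h2d_copy(pinned):.4f}s")
```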
My original purpose is to use pinned memory for fast, synchronized CPU-GPU swapping of tensors. Currently, I can only use a CUDA stream to wait for the `non_blocking` copy. If `tensor.to(non_blocking=False)` could achieve high performance on pinned tensors, I would not need to synchronize the stream explicitly.
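In the meantime, one workaround I have considered (a sketch of my own, not an official recipe; `swap_out` is a hypothetical helper) is to synchronize on a CUDA event recorded right after the copy instead of on the whole stream, so later work enqueued on the stream is not waited on:

```python
import torch

def swap_out(t: torch.Tensor) -> tuple[torch.Tensor, torch.cuda.Event]:
    """Start an async GPU->CPU copy into pinned memory; return the host tensor
    plus an event that completes when the copy is done."""
    host = torch.empty_like(t, device='cpu', pin_memory=True)
    host.copy_(t, non_blocking=True)
    done = torch.cuda.Event()
    done.record()  # marks the point on the current stream just after the copy
    return host, done

if torch.cuda.is_available():
    gpu_t = torch.randn(1024, 1024, device='cuda')
    host_t, done = swap_out(gpu_t)
    # ... other CPU work, or more kernels on the stream, can overlap here ...
    done.synchronize()  # waits only up to the recorded point, not for later work
    assert torch.equal(host_t, gpu_t.cpu())
```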