Simplest case:
```python
import torch
import torch.cuda.nvtx as nvtx
from contextlib import contextmanager

@contextmanager
def stream_wrapper(stream):
    # Enter the stream context before yielding; just yielding the
    # context-manager object would leave the previous stream active.
    with torch.cuda.stream(stream):
        yield

non_blocking = True
s1 = torch.cuda.current_stream()
cuda0 = torch.device('cuda:0')
a = torch.tensor([1., 2.]).pin_memory().to(device=cuda0)

for _ in range(5):
    nvtx.range_push('copy')
    with stream_wrapper(s1):
        a = a.to(device=torch.device('cpu'), non_blocking=non_blocking)
    s1.synchronize()
    with stream_wrapper(s1):
        a = a.to(device=cuda0, non_blocking=non_blocking)
    s1.synchronize()
    nvtx.range_pop()
```
If `non_blocking` is set to `False`, `a.to()` will only launch `cudaMemcpy`, just as it does on pageable memory. If I understand the two correctly, this is strange: the only difference between a `pin_memory()`-ed tensor and a normal one should be the virtual-memory settings on the CPU side, and `non_blocking` should only select which copy API is used, `cudaMemcpy` or `cudaMemcpyAsync`.
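The pinned/pageable difference can also be observed without a profiler. The following is a rough timing sketch (my own, assuming a CUDA-capable build; the tensor size and names are arbitrary): a `non_blocking=True` host-to-device copy from pinned memory should return almost immediately on the CPU side, while the same copy from pageable memory stays busy until the transfer is staged.

```python
import time
import torch

def time_h2d_copy(src: torch.Tensor) -> float:
    """Measure how long a non_blocking host-to-device copy keeps the CPU busy."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    src.to(device='cuda', non_blocking=True)
    elapsed = time.perf_counter() - start  # CPU-side launch time only
    torch.cuda.synchronize()               # wait for the copy to actually finish
    return elapsed

if torch.cuda.is_available():
    pageable = torch.empty(64 * 1024 * 1024)          # 256 MiB of ordinary pageable memory
    pinned = torch.empty_like(pageable).pin_memory()  # same size, page-locked
    assert pinned.is_pinned() and not pageable.is_pinned()
    # Expectation: the pinned copy launches asynchronously (tiny CPU time),
    # while the pageable copy is effectively synchronous despite non_blocking=True.
    print(f"pageable: {time_h2d_copy(pageable):.4f}s  pinned: {time_h2d_copy(pinned):.4f}s")
```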
My original purpose is to use pinned memory for fast, synchronized CPU-GPU swapping of tensors. Currently, I can only use a CUDA stream to wait for the `non_blocking` copy. If `tensor.to(non_blocking=False)` could achieve high performance on pinned tensors, I would not need to synchronize the stream explicitly.
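In the meantime, one workaround I have considered (a sketch of my own, not an official recipe; `swap_out` is a hypothetical helper) is to synchronize on a CUDA event recorded right after the copy instead of on the whole stream, so later work enqueued on the stream is not waited on:

```python
import torch

def swap_out(t: torch.Tensor) -> tuple[torch.Tensor, torch.cuda.Event]:
    """Start an async GPU->CPU copy into pinned memory; return the host tensor
    plus an event that completes when the copy is done."""
    host = torch.empty_like(t, device='cpu', pin_memory=True)
    host.copy_(t, non_blocking=True)
    done = torch.cuda.Event()
    done.record()  # marks the point on the current stream just after the copy
    return host, done

if torch.cuda.is_available():
    gpu_t = torch.randn(1024, 1024, device='cuda')
    host_t, done = swap_out(gpu_t)
    # ... other CPU work, or more kernels on the stream, can overlap here ...
    done.synchronize()  # waits only up to the recorded point, not for later work
    assert torch.equal(host_t, gpu_t.cpu())
```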