Should `pin_memory` always be used with `non_blocking=True`?

Simplest case:

import torch
import torch.cuda.nvtx as nvtx
from contextlib import contextmanager

@contextmanager
def stream_wrapper(stream):
    # Run the enclosed ops on the given stream, with an NVTX range for profiling
    nvtx.range_push('copy')
    with torch.cuda.stream(stream):
        yield
    nvtx.range_pop()

non_blocking = True
s1 = torch.cuda.current_stream()
cuda0 = torch.device('cuda:0')

a = torch.tensor([1., 2.]).pin_memory().to(device=cuda0)
for _ in range(5):
    with stream_wrapper(s1):
        a = a.to(torch.device('cpu'), non_blocking=non_blocking)
    with stream_wrapper(s1):
        a = a.to(cuda0, non_blocking=non_blocking)

If non_blocking is set to False, the `.to()` call only launches cudaMemcpy, as it would on pageable memory. If I understand the two correctly, this is weird: the only difference between a pin_memory()-ed tensor and a normal one should be the virtual-memory settings on the CPU side (the pages are locked), while non_blocking should only select the copy method, cudaMemcpy versus cudaMemcpyAsync.
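To see whether pinning alone makes a difference, a minimal timing sketch can compare all four combinations of pinned/pageable memory and non_blocking True/False (the helper name `time_h2d_copy` and the tensor size are my own choices for illustration; timings are only meaningful on a machine with a CUDA device):

```python
import time
import torch

def time_h2d_copy(src, device, non_blocking, iters=100):
    # Time repeated host-to-device copies; synchronize at the end so that
    # asynchronous (cudaMemcpyAsync) copies are fully counted.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        src.to(device, non_blocking=non_blocking)
    torch.cuda.synchronize()
    return time.perf_counter() - start

if torch.cuda.is_available():
    dev = torch.device('cuda:0')
    pageable = torch.randn(1 << 20)    # ordinary pageable host tensor
    pinned = pageable.pin_memory()     # page-locked copy of the same data
    for nb in (False, True):
        print('pinned   non_blocking=%s: %.4f s' % (nb, time_h2d_copy(pinned, dev, nb)))
        print('pageable non_blocking=%s: %.4f s' % (nb, time_h2d_copy(pageable, dev, nb)))
```

Running this under a profiler (e.g. Nsight Systems) also shows which CUDA copy call each combination actually launches.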

My original purpose is to use pinned memory for fast, synchronous CPU-GPU swapping of tensors. Currently, I can only use a CUDA stream to wait for the non_blocking copy. I believe that if I could get high copy performance on pinned tensors without non_blocking, I would not need to explicitly synchronize the stream.
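For the swapping use case, one common pattern (a sketch under my own naming, not the only way to do it) is to launch the non_blocking copy into pinned memory on a side stream, record a CUDA event, and synchronize on that event only at the moment the CPU tensor is actually read. This keeps the copy asynchronous while making the wait explicit and as late as possible:

```python
import torch

def swap_out(t_gpu, copy_stream, done_event):
    # Start an async GPU->CPU copy into pinned memory on copy_stream;
    # return the CPU tensor and an event to wait on before reading it.
    t_cpu = torch.empty(t_gpu.shape, dtype=t_gpu.dtype, pin_memory=True)
    with torch.cuda.stream(copy_stream):
        t_cpu.copy_(t_gpu, non_blocking=True)  # cudaMemcpyAsync on this stream
        done_event.record(copy_stream)
    return t_cpu, done_event

if torch.cuda.is_available():
    stream = torch.cuda.Stream()
    event = torch.cuda.Event()
    t = torch.randn(1 << 20, device='cuda:0')
    t_cpu, ev = swap_out(t, stream, event)
    # ... other GPU/CPU work can overlap with the copy here ...
    ev.synchronize()  # block only when the CPU copy must be valid
    assert torch.equal(t_cpu, t.cpu())
```

The `swap_in` direction works the same way with `non_blocking=True` from a pinned source; the event replaces a full stream synchronization, so unrelated work queued on the same stream is not waited on.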