Should I ever use non_blocking=False?

What is the reason for having the non_blocking flag in various functions that transfer tensors between devices? And why is it set to False by default?

It is my understanding that all GPU operations are executed asynchronously by default (and, in fact, this is vital for achieving good performance). So why does PyTorch make an exception for transfers?

The only reasons mentioned in the docs for not using non_blocking are benchmarking and CUDA streams. However, these are pretty niche use cases, and one should probably use explicit synchronization for those anyway.

I’ve been spraying my code with non_blocking=True on the assumption that it will never hurt and might occasionally speed things up. Am I wrong about that?

For an async transfer you would need to use pinned memory, which is a limited system resource, which is why it’s not used by default.
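To make that concrete, here is a minimal sketch of the intended pattern (the helper name `to_device_async` is mine, and it assumes a CUDA device may or may not be present): allocate the host copy in pinned (page-locked) memory first, then transfer with non_blocking=True so the copy can overlap with other work.

```python
import torch

def to_device_async(cpu_tensor, device="cuda"):
    """Copy a CPU tensor to `device`.

    The copy is only truly asynchronous when the source lives in
    pinned (page-locked) host memory, so we pin it explicitly.
    """
    if torch.cuda.is_available():
        pinned = cpu_tensor.pin_memory()          # page-locked host buffer
        return pinned.to(device, non_blocking=True)
    return cpu_tensor  # fallback for machines without a GPU

x = torch.randn(1024, 1024)
y = to_device_async(x)  # overlaps with subsequent CPU work on a GPU machine
```

In a training loop the same effect is usually obtained by passing `pin_memory=True` to the `DataLoader` and then calling `.to(device, non_blocking=True)` on each batch.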

Does non_blocking force pinning?

Other answers on this forum (1, 2, 3) seem to imply that non_blocking=True makes the transfer asynchronous only if the CPU tensor is already pinned. Otherwise, it silently falls back to a synchronous transfer. But in that case no extra pinned memory would be used…
What am I missing here?

No, it doesn’t use pinned memory automatically, which is why it’s set to False. I would see this default as sticking to the “Zen of Python”: “Explicit is better than implicit.” Setting non_blocking=True and speculating that the memory might be pinned sounds wrong.
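You can check this yourself (a small sketch, assuming a reasonably recent PyTorch): passing non_blocking=True leaves the source tensor unpinned, so pinning has to be requested explicitly.

```python
import torch

src = torch.randn(8)
print(src.is_pinned())  # False: plain CPU tensors live in pageable memory

if torch.cuda.is_available():
    # non_blocking=True does NOT pin `src`; with a pageable source
    # the copy quietly runs synchronously.
    _ = src.to("cuda", non_blocking=True)
    assert not src.is_pinned()  # still unpinned afterwards

    # Explicit pinning is what actually enables the async path.
    pinned = src.pin_memory()
    assert pinned.is_pinned()
    _ = pinned.to("cuda", non_blocking=True)
```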

It seems to be falling short of that goal anyway. If the idea was to avoid obscure performance problems caused by a distant change, torch should have thrown an exception when an async transfer is not possible, or at least printed a warning.