non_blocking=True in PyTorch

The PyTorch documentation says: "Host to GPU copies are much faster when they originate from pinned (page-locked) memory. CPU tensors and storages expose a pin_memory() method, that returns a copy of the object, with data put in a pinned region. Also, once you pin a tensor or storage, you can use asynchronous GPU copies. Just pass an additional non_blocking=True argument to a to() or a cuda() call. This can be used to overlap data transfers with computation."

So the transfers should get faster. Instead, it is taking more time: with non_blocking=True the code takes over an hour to run, whereas without it the same code takes 11 minutes.
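For reference, a minimal sketch of the pattern the docs describe (the helper name to_device is my own; it assumes a CUDA machine and falls back to a plain copy otherwise):

```python
import torch

def to_device(batch, device):
    # pin_memory() returns a page-locked copy of the tensor; a subsequent
    # copy with non_blocking=True can then overlap with GPU computation.
    if device.type == "cuda":
        batch = batch.pin_memory()
        return batch.to(device, non_blocking=True)
    # Without CUDA, non_blocking has no effect; do an ordinary copy.
    return batch.to(device)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(4, 3)
y = to_device(x, device)
```

With a DataLoader, the equivalent is passing pin_memory=True to the loader so batches already arrive in page-locked memory.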


Non-blocking copies require page-locked (pinned) memory, which is no longer available to the rest of the system. If you pin so much that free host RAM drops too low, the OS may start using swap, which could explain the drastic slowdown. Could you check if that's the case in your setup?
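One quick way to check is to watch swap usage while the job runs, e.g. with free -h or vmstat. A rough Linux-only sketch (it assumes /proc/meminfo is present) that does the same from Python:

```python
def meminfo_kb():
    # Parse /proc/meminfo into {field: value-in-kB}.
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.strip().split()[0])  # values are reported in kB
    return info

m = meminfo_kb()
swap_used_kb = m["SwapTotal"] - m["SwapFree"]
print(f"available RAM: {m['MemAvailable']} kB, swap in use: {swap_used_kb} kB")
```

If "swap in use" climbs while MemAvailable collapses during training, the pinned allocations are starving the OS and that would account for the 11-minutes-to-1-hour regression.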