Understanding pin_memory performance boosts

I have my training data in CPU memory. I tested datasets and dataloaders under the following three scenarios:

  1. Do not use pin_memory at all.
  2. Set pin_memory=True in the DataLoader, along with an async CPU=>GPU tensor copy (non_blocking=True) before the model's forward call. The original tensors in the dataset (on CPU) are not pinned.
  3. Pin the whole tensor in advance, before feeding it to the dataset (then it doesn't matter whether pin_memory is set in the DataLoader), and use the same async copy as in 2. A sketch of setups 2 and 3 follows this list.
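For concreteness, here is a minimal sketch of what I mean by 2 and 3. The tensor shape, batch size, and manual slicing are made-up stand-ins for whatever the dataset/dataloader actually does per batch:

```python
import torch

device = torch.device("cuda")

# Hypothetical tensor; shape and batch size are only illustrative.
data = torch.randn(10_000, 3, 32, 32)

# Scenario 2: pin each batch on the fly (this is what the DataLoader's
# pin_memory=True does for every batch it yields), then copy async.
batch = data[:256].pin_memory()                  # pageable -> page-locked copy
gpu_batch = batch.to(device, non_blocking=True)  # true async H2D copy

# Scenario 3: pin the whole tensor once, up front.
pinned = data.pin_memory()                        # one big page-locked allocation
view = pinned[256:512]                            # views of pinned memory stay pinned
gpu_batch2 = view.to(device, non_blocking=True)   # also a true async copy

torch.cuda.synchronize()  # wait for the async copies before using the results
```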

I noticed that 2 and 3 both give significant time savings over 1 (>30% in this test case), but 2 and 3 have almost identical time cost. I want to understand why.

My understanding of pin_memory is that it copies the CPU tensor into non-pageable (page-locked) CPU memory, which (a) performs the pageable=>non-pageable staging copy ahead of the transfer, and (b) enables a truly async copy to GPU memory. If that's the case, wouldn't 3 above give even more savings by avoiding pinning tensors again and again across dataloader iterations?
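One detail that is easy to verify, and worth noting for the question below: pin_memory() copies rather than moves, so the original pageable tensor is left untouched:

```python
import torch

t = torch.randn(1024)
p = t.pin_memory()                     # returns a *copy* in page-locked memory
print(t.is_pinned())                   # False -- the original is untouched
print(p.is_pinned())                   # True
print(p.data_ptr() == t.data_ptr())    # False -- a copy, not a move
```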

Since I'm not seeing any improvement from 2 to 3, is it fair to say that the "pin" operation itself is not of great cost (so it's OK to do it repeatedly in dataloader iterations)? That is, the savings come from the async/overlapped operations after the pin, and the pin itself is almost free. Is this true in general? And are there cases where it is beneficial to pre-pin the whole dataset to avoid repetitive pins in the dataloader (as in 3 above)?
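In case it helps frame the question, this is roughly how I would time the pin step in isolation. It is only a rough sketch: the tensor size is arbitrary, and a real measurement should warm up and average over many iterations:

```python
import time
import torch

t = torch.randn(256, 3, 224, 224)  # arbitrary batch-sized tensor (~150 MB)

# Cost of the pin alone: a pageable -> page-locked copy on the host.
start = time.perf_counter()
p = t.pin_memory()
pin_ms = (time.perf_counter() - start) * 1e3

# Cost of the H2D copy from pinned memory (synchronize so the timer
# actually covers the async copy).
torch.cuda.synchronize()
start = time.perf_counter()
g = p.to("cuda", non_blocking=True)
torch.cuda.synchronize()
copy_ms = (time.perf_counter() - start) * 1e3

print(f"pin: {pin_ms:.2f} ms, pinned H2D copy: {copy_ms:.2f} ms")
```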

Or could it be the other way around: pre-pinning the whole CPU dataset is always a bad idea, and you should just do the on-the-fly pinning of the standard dataloader (which gives almost identical performance and avoids the other headaches of pre-pinning a huge dataset)?