How does DataLoader pin_memory=True help with data loading speed?

Suppose the original tensors in the dataset are not pinned. Then the pin_memory=True setting only (automatically) adds a pin operation, on the fly, for each tensor loaded from the dataset in that specific iteration (after collation, I believe). Isn’t that equivalent to these sequential steps?:

  1. pin an unpinned CPU tensor (which makes another copy, into non-pageable memory).
  2. (later in training) copy the pinned tensor to the corresponding GPU device.

These two steps combined won’t be faster than directly copying an unpinned CPU tensor to GPU memory. (This is different from repeatedly copying a pre-pinned CPU tensor to the GPU, where pinning does save transfer time.) By this logic, the pin_memory=True option in DataLoader only adds extra steps that are intrinsically sequential anyway, so how does it really help with data loading?
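For concreteness, the two paths above can be sketched as follows (a minimal illustration with made-up helper names; the timing claim is exactly what the question is asking about, not a verified result):

```python
import torch

def transfer_direct(t: torch.Tensor) -> torch.Tensor:
    # One step: copy a pageable (unpinned) CPU tensor to the GPU.
    # Under the hood the CUDA runtime stages it through an internal
    # pinned buffer before the DMA transfer.
    return t.to('cuda:0')

def transfer_via_pin(t: torch.Tensor) -> torch.Tensor:
    # Two steps, matching the list above:
    # 1. copy the pageable tensor into page-locked (pinned) memory
    pinned = t.pin_memory()
    # 2. copy the pinned tensor to the GPU
    return pinned.to('cuda:0')

if torch.cuda.is_available():
    t = torch.rand(1024, 1024)
    # Both paths produce the same result; the question is which is faster.
    assert torch.equal(transfer_direct(t).cpu(), transfer_via_pin(t).cpu())
```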

Note: this assumes num_workers=0 (the default and most common case). If num_workers is not 0, then the default process fork (I’m on Linux) won’t work with different processes pinning CPU tensors anyway (it complains about multiple CUDA initializations).

I think I must have some misunderstandings somewhere. Can someone help?

That’s not the case, as pinned memory allows you to trigger a non-blocking data transfer, which isn’t possible otherwise. Take a look at this description I’ve posted in an issue for more details about the internals and how async copies are used.
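To sketch the point above (with a hypothetical helper, not part of any PyTorch API): non_blocking=True only turns into a truly asynchronous copy when the source tensor already sits in pinned host memory; from pageable memory the runtime must stage the data first, so the copy is effectively synchronous.

```python
import torch

def to_device_async(t: torch.Tensor, device: str) -> torch.Tensor:
    """Copy `t` to `device`, asynchronously when possible.

    non_blocking=True only takes effect when the source lives in
    pinned (page-locked) host memory.
    """
    if device.startswith('cuda') and not t.is_pinned():
        t = t.pin_memory()  # one extra host-to-host copy
    # With a pinned source this returns immediately; the transfer
    # runs in the background on the copy stream.
    return t.to(device, non_blocking=True)

if torch.cuda.is_available():
    x = to_device_async(torch.rand(1024, 1024), 'cuda:0')
    torch.cuda.synchronize()  # wait for the async copy before consuming x
```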

Using pinned host memory won’t initialize a CUDA context, and num_workers>0 in a DataLoader with pin_memory=True works.


My bad. I re-tested it and you are right. I misconfigured something else that led to the previous error.

Follow-up question on this: does it mean that if I don’t use a non-blocking data transfer, then pin_memory won’t help? E.g. I only have one input tensor, which is used immediately after .to(device) (hence non-blocking doesn’t help). Consider the following code snippet:

import torch
from torch.utils.data import DataLoader, TensorDataset

data = torch.rand(1000, 1000, 1000)  # input data in CPU memory
dataset = TensorDataset(data)
dataloader = DataLoader(dataset, pin_memory=True)
for x, in dataloader:
  x = x.to('cuda:0', non_blocking=True)
  # do something with x immediately, e.g. feed it to some model on cuda:0

Is it fair to say that pin_memory=True and non_blocking=True won’t help in this case? They only help if:

  1. I do something else with CPU objects after the non-blocking transfer, or
  2. I have multiple input tensors (e.g. x, y, z, w, ...) so that the host-to-device transfers can be issued back to back in a non-blocking way (they are already pinned by the dataloader), or
  3. I use multiple workers in the dataloader (say num_workers=4) so that pinning for multiple batches/iterations can happen concurrently (not sure about this one: is the pin step still sequential?)

Am I getting it?

Not quite: even if a kernel has to wait for the transfer to finish, the CPU might still be able to run ahead and schedule the next kernel launch. Beyond that, you are right that an async transfer lets you overlap it with other work, assuming such work is available.
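A minimal sketch of that run-ahead behavior (hypothetical `step` helper, made-up sizes): even when the result is used “immediately”, the CPU only enqueues the kernels; with an async copy it never blocks waiting for the transfer, and stream ordering guarantees the matmul sees the finished copy.

```python
import torch

def step(model, x_pinned, device='cuda:0'):
    # The H2D copy is merely queued on the stream; control returns
    # immediately, so the CPU can go on to enqueue the forward-pass
    # kernels while the transfer is still in flight.
    x_gpu = x_pinned.to(device, non_blocking=True)
    return model(x_gpu)

if torch.cuda.is_available():
    model = torch.nn.Linear(1000, 10).to('cuda:0')
    out = step(model, torch.rand(64, 1000).pin_memory())
    torch.cuda.synchronize()  # only needed before reading `out` on the CPU
```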
