Issue with dataloader using pin_memory = True

Hello, I’m seeing an odd issue with the pin_memory=True flag of the DataLoader. I’m measuring the time taken to transfer data from host RAM to GPU memory as follows:

    transfer_time_start = time.time()
    input = input.cuda(args.gpu, non_blocking=False)
    target = target.cuda(args.gpu, non_blocking=False)
    torch.cuda.synchronize()
    transfer_time.update(time.time()-transfer_time_start)

With pin_memory=True in the DataLoader, this gives me a transfer time of 0.03 s, which for a batch size of 256 translates into 256*224*224*3*4 bytes / 0.03 s ≈ 5.1 GB/s. That is a bit low for my CPU-GPU interconnect (x16, PCIe 3.0), which should deliver ~12 GB/s.
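For reference, here is the back-of-the-envelope bandwidth calculation (a rough sketch; it assumes ImageNet-style batches of 256 images of 3x224x224 float32):

    # rough effective-bandwidth estimate for the host-to-device copy
    # assumes a batch of 256 images of 3 x 224 x 224, float32 (4 bytes per element)
    bytes_per_batch = 256 * 3 * 224 * 224 * 4        # ~154 MB per batch
    print(bytes_per_batch / 0.03 / 1e9)              # ~5.1 GB/s at 0.03 s per copy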

I then tried calling pin_memory() manually on the tensor returned by the enumerate call, as shown below:

    for i, (input, target) in enumerate(train_loader):
        input = input.pin_memory()

        # measure data loading time
        data_time.update(time.time() - end)

        transfer_time_start = time.time()
        input = input.cuda(args.gpu, non_blocking=False)
        target = target.cuda(args.gpu, non_blocking=False)
        torch.cuda.synchronize()
        transfer_time.update(time.time() - transfer_time_start)

Now the transfer time dropped to 0.014 s, which translates to ~11 GB/s, as expected. Does anyone have any ideas why setting pin_memory=True in the DataLoader may not return a tensor that is already in pinned memory?
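As a quick sanity check (a minimal sketch, reusing the train_loader from above), one can ask the tensor itself whether it is pinned:

    # check whether the batch handed back by the DataLoader is actually in pinned memory
    for i, (input, target) in enumerate(train_loader):
        print(input.is_pinned(), target.is_pinned())   # expect True with pin_memory=True
        break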

Also attached below are two plots showing the transfer time (green plot) from host memory to the GPU.

This plot shows the transfer time when I call pin_memory() manually:

[plot: transfer time with manual pin_memory()]

You can see that the transfer time stays consistently low.

Whereas this one shows the transfer time without calling pin_memory() manually; here the transfer time is highly variable and averages around 0.03 s:

[plot: transfer time without manual pin_memory()]


I can’t speak to the manual approach as I haven’t tried it, but regarding pin_memory=True I observe in practice that it slows down training by about 2x (compared to False) – tested in PyTorch 0.4.1 and 1.0 and on two independent machines (one with 1080 Tis and one with Titan Vs). So, in practice, I abandoned it. I remember there was a thread where someone mentioned similar observations.

So, it may well be that there is a bug with pin_memory=True, especially since you observe that the manual approach results in the expected speedup.

Thanks for the reply. As my experiments confirm, transfer to the GPU is significantly faster for data in pinned memory, so it is worth doing. The issue is that the copy into pinned memory itself costs time, and it only saves time overall if it can be overlapped with other work. The DataLoader seems to do this: when pin_memory=True is set, it spins up a separate thread that copies each batch into pinned memory. When you call enumerate or next(iter), the DataLoader waits until a batch is available in pinned memory, so if the processing time on the GPU is sufficiently long, the latency of the pinned-memory copy should be hidden, at least partly.
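For reference, this is the usual pattern where pinned memory is supposed to pay off (a sketch; train_dataset, model and args are placeholders, and it is not what I used for the timings above, where I deliberately kept non_blocking=False):

    # pin_memory=True in the loader + non_blocking=True on the copy: the copy into
    # pinned memory happens on the DataLoader's pin-memory thread, and the H2D copy
    # does not block the host thread.
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=256, num_workers=4,
        shuffle=True, pin_memory=True)

    for input, target in train_loader:
        input = input.cuda(args.gpu, non_blocking=True)
        target = target.cuda(args.gpu, non_blocking=True)
        output = model(input)   # queued on the same stream, runs after the copy completes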

The question is why the transfer to the GPU is still slow even though the batch is already in pinned memory. One difference between the manual approach and the regular approach (not calling input.pin_memory() manually) is that in the manual approach the copy into pinned memory happens on the main thread, while in the regular approach it happens on the DataLoader’s pin-memory thread. Does this make a difference?
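Something like the following could test whether the pinning thread matters (just a sketch with a dummy batch; cpu_batch and pin_worker are made up for the experiment):

    import queue
    import threading
    import time

    import torch

    # pin the batch on a background thread, then time the H2D copy on the main thread
    cpu_batch = torch.randn(256, 3, 224, 224)      # dummy batch
    q = queue.Queue()

    def pin_worker(t, out_q):
        out_q.put(t.pin_memory())                  # pinning happens on this thread

    threading.Thread(target=pin_worker, args=(cpu_batch, q)).start()
    pinned = q.get()

    torch.cuda.synchronize()
    start = time.time()
    gpu_batch = pinned.cuda(non_blocking=False)    # synchronous copy on the main thread
    torch.cuda.synchronize()
    print('H2D copy: %.4f s' % (time.time() - start))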

I can also verify this, since I have the same observations: with pin_memory=True and num_workers=1 I see GPU utilization at ~40% throughout training, but with pin_memory=False and num_workers=4 the GPU utilization is at ~90%. I also see my CPU at full utilization (the fans kick in), whereas without pin_memory everything seems fine.