When to set pin_memory to true?

From the imagenet example:

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
        num_workers=args.workers, pin_memory=True, sampler=train_sampler)

    val_loader = torch.utils.data.DataLoader(
        datasets.ImageFolder(valdir, transforms.Compose([...])),
        batch_size=args.batch_size, shuffle=False,
        num_workers=args.workers, pin_memory=True)

Both data loaders set pin_memory to true, while the default value is false. Are there good reasons to do this?


If you load your samples in the Dataset on the CPU and push them to the GPU during training, you can speed up the host-to-device transfer by enabling pin_memory.
This makes the DataLoader allocate the batches in page-locked (pinned) memory, which speeds up the transfer.
You can find more information on the NVIDIA blog.
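A minimal sketch of the pattern described above: pinned batches combined with `non_blocking=True` on the device copy, which lets the transfer overlap with compute. The toy dataset and shapes here are made up for illustration; the sketch falls back to the CPU when no GPU is present.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset living in ordinary (pageable) CPU memory.
dataset = TensorDataset(torch.randn(256, 3, 32, 32),
                        torch.randint(0, 10, (256,)))

# pin_memory=True makes the DataLoader copy each collated batch into
# page-locked (pinned) host memory, from which DMA transfers to the GPU
# are faster than from pageable memory.
loader = DataLoader(dataset, batch_size=64, pin_memory=True)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
for images, targets in loader:
    # With pinned batches, non_blocking=True lets the host-to-device copy
    # overlap with computation on the GPU.
    images = images.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    break  # one batch is enough for the sketch
```

Note that `non_blocking=True` only pays off when the source tensor is pinned; for pageable memory the copy is synchronous anyway.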


Thanks. This means pin_memory should always be true if you are training on an NVIDIA GPU, right?

Generally, yes, if you are loading your data on the CPU.
It won't work if you already push the data to the GPU inside the Dataset (e.g. because it's small enough to fit), since only CPU tensors can be pinned, but that's pretty obvious.


If I store the paths to my images in my Dataset (as I can't fit the dataset into RAM) and only load the images in `__getitem__`, is it still beneficial to use pin_memory?


Yes. As far as I know, that’s the exact use case for speeding up the transfer between the host and device.
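The lazy-loading pattern from the question might look like the sketch below: the Dataset holds only paths, each sample is read from disk inside `__getitem__`, and the DataLoader still pins the collated CPU batches. The `LazyFileDataset` name and the use of `torch.save`/`torch.load` as a stand-in for an image decoder are assumptions for illustration.

```python
import os
import tempfile
import torch
from torch.utils.data import Dataset, DataLoader

class LazyFileDataset(Dataset):
    """Stores only file paths; loads each sample on demand in __getitem__."""
    def __init__(self, paths):
        self.paths = paths  # cheap: just strings in RAM

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Each sample is read from disk only when requested, so the full
        # dataset never has to fit into RAM. A real image pipeline would
        # use e.g. PIL.Image.open plus transforms here instead.
        return torch.load(self.paths[idx])

# Dummy sample files standing in for images on disk.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(8):
    p = os.path.join(tmpdir, f'sample_{i}.pt')
    torch.save(torch.randn(3, 8, 8), p)
    paths.append(p)

# pin_memory still applies: the CPU tensors returned by __getitem__ are
# collated into batches, and those batches are copied into pinned memory.
loader = DataLoader(LazyFileDataset(paths), batch_size=4, pin_memory=True)
batch = next(iter(loader))
```

The pinning happens after collation, so where the individual samples come from (disk, decoding, augmentation) makes no difference to it.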


@ptrblck Surprisingly, I am getting better speed with pin_memory=False. The CPU memory usage has also dropped by a factor of 2 or 3 with pin_memory=False. Just curious what the reason might be.


We had a recent issue, which should have been fixed by now. Are you seeing the same behavior using the latest nightly build?

This issue occurs in PyTorch 1.1.0. I have not tried the latest nightly build. My device has CUDA 9.0 (I cannot upgrade CUDA), which I guess is not supported by the latest nightly builds.

The binaries ship with their own CUDA and cuDNN versions, so you don't have to install CUDA locally (just the NVIDIA drivers).

Oh okay. Thanks for the help 🙂


When setting pin_memory=True, many more processes are forked than with pin_memory=False, even with num_workers=0 in all cases. What is the reason for that? For me, pin_memory=False is much faster in this case.

The tensors come out of my dataloader as CPU tensors and before running my network I call to(device=MY_GPU_ID). Is that wrong? How else should I specify on which GPU I want the tensor to be?

PyTorch is 1.2.0.



We recently had an issue where too many threads were spawned when pin_memory=True was set, which should be fixed in the latest release.
Could you update your PyTorch installation and check the CPU usage again?

On PyTorch 1.13 I am still getting faster runtimes with pin_memory=False and num_workers=20. I have 4 NVIDIA P100s on a 256 GB IBM Power8 machine.

The tensors that come out of the DataLoader are on the CPU (at least that's what print(t.device) says), even if pin_memory=True is set.

What's the correct way to load these directly onto the GPU?

pin_memory=True uses pinned host memory for faster copies, as described here.
Push your tensors to the device via x = x.to('cuda').
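To target a particular GPU rather than the current default device, a `torch.device` object can be passed to `.to()`. A small sketch of this, falling back to the CPU so it runs on any machine (the index `0` is just an example):

```python
import torch

# 'cuda:0' selects the first GPU; 'cuda:1' would select the second, etc.
# Fall back to the CPU here so the sketch also runs without a GPU.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

x = torch.randn(4, 4)                # a batch as it comes off the DataLoader
x = x.to(device, non_blocking=True)  # non_blocking pays off with pinned batches
```

Passing an explicit device like this is equivalent to `x.to('cuda:0')`, but keeps the target configurable in one place.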


Sorry for bumping this rather old thread, but how does this work with two or more GPUs?

In my use case, I have two experiments running in parallel, two different models. I use pin_memory=True in one experiment with the cuda:0 device. Then I run another experiment with cuda:1, and it fails with an OOM error because it attempts to allocate memory on the other GPU. Is this expected behavior?