When to set pin_memory to true?

From the imagenet example:

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
        num_workers=args.workers, pin_memory=True, sampler=train_sampler)

    val_loader = torch.utils.data.DataLoader(
        datasets.ImageFolder(valdir, transforms.Compose([...])),
        batch_size=args.batch_size, shuffle=False,
        num_workers=args.workers, pin_memory=True)

Both data loaders set pin_memory to True, while the default value is False. Are there good reasons to do this?


If you load your samples in the Dataset on the CPU and push them to the GPU during training, you can speed up the host-to-device transfer by enabling pin_memory.
This makes the DataLoader allocate the samples in page-locked (pinned) memory, which speeds up the transfer.
You can find more information on the NVIDIA blog.
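As a rough sketch of that pattern (the toy dataset and sizes here are made up for illustration): pin the host-side batches in the DataLoader, then copy them to the GPU with non_blocking=True so the transfer can overlap compute:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for a CPU-side dataset (shapes are arbitrary).
dataset = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, 10, (64,)))

use_cuda = torch.cuda.is_available()
loader = DataLoader(dataset, batch_size=16, pin_memory=use_cuda)

device = torch.device("cuda" if use_cuda else "cpu")
for images, labels in loader:
    # With pinned source memory, non_blocking=True lets the host-to-device
    # copy overlap with computation on the default CUDA stream.
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward / backward pass ...
```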


Thanks. This means pin_memory should always be True if you are training on an NVIDIA GPU, right?


Generally, yes, if you are loading your data on the CPU.
It won’t help if your dataset is really small and you already push it to the GPU in the Dataset, but that’s pretty obvious.


If I store the paths to my images in my Dataset (as I can’t fit the dataset into RAM) and only load the images in the __getitem__ function, is it still beneficial to use pin_memory?


Yes. As far as I know, that’s the exact use case for speeding up the transfer between the host and device.
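A minimal sketch of that lazy-loading pattern (the class name and the `loader` callable are hypothetical; in practice you would decode the image with PIL/torchvision inside it):

```python
import torch
from torch.utils.data import Dataset


class LazyImageDataset(Dataset):
    """Sketch: store only file paths, load pixel data on demand."""

    def __init__(self, image_paths, loader):
        self.image_paths = list(image_paths)  # cheap: strings only
        self.loader = loader                  # e.g. PIL open + transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # The actual decode happens here, inside the DataLoader worker,
        # so RAM only ever holds the samples currently being batched.
        return self.loader(self.image_paths[idx])
```

Wrapped in a DataLoader with pin_memory=True, each freshly loaded batch still gets staged through pinned memory before the host-to-device copy.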


@ptrblck Surprisingly, I am getting better speed-ups with pin_memory=False. The CPU memory usage has also dropped by a factor of 2 or 3 with pin_memory=False. Just curious what the reason might be.


We had a recent issue, which should have been fixed by now. Are you seeing the same behavior using the latest nightly build?

This issue occurs in PyTorch 1.1.0. I have not tried the latest nightly build. My device has CUDA 9.0 (I cannot upgrade CUDA), which I guess is not supported by the latest nightly builds.

The binaries ship with their own CUDA and cuDNN versions, so you don’t have to install CUDA locally (just the NVIDIA drivers).

Oh okay. Thanks for the help!


When setting pin_memory=True, many more processes are forked than with pin_memory=False, even with num_workers=0 in both cases. What is the reason for that? For me, pin_memory=False is much faster in this case.

The tensors come out of my DataLoader as CPU tensors, and before running my network I call to(device=MY_GPU_ID). Is that wrong? How else should I specify which GPU I want the tensor on?

PyTorch is 1.2.0.



We recently had an issue where too many threads were used when pin_memory=True was set, which should be fixed in the latest release.
Could you update your PyTorch installation and check the CPU usage again?

On PyTorch 1.13 I am still getting faster runtimes with pin_memory=False and num_workers=20. I have 4 NVIDIA P100 GPUs on an IBM Power8 system with 256 GB of RAM.

The tensors that come out of the DataLoader are on the CPU (at least that is what print(t.device) says), even if pin_memory=True is set.

What’s the correct way to load these directly onto the GPU?

pin_memory=True uses pinned host memory for faster copies, as described here.
Push your tensors to the device via x = x.to('cuda').
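For example (the device string is an assumption; use e.g. 'cuda:1' to target a specific GPU), with a CPU fallback so the snippet also runs on machines without a GPU:

```python
import torch

# Pick the target device; 'cuda:0' is just an example index.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x = torch.randn(4, 4)                # e.g. a batch coming off the DataLoader (CPU)
x = x.to(device, non_blocking=True)  # non_blocking only pays off if x is pinned
```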


Sorry for bumping up this rather old thread. But how does this work with two and more GPUs?

In my use case, I have two experiments, two different models, running in parallel. I use pin_memory=True in one of my experiments with the cuda:0 device. Then I run one more experiment, but with cuda:1, and it fails with an OOM error because it attempts to allocate memory on the other GPU. Is this expected behavior?


Hi! I stumbled upon this thread today while wondering about the slowdown of my training. It looks as if the same problem still exists. When setting pin_memory=True, it processes roughly half as many iterations as when set to False.

I’m using CUDA 10.2 and PyTorch 1.6.0 on Windows 10 with a K80. My dataset is fairly small at around 300 MB; could that be the reason?

Thanks for the answers. I wanted to clarify one more thing:
I have a setup where I push the model to the GPU, and then push the input tensors to the GPU only during the training like so:

def train(model, dataloader):
    for epoch in range(num_of_epochs):
        for image_batch, label_batch in dataloader:
            image_batch, label_batch = image_batch.to(device), label_batch.to(device)

However, I am not observing any speed-up, not even after the first epoch, which is when I thought pin_memory=True would start to affect the process.
So I have two further questions:

  1. Is pushing to the device during __getitem__ best practice? On SO, I saw a comment stating otherwise. If not, was my example of pushing during the training loop correct?
  2. Do we have to state num_workers in order to get performance help from pin_memory=True?
  1. If you want (and/or have to) load the data lazily, then I would stick with the Dataset approach and load each sample in __getitem__. You could e.g. preload the complete MNIST dataset and might even be able to push it to the GPU before training begins (which would give you a performance benefit during training), but that’s of course not always possible, especially if you are working with larger datasets.

  2. Try to use non_blocking=True in the to() operation.
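A rough sketch of the "preload to GPU" idea from point 1 (random tensors stand in for a small dataset like MNIST; the sizes are purely illustrative):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-ins for a small dataset such as MNIST (values are random here).
data = torch.randn(1000, 784).to(device)
targets = torch.randint(0, 10, (1000,)).to(device)

# The training loop then just indexes GPU-resident tensors, so there is
# no per-batch host-to-device copy (and pin_memory becomes irrelevant).
for i in range(0, data.size(0), 100):
    batch, labels = data[i:i + 100], targets[i:i + 100]
    # ... forward / backward pass ...
```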