If you load your samples in the Dataset on the CPU and want to push them to the GPU during training, you can speed up the host-to-device transfer by enabling pin_memory.
This lets your DataLoader allocate the samples in page-locked memory, which speeds up the transfer.
You can find more information on the NVIDIA blog.
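For context, here is a minimal sketch of what enabling it looks like (the random tensors, batch size, and worker count are just placeholders for a real setup):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder CPU dataset; in practice this would be your own Dataset.
dataset = TensorDataset(torch.randn(1000, 3, 224, 224),
                        torch.randint(0, 10, (1000,)))

# pin_memory=True makes the DataLoader copy each batch into page-locked
# (pinned) host memory, so the later host-to-device copy can be faster
# and can run asynchronously when combined with non_blocking=True.
loader = DataLoader(dataset, batch_size=32, num_workers=2, pin_memory=True)

device = torch.device('cuda')
for images, labels in loader:
    # non_blocking=True lets the copy overlap with computation;
    # this only helps if the source tensors are pinned.
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
```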
Generally, yes, if you are loading your data on the CPU.
It won’t work if your dataset is really small and you already push it to the GPU inside the Dataset, but that’s pretty obvious.
If I store the paths to my images in my Dataset (as I can’t fit my dataset into RAM) and only load the images in __getitem__, is it still beneficial to use pin_memory?
@ptrblck Surprisingly, I am getting better speed-ups with pin_memory=False. The CPU memory usage has also dropped by a factor of 2 or 3 with pin_memory=False. Just curious what the reason might be.
This issue occurs with PyTorch 1.1.0. I have not tried the latest nightly version. My device has CUDA 9.0 (I cannot upgrade CUDA), which I guess is not supported by the latest nightlies.
When setting pin_memory=True, many more processes are forked than with pin_memory=False, even with num_workers=0 in all cases. What is the reason for that? For me, pin_memory=False is much faster in this case.
The tensors come out of my DataLoader as CPU tensors, and before running my network I call to(device=MY_GPU_ID). Is that wrong? How else should I specify which GPU I want the tensor to be on?
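As an illustration of the standard API (not specific to this thread): to() accepts a torch.device object, a device string, or a bare integer index, so any of these target a particular GPU explicitly:

```python
import torch

MY_GPU_ID = 0  # placeholder index for whichever GPU you want to use

# Equivalent ways of naming the target device:
device = torch.device(f'cuda:{MY_GPU_ID}')   # explicit device object
# device = 'cuda:0'                          # a device string also works

batch = torch.randn(32, 3, 224, 224)  # a CPU tensor, e.g. from the DataLoader
batch = batch.to(device)              # copy it onto the chosen GPU
```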
We recently had an issue where too many threads were used when pin_memory=True was set, which should be fixed in the latest release.
Could you update your PyTorch installation and check the CPU usage again?
Sorry for bumping up this rather old thread. But how does this work with two or more GPUs?
In my use case, I have two experiments that run in parallel, two different models. I use pin_memory=True in one of my experiments with the cuda:0 device. Then I run one more experiment, but with cuda:1, and it fails with an OOM error because it attempts to allocate memory on the other GPU. Is this expected behavior?
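For what it’s worth, one common way to isolate two parallel runs like that (not mentioned in this thread, and whether it avoids this particular OOM depends on the PyTorch version) is to restrict each process to a single physical GPU before CUDA is initialized:

```python
import os

# Must be set before the first CUDA call, so put it at the very top of
# the second experiment's script (use '0' in the first experiment).
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

import torch

# Inside this process only one GPU is visible, so 'cuda:0' now maps to
# physical GPU 1, and any implicit CUDA context creation lands there.
device = torch.device('cuda:0')
```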
Hi! I stumbled upon this thread today while wondering about the slowdown of my training. It looks as if the same problem still exists: when setting pin_memory=True, it processes roughly half as many iterations as with it set to False.
I’m using CUDA 10.2 and PyTorch 1.6.0 on Win10 with a K80. My dataset is fairly small, just around 300 MB; could that be the reason?
Thanks for the answers. I wanted to clarify one more thing:
I have a setup where I push the model to the GPU, and then push the input tensors to the GPU only during training, like so:
def train(model, dataloader):
    for epoch in range(num_of_epochs):
        for image_batch, label_batch in dataloader:
            image_batch, label_batch = image_batch.to(device), label_batch.to(device)
            optimizer.zero_grad()
            ...
However, I am not observing any speed-up, not even after the first epoch, which is when I thought pin_memory=True would start to have an effect.
So I have two further questions:
Is pushing to the device during __getitem__ the best practice? On Stack Overflow I saw a comment stating otherwise. If not, then was my example of pushing inside the training loop correct?
Do we have to set num_workers in order to get a performance benefit from pin_memory=True?
If you want to (and/or have to) lazily load the data, then I would stick with the Dataset approach and load each sample in __getitem__. You could e.g. preload the complete MNIST dataset and might even be able to push it to the GPU before the training begins (which would give you a performance benefit during training), but that’s of course not always possible, especially if you are working with larger datasets.
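To illustrate that, here is a minimal sketch of such a lazy, path-based Dataset (the file paths, labels, and PIL/torchvision calls are placeholders for whatever your pipeline actually uses):

```python
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class LazyImageDataset(Dataset):
    """Keeps only file paths in RAM and loads each image on demand."""

    def __init__(self, image_paths, labels):
        self.image_paths = image_paths
        self.labels = labels
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, index):
        # Loading and decoding happen here, on the CPU, typically inside a
        # DataLoader worker; the returned CPU tensors are what pin_memory pins.
        image = self.to_tensor(Image.open(self.image_paths[index]).convert('RGB'))
        return image, self.labels[index]

paths, labels = ['img_0.jpg', 'img_1.jpg'], [0, 1]  # placeholders
loader = DataLoader(LazyImageDataset(paths, labels),
                    batch_size=2, num_workers=2, pin_memory=True)
```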
Try to use non_blocking=True in the to() operation.
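Applied to the loop above, that would look roughly like this (same placeholder names as in the earlier snippet):

```python
# With pin_memory=True on the DataLoader, non_blocking=True lets the
# host-to-device copy overlap with subsequent GPU work instead of blocking.
image_batch = image_batch.to(device, non_blocking=True)
label_batch = label_batch.to(device, non_blocking=True)
```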