If you load your samples in the Dataset on the CPU and want to push them to the GPU during training, you can speed up the host-to-device transfer by enabling pin_memory.
This lets your DataLoader allocate the samples in page-locked memory, which speeds up the transfer.
You can find more information on the NVIDIA blog.
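For context, here is a minimal sketch of what enabling it looks like (the random tensors, batch size, and worker count are just placeholders for a real setup):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder CPU dataset; in practice this would be your own Dataset.
dataset = TensorDataset(torch.randn(1000, 3, 224, 224),
                        torch.randint(0, 10, (1000,)))

# pin_memory=True makes the DataLoader copy each batch into page-locked
# (pinned) host memory, so the later host-to-device copy can be faster
# and can run asynchronously when combined with non_blocking=True.
loader = DataLoader(dataset, batch_size=32, num_workers=2, pin_memory=True)

device = torch.device('cuda')
for images, labels in loader:
    # non_blocking=True lets the copy overlap with computation;
    # this only helps if the source tensors are pinned.
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
```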
Generally, yes, if you are loading your data on the CPU.
It won’t work if your dataset is really small and you already push it to the GPU inside the Dataset, but that’s pretty obvious.
If I store the paths to my images in my Dataset (as I can’t fit my dataset into RAM) and only load the images in __getitem__, is it still beneficial to use pin_memory?
@ptrblck Surprisingly, I am getting better speed-ups with pin_memory=False. The CPU memory usage has also dropped by a factor of 2 or 3 with pin_memory=False. Just curious what the reason might be.
This issue occurs with PyTorch 1.1.0. I have not tried the latest nightly version. My device has CUDA 9.0 (I cannot upgrade CUDA), which I guess is not supported by the latest nightlies.
When setting pin_memory=True, many more processes are forked than with pin_memory=False, even with num_workers=0 in all cases. What is the reason for that? For me, pin_memory=False is much faster in this case.
The tensors come out of my DataLoader as CPU tensors, and before running my network I call to(device=MY_GPU_ID). Is that wrong? How else should I specify which GPU I want the tensor to be on?
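As an illustration of the standard API (not specific to this thread): to() accepts a torch.device object, a device string, or a bare integer index, so any of these target a particular GPU explicitly:

```python
import torch

MY_GPU_ID = 0  # placeholder index for whichever GPU you want to use

# Equivalent ways of naming the target device:
device = torch.device(f'cuda:{MY_GPU_ID}')   # explicit device object
# device = 'cuda:0'                          # a device string also works

batch = torch.randn(32, 3, 224, 224)  # a CPU tensor, e.g. from the DataLoader
batch = batch.to(device)              # copy it onto the chosen GPU
```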
We recently had an issue where too many threads were used when pin_memory=True was set, which should be fixed in the latest release.
Could you update your PyTorch installation and check the CPU usage again?
Sorry for bumping up this rather old thread. But how does this work with two or more GPUs?
In my use case, I have two experiments that run in parallel, two different models. I use pin_memory=True in one of my experiments with the cuda:0 device. Then I run one more experiment, but with cuda:1, and it fails with an OOM error because it attempts to allocate memory on the other GPU. Is this expected behavior?
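For what it’s worth, one common way to isolate two parallel runs like that (not mentioned in this thread, and whether it avoids this particular OOM depends on the PyTorch version) is to restrict each process to a single physical GPU before CUDA is initialized:

```python
import os

# Must be set before the first CUDA call, so put it at the very top of
# the second experiment's script (use '0' in the first experiment).
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

import torch

# Inside this process only one GPU is visible, so 'cuda:0' now maps to
# physical GPU 1, and any implicit CUDA context creation lands there.
device = torch.device('cuda:0')
```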
Hi! I stumbled upon this thread today while wondering about the slowdown of my training. It looks as if the same problem still exists: when setting pin_memory=True, it processes roughly half as many iterations as with it set to False.
I’m using CUDA 10.2 and PyTorch 1.6.0 on Win10 with a K80. My dataset is fairly small, just around 300 MB; could that be the reason?
Thanks for the answers. I wanted to clarify one more thing:
I have a setup where I push the model to the GPU, and then push the input tensors to the GPU only during training, like so:
def train(model, dataloader):
    for epoch in range(num_of_epochs):
        for image_batch, label_batch in dataloader:
            image_batch, label_batch = image_batch.to(device), label_batch.to(device)
            optimizer.zero_grad()
            ...
However, I am not observing any speed-up, not even after the first epoch, which is when I thought pin_memory=True would start to have an effect.
So I have two further questions:
Is pushing to the device during __getitem__ the best practice? On Stack Overflow I saw a comment stating otherwise. If not, then was my example of pushing inside the training loop correct?
Do we have to set num_workers in order to get a performance benefit from pin_memory=True?
If you want to (and/or have to) lazily load the data, then I would stick with the Dataset approach and load each sample in __getitem__. You could e.g. preload the complete MNIST dataset and might even be able to push it to the GPU before the training begins (which would give you a performance benefit during training), but that’s of course not always possible, especially if you are working with larger datasets.
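To illustrate that, here is a minimal sketch of such a lazy, path-based Dataset (the file paths, labels, and PIL/torchvision calls are placeholders for whatever your pipeline actually uses):

```python
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class LazyImageDataset(Dataset):
    """Keeps only file paths in RAM and loads each image on demand."""

    def __init__(self, image_paths, labels):
        self.image_paths = image_paths
        self.labels = labels
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, index):
        # Loading and decoding happen here, on the CPU, typically inside a
        # DataLoader worker; the returned CPU tensors are what pin_memory pins.
        image = self.to_tensor(Image.open(self.image_paths[index]).convert('RGB'))
        return image, self.labels[index]

paths, labels = ['img_0.jpg', 'img_1.jpg'], [0, 1]  # placeholders
loader = DataLoader(LazyImageDataset(paths, labels),
                    batch_size=2, num_workers=2, pin_memory=True)
```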
Try to use non_blocking=True in the to() operation.
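Applied to the loop above, that would look roughly like this (same placeholder names as in the earlier snippet):

```python
# With pin_memory=True on the DataLoader, non_blocking=True lets the
# host-to-device copy overlap with subsequent GPU work instead of blocking.
image_batch = image_batch.to(device, non_blocking=True)
label_batch = label_batch.to(device, non_blocking=True)
```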