From the imagenet example:
```python
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
    num_workers=args.workers, pin_memory=True, sampler=train_sampler)
val_loader = torch.utils.data.DataLoader(
    ...)  # truncated in the quote; it likewise passes pin_memory=True
```
Both data loaders set `pin_memory` to `True`, while the default value is `False`. Are there good reasons to do this?
If you load your samples in the `Dataset` on the CPU and would like to push them to the GPU during training, you can speed up the host-to-device transfer by enabling `pin_memory=True`. This lets your `DataLoader` allocate the samples in page-locked memory, which speeds up the transfer. You can find more information on the NVIDIA blog.
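As a quick illustration, here is a minimal sketch with toy tensors instead of ImageNet (all sizes are made up), which only requests pinned memory when a CUDA device is actually present:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the ImageNet example above (hypothetical sizes).
dataset = TensorDataset(torch.randn(32, 3, 8, 8), torch.randint(0, 10, (32,)))

# Only request pinned memory when a CUDA device is actually available.
use_cuda = torch.cuda.is_available()
loader = DataLoader(dataset, batch_size=8, shuffle=True, pin_memory=use_cuda)

for images, labels in loader:
    # With pin_memory=True the collated batches live in page-locked host memory.
    print(images.is_pinned())  # True on a CUDA machine, False otherwise
    break
```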
Thanks. This means `pin_memory` should always be `True` if you are training on an NVIDIA GPU, right?
Generally, yes, if you are loading your data on the CPU.
It won’t work if your dataset is really small and you already push it to the GPU in the `Dataset`, but that’s pretty obvious.
If I store the paths to my images in my dataset (as I can’t fit my dataset into RAM) and only load the images in `__getitem__`, is it still beneficial to use pin memory?
Yes. As far as I know, that’s the exact use case for speeding up the transfer between the host and device.
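A minimal sketch of that path-based pattern; `load_image` is a hypothetical placeholder for the real decoding (e.g. PIL plus a transform), so the example runs without any image files:

```python
import torch
from torch.utils.data import Dataset

class ImagePathDataset(Dataset):
    """Stores only file paths; each image is loaded lazily in __getitem__."""
    def __init__(self, paths, load_image):
        self.paths = paths
        self.load_image = load_image  # placeholder for real image decoding

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # The tensor is created on the CPU here; with pin_memory=True the
        # DataLoader then copies the collated batch into page-locked memory.
        return self.load_image(self.paths[idx])

# Fake loader so the sketch is self-contained:
ds = ImagePathDataset(["a.png", "b.png"], lambda p: torch.zeros(3, 4, 4))
print(len(ds), ds[0].shape)
```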
@ptrblck Surprisingly, I am getting better speed-ups with `pin_memory=False`. The CPU memory usage has also dropped two- or three-fold with `pin_memory=False`. Just curious what the reason might be.
We had a recent issue, which should have been fixed by now. Are you seeing the same behavior using the latest nightly build?
This issue occurs in PyTorch 1.1.0. I have not tried the latest nightly version. My machine has CUDA 9.0 (I cannot upgrade CUDA), which I guess is not supported by the latest nightly builds.
The binaries ship with their own CUDA and cuDNN versions, so you don’t have to install CUDA locally (just the NVIDIA drivers).
Oh okay. Thanks for the help
When setting `pin_memory=True`, many more processes are forked than with `pin_memory=False`, even with `num_workers=0` in all cases. What is the reason for that? For me, `pin_memory=False` is much faster in this case.
The tensors come out of my dataloader as CPU tensors, and before running my network I call `to(device=MY_GPU_ID)`. Is that wrong? How else should I specify which GPU I want the tensor on?
PyTorch is 1.2.0.
We recently had an issue where too many threads were used when `pin_memory=True` was set; it should be fixed in the latest release. Could you update your PyTorch installation and check the CPU usage again?
On PyTorch 1.13 I am still getting faster runtimes with `pin_memory=False` and `num_workers=20`. I have four NVIDIA P100 GPUs on a 256 GB IBM Power8 machine.
The tensors that come out of the `DataLoader` are on the CPU (at least that is what `print(t.device)` says), even if `pin_memory=True` is set. What’s the correct way to load these directly onto the GPU?
`pin_memory=True` uses pinned host memory for faster copies, as described here. Push your tensors to the device via `x = x.to('cuda')`.
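A small sketch of that transfer; `non_blocking=True` is optional and only helps overlap the copy with compute when the source tensor is pinned, which is what `pin_memory=True` in the `DataLoader` provides:

```python
import torch

t = torch.randn(2, 3)  # a batch as it comes out of the DataLoader, on the CPU
print(t.device)        # cpu

if torch.cuda.is_available():
    # Asynchronous copy from pinned host memory to the default CUDA device.
    t = t.to("cuda", non_blocking=True)
    print(t.device)    # cuda:0
```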
Sorry for bumping this rather old thread, but how does this work with two or more GPUs? In my use case I have two experiments running in parallel, two different models. I use `pin_memory=True` in one of my experiments on the `cuda:0` device. Then I run one more experiment on `cuda:1`, and it fails with an OOM error because it attempts to allocate memory on the other GPU. Is this expected behavior?
Hi! I stumbled upon this thread today while wondering about my training slowing down. It looks as if the same problem still exists: with `pin_memory=True` it processes roughly half as many iterations as with `False`. I’m using CUDA 10.2 and PyTorch 1.6.0 on Win10 with a K80. My dataset is fairly small, just around 300 MB; could that be the reason?
Thanks for the answers. I wanted to clarify one more thing:
I have a setup where I push the model to the GPU and then push the input tensors to the GPU only during training, like so:
```python
def train(model, dataloader):
    for epoch in range(num_of_epochs):
        for image_batch, label_batch in dataloader:
            image_batch, label_batch = image_batch.to(device), label_batch.to(device)
```
However, I am not observing any speed-up, not even after the first epoch, which is where I thought `pin_memory=True` would affect the process.
So I have two further questions:
- Is pushing to the device during `__getitem__` the best practice? On SO, I saw a comment stating otherwise. If not, was my example, pushing during the training loop, correct?
- Do we have to set `num_workers` in order to get a performance benefit from `pin_memory=True`?
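For reference, a minimal, hypothetical training-loop sketch that combines worker processes, pinned memory, and per-batch transfers in the loop (toy data and a toy model; the specific values are made up, not a claim about required settings):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy data; in practice this would be a path-based Dataset loading on the CPU.
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
loader = DataLoader(dataset, batch_size=16,
                    num_workers=2,                       # workers prefetch batches in parallel
                    pin_memory=(device.type == "cuda"))  # page-locked staging buffers

model = torch.nn.Linear(10, 2).to(device)
for x, y in loader:
    # Keep the Dataset on the CPU; move batches here, not in __getitem__.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    out = model(x)
```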