Using pinned memory causes out-of-memory error even though batch size is set to low values

jdhao · November 27, 2018, 2:11am

I am fine-tuning a ResNet on a custom dataset and have run into a strange issue. When I try to train the model, I see the following error message:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCCachingHostAllocator.cpp line=271 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 282, in <module>
    main()
  File "train.py", line 141, in main
    epoch)
  File "train.py", line 179, in train
    for batch_idx, (data, target) in enumerate(train_loader):
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 316, in __next__
    batch = pin_memory_batch(batch)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 200, in pin_memory_batch
    return [pin_memory_batch(sample) for sample in batch]
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 200, in <listcomp>
    return [pin_memory_batch(sample) for sample in batch]
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 194, in pin_memory_batch
    return batch.pin_memory()
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCCachingHostAllocator.cpp:271

I tried reducing the batch size from 64 all the way down to 1, but the error still persists. Then I changed the training code to pass pin_memory=False to the dataloader. Now I can train the model with batch size 64 without any error.

I have always used pin_memory=True in the dataloader and I have never experienced this kind of error before. Why is this failing now?

The pytorch version I am using is 0.4.1.
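For readers who want to reproduce the workaround described above, here is a minimal sketch of a loader with pinning turned off. It assumes a recent PyTorch release rather than 0.4.1, and the random TensorDataset and the PIN_MEMORY name are stand-ins for illustration, not the original train.py.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data (hypothetical): 256 random 3x32x32 "images" with 10 classes.
dataset = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))

# Set to True to reproduce the original setup; False avoids page-locked host buffers.
PIN_MEMORY = False
train_loader = DataLoader(dataset, batch_size=64, shuffle=True, pin_memory=PIN_MEMORY)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

num_batches = 0
last_shape = None
for batch_idx, (data, target) in enumerate(train_loader):
    # non_blocking only overlaps the copy with compute when the source is pinned.
    data = data.to(device, non_blocking=PIN_MEMORY)
    target = target.to(device, non_blocking=PIN_MEMORY)
    num_batches += 1
    last_shape = data.shape

print(num_batches, tuple(last_shape))
```

With pin_memory=False the batches stay in ordinary pageable host memory, so the host-to-device copy may be somewhat slower, but the pinned-memory allocator from the traceback is never used.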

jdhao · November 28, 2018, 11:18am

Can anyone comment on this issue? Have you ever encountered such an issue?

ptrblck · November 28, 2018, 8:07pm

Which OS are you using?
I’ve tried to search for this issue and there seem to be some page-locked memory limitations on older Windows versions.
Also, scattered memory allocations in RAM might be another issue, as page-locked memory needs to be contiguous as far as I know.
As a last resort, could you restart your machine and try again with pin_memory=True?

jdhao · November 29, 2018, 2:17am

I am using PyTorch on a CentOS 7 server. I had previously been using pin_memory=True in my code on the same machine without any error.

Restarting the server is currently not possible, so I will settle for pin_memory=False for now.

atiorh13 · September 15, 2020, 8:53pm

I was hitting a very similar error. It turns out my previous processes had not been properly killed by Ctrl+C, but they weren’t showing up in nvidia-smi either. Force-killing those processes fixed the issue for me.
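One way to look for such leftover processes when nvidia-smi lists nothing is to scan /proc for processes that still hold an open handle on a /dev/nvidia* device file, similar in spirit to running fuser on those device files. The sketch below is Linux-only, the function name is made up for illustration, and without root it only sees your own processes. An open handle shows that a process still has the driver open, not that it is definitely the one holding pinned memory.

```python
import os


def find_processes_using_nvidia(proc_root="/proc"):
    """Return {pid: command line} for processes with an open /dev/nvidia* file."""
    holders = {}
    if not os.path.isdir(proc_root):
        return holders  # no Linux-style /proc here; nothing to scan
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are process IDs
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or it belongs to another user
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target.startswith("/dev/nvidia"):
                try:
                    with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                        raw = f.read()
                    cmd = raw.replace(b"\0", b" ").decode(errors="replace").strip()
                except OSError:
                    cmd = ""
                holders[int(entry)] = cmd
                break  # one matching handle is enough for this process
    return holders


if __name__ == "__main__":
    for pid, cmd in sorted(find_processes_using_nvidia().items()):
        print(pid, cmd)
```

Any stale training script that shows up in the output can then be terminated with kill, escalating to kill -9 only if it does not exit.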

NagabhushanSN95 · January 10, 2022, 1:01am

Did anyone find a solution to this? I’m facing the exact same issue (there were no residual unkilled processes). I would like to know whether restarting helped anyone before I request a machine restart.

desaixie · April 25, 2022, 2:53am

I had a similar problem, although with a small batch size there was no error. The out-of-memory error caused by pin_memory happened while I was using less than half of my 64 GB of memory. There isn’t much discussion of this issue, but after trying every fix I could find, I suspect the cause is WSL’s limit on page-locked memory size. My solution was to give up on WSL and switch to dual-boot Ubuntu.