Using pinned memory causes an out-of-memory error even when the batch size is set to low values

I am fine-tuning ResNet on a custom dataset and have run into a strange issue. When I try to train the model, I see the following error message:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCCachingHostAllocator.cpp line=271 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 282, in <module>
    main()
  File "train.py", line 141, in main
    epoch)
  File "train.py", line 179, in train
    for batch_idx, (data, target) in enumerate(train_loader):
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 316, in __next__
    batch = pin_memory_batch(batch)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 200, in pin_memory_batch
    return [pin_memory_batch(sample) for sample in batch]
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 200, in <listcomp>
    return [pin_memory_batch(sample) for sample in batch]
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 194, in pin_memory_batch
    return batch.pin_memory()
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCCachingHostAllocator.cpp:271 

I tried reducing the batch size from 64 all the way down to 1, but the error persists. I then changed the training code to pass pin_memory=False to the DataLoader. Now I can train the model with batch size 64 without any error.

I have always used pin_memory=True in the dataloader and I have never experienced this kind of error before. Why is this failing now?

The PyTorch version I am using is 0.4.1.

Can anyone comment on this issue? Have you ever encountered such an issue?

Which OS are you using?
I searched for this issue, and there seem to be some limitations on page-locked memory in older Windows versions.
Scattered memory allocations in RAM might be another cause, as page-locked memory needs to be contiguous as far as I know.
In the worst case, could you restart your machine and try again with pin_memory=True?

I am using PyTorch on a CentOS 7 server. Previously, I used pin_memory=True in my code on the same machine without any error.

Restarting the server is currently not possible, so I will settle for pin_memory=False for now.
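
For readers who want to keep pinned memory where it works and fall back to pageable memory where it does not, here is a minimal sketch. The helper name can_pin_memory, the probe size, and the random stand-in dataset are assumptions for illustration and are not taken from the original train.py. A small probe allocation may succeed even when larger batches later fail to pin, so treat this as a rough check rather than a guarantee.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def can_pin_memory():
    """Hypothetical helper: return True if a small page-locked allocation succeeds."""
    if not torch.cuda.is_available():
        # Pinned memory only helps host-to-GPU copies, so skip it without CUDA.
        return False
    try:
        torch.empty(1024).pin_memory()
        return True
    except RuntimeError:
        # Same error class as in the traceback above (cuda runtime error: out of memory).
        return False


# Stand-in dataset (an assumption); replace with the real custom dataset.
dataset = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))

use_pinned = can_pin_memory()
train_loader = DataLoader(dataset, batch_size=64, shuffle=True, pin_memory=use_pinned)

for batch_idx, (data, target) in enumerate(train_loader):
    # A training step would go here; one batch is enough for this sketch.
    break
```

With pin_memory=False the loader returns ordinary pageable tensors, so training still works; only the host-to-GPU copy may be somewhat slower.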

I was hitting a very similar error. It turned out that my previous processes had not been properly killed via Ctrl+C, but they weren’t showing up in nvidia-smi either. Force-killing those previous processes fixed the issue for me.
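
One way to look for such leftover processes, sketched under assumptions: the function name find_processes and the search pattern "train.py" are hypothetical, and the approach relies on the Linux /proc filesystem, so it returns nothing on other systems. It only lists candidates; it does not kill anything.

```python
import os


def find_processes(pattern):
    """Return (pid, command line) pairs whose command line contains `pattern`.

    Linux-only sketch: reads /proc/<pid>/cmdline. Returns [] if /proc is absent.
    """
    matches = []
    if not os.path.isdir("/proc"):
        return matches
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join("/proc", entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited meanwhile or is not readable
        # Arguments are NUL-separated; join them with spaces for display.
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        if pattern in cmdline:
            matches.append((int(entry), cmdline))
    return matches


if __name__ == "__main__":
    # "train.py" is an assumed script name; skip the current process itself.
    for pid, cmdline in find_processes("train.py"):
        if pid != os.getpid():
            print(pid, cmdline)
```

Any stale PID found this way can then be force-killed, for example with os.kill(pid, signal.SIGKILL), after checking that it really is an old training run.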