I am fine-tuning ResNet on a custom dataset and have run into a strange issue. When I try to train the model, I see the following error:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCCachingHostAllocator.cpp line=271 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 282, in <module>
    main()
  File "train.py", line 141, in main
    epoch)
  File "train.py", line 179, in train
    for batch_idx, (data, target) in enumerate(train_loader):
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 316, in __next__
    batch = pin_memory_batch(batch)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 200, in pin_memory_batch
    return [pin_memory_batch(sample) for sample in batch]
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 200, in <listcomp>
    return [pin_memory_batch(sample) for sample in batch]
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 194, in pin_memory_batch
    return batch.pin_memory()
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCCachingHostAllocator.cpp:271
I tried reducing the batch size from 64 all the way down to 1, but the error still persisted. Then I changed the training code to pass pin_memory=False to the DataLoader, and now I can train the model with batch size 64 without any error.
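For reference, this is roughly what the DataLoader setup looks like now; the dataset path, transforms, and num_workers below are placeholders rather than my exact code, and the only change from before is pin_memory:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Placeholder dataset: the real code uses a custom ImageFolder-style dataset.
train_dataset = datasets.ImageFolder(
    "data/train",
    transform=transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ]),
)

train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,
    pin_memory=False,  # with pin_memory=True the error above appears, even at batch_size=1
)
```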
I have always used pin_memory=True in the DataLoader and have never seen this kind of error before. Why is it failing now?
The PyTorch version I am using is 0.4.1.