Out of memory during torch.save


I was training a model that takes around 16 GB of GPU memory during training and ~5 GB during validation, but when I try to save a checkpoint (after validation) the GPU runs out of memory. These numbers are for a batch size of 64; if I drop the batch size down to 32, the memory required for training goes down to 9 GB, but it still runs out of memory while trying to save the model.

I am saving only the state_dict, using CUDA 8.0 with PyTorch 2.01, and running this on a 16 GB GPU.

Below is the stack trace for the error:

THCudaCheck FAIL file=/pytorch/torch/csrc/generic/serialization.cpp line=38 error=2 : out of memory
Testing Results: Prec@1 52.974 Prec@5 80.941 Loss 1.75623
Traceback (most recent call last):
  File "main.py", line 264, in <module>
  File "main.py", line 109, in main
    }, is_best)
  File "main.py", line 213, in save_checkpoint
    torch.save(state, filename)
  File "/export/home/utsav/.local/lib/python2.7/site-packages/torch/serialization.py", line 120, in save
    return _save(obj, f, pickle_module, pickle_protocol)
  File "/export/home/utsav/.local/lib/python2.7/site-packages/torch/serialization.py", line 192, in _save
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/csrc/generic/serialization.cpp:38
THCudaCheckWarn FAIL file=/pytorch/torch/lib/THC/THCStream.cpp line=50 error=2 : out of memory

Shifting the model to the CPU before saving and moving it back to the GPU afterwards solved it.


How did you shift the model to the CPU?

Either model.cpu() or model.to('cpu') should work.
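In context, the workaround looks roughly like this. A minimal sketch, assuming a toy model; the `save_checkpoint` helper and filename are placeholders, not the original poster's code:

```python
import torch
import torch.nn as nn

# Toy stand-in for the real network.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = nn.Linear(10, 2).to(device)

def save_checkpoint(model, filename):
    # Move parameters to host memory first so torch.save does not
    # need to allocate scratch buffers on the already-full GPU.
    model.cpu()
    torch.save(model.state_dict(), filename)
    # Move the model back so training can continue on the GPU.
    model.to(device)

save_checkpoint(model, 'checkpoint.pth')
```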


Hi, I have run into the same problem. The suggested fix works for single-GPU training. However, for distributed training, the model freezes after model.cuda(). Do you have any suggestions for distributed training?

I’m not sure where you are using model.cuda(), since the original issue was seen during saving and the workaround was to push the model to the CPU first.

Thanks for your response. I save the model every epoch and need to call model.cuda() at the beginning of the next epoch, after pushing the model to the CPU.

This would indicate that other/additional tensors are stored on the GPU, so you cannot push the model back to it. You should delete the other objects on the GPU (e.g. the old model, in case it's still there).
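For example, a stale copy of the model can be freed by dropping its last Python reference and releasing the cached blocks. A sketch, with `old_model` as a hypothetical placeholder for whatever object is still holding GPU memory:

```python
import torch

if torch.cuda.is_available():
    # Hypothetical stale copy still holding GPU memory.
    old_model = torch.nn.Linear(10, 2).cuda()
    del old_model               # drop the last Python reference
    torch.cuda.empty_cache()    # return cached blocks to the CUDA driver
```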

Models are trained using DistributedDataParallel, and each process holds its own copy of the model. Only the model in the master process is pushed to the CPU and then back to CUDA. Is it possible that training freezes due to a synchronization problem? How can I delete the other objects on the GPU, create a new model, and restart distributed training?
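One alternative that sidesteps moving the live model at all is to copy the state_dict tensors to the CPU and save that copy: the GPU replica never changes device, so the DDP ranks should not fall out of sync. A sketch, using a plain module as a stand-in for the wrapped model (`model.module` in DDP):

```python
import torch
import torch.nn as nn

# Stand-in for the DDP-wrapped network (i.e. model.module).
model = nn.Linear(10, 2)

# Copy each parameter/buffer to host memory without touching the
# live model, then save the CPU copy from the master process only.
cpu_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}
torch.save(cpu_state, 'checkpoint.pth')
```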