Out of memory during torch.save


I was training a model that takes around 16 GB of GPU memory during training and ~5 GB during validation, but when I try to save a checkpoint (after validation) the GPU runs out of memory. These numbers are for a batch size of 64; if I drop the batch size down to 32, the memory required for training goes down to 9 GB, but it still runs out of memory while trying to save the model.

I am saving only the state_dict, using CUDA 8.0 with PyTorch 2.01, and running this on a 16 GB GPU.

Below is the stack trace for the error:

THCudaCheck FAIL file=/pytorch/torch/csrc/generic/serialization.cpp line=38 error=2 : out of memory
Testing Results: Prec@1 52.974 Prec@5 80.941 Loss 1.75623
Traceback (most recent call last):
  File "main.py", line 264, in <module>
  File "main.py", line 109, in main
    }, is_best)
  File "main.py", line 213, in save_checkpoint
    torch.save(state, filename)
  File "/export/home/utsav/.local/lib/python2.7/site-packages/torch/serialization.py", line 120, in save
    return _save(obj, f, pickle_module, pickle_protocol)
  File "/export/home/utsav/.local/lib/python2.7/site-packages/torch/serialization.py", line 192, in _save
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/csrc/generic/serialization.cpp:38
THCudaCheckWarn FAIL file=/pytorch/torch/lib/THC/THCStream.cpp line=50 error=2 : out of memory

Shifting the model to the CPU before saving and moving it back to the GPU afterwards solved it.


How did you shift the model to the CPU?

Either model.cpu() or model.to('cpu') should work.
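In context, the workaround looks roughly like this. A minimal sketch, assuming a toy model; the `save_checkpoint` helper and filename are placeholders, not the original poster's code:

```python
import torch
import torch.nn as nn

# Toy stand-in for the real network.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = nn.Linear(10, 2).to(device)

def save_checkpoint(model, filename):
    # Move parameters to host memory first so torch.save does not
    # need to allocate scratch buffers on the already-full GPU.
    model.cpu()
    torch.save(model.state_dict(), filename)
    # Move the model back so training can continue on the GPU.
    model.to(device)

save_checkpoint(model, 'checkpoint.pth')
```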


Hi, I have run into the same problem. The suggested fix works for single-GPU training. However, for distributed training, the model freezes after model.cuda(). Do you have any suggestions for distributed training?

I’m not sure where you are using model.cuda(), since the original issue was seen during saving and the workaround was to push the model to the CPU first.

Thanks for your response. I save the model every epoch and need to call model.cuda() at the beginning of the next epoch, after pushing the model to the CPU.

This would indicate that other/additional tensors are stored on the GPU, so you cannot push the model back to it. You should delete the other objects on the GPU (e.g. the old model, in case it's still there).
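For example, a stale copy of the model can be freed by dropping its last Python reference and releasing the cached blocks. A sketch, with `old_model` as a hypothetical placeholder for whatever object is still holding GPU memory:

```python
import torch

if torch.cuda.is_available():
    # Hypothetical stale copy still holding GPU memory.
    old_model = torch.nn.Linear(10, 2).cuda()
    del old_model               # drop the last Python reference
    torch.cuda.empty_cache()    # return cached blocks to the CUDA driver
```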

Models are trained using DistributedDataParallel, and each process holds its own copy of the model. Only the model in the master process is pushed to the CPU and then back to CUDA. Is it possible that training freezes due to a synchronization problem? How can I delete the other objects on the GPU, create a new model, and restart distributed training?
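One alternative that sidesteps moving the live model at all is to copy the state_dict tensors to the CPU and save that copy: the GPU replica never changes device, so the DDP ranks should not fall out of sync. A sketch, using a plain module as a stand-in for the wrapped model (`model.module` in DDP):

```python
import torch
import torch.nn as nn

# Stand-in for the DDP-wrapped network (i.e. model.module).
model = nn.Linear(10, 2)

# Copy each parameter/buffer to host memory without touching the
# live model, then save the CPU copy from the master process only.
cpu_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}
torch.save(cpu_state, 'checkpoint.pth')
```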