Out of memory during torch.save


I was training a model that takes around 16 GB of GPU memory during training and ~5 GB during validation, but when I try to save a checkpoint (after validation) the GPU runs out of memory. These numbers are for a batch size of 64; if I drop the batch size down to 32, the memory required for training goes down to 9 GB, but it still runs out of memory while trying to save the model.

I am saving only the state_dict, using CUDA 8.0 with PyTorch 2.01, and running this on a 16 GB GPU.

Below is the stack trace for the error:

THCudaCheck FAIL file=/pytorch/torch/csrc/generic/serialization.cpp line=38 error=2 : out of memory
Testing Results: Prec@1 52.974 Prec@5 80.941 Loss 1.75623
Traceback (most recent call last):
  File "main.py", line 264, in <module>
  File "main.py", line 109, in main
    }, is_best)
  File "main.py", line 213, in save_checkpoint
    torch.save(state, filename)
  File "/export/home/utsav/.local/lib/python2.7/site-packages/torch/serialization.py", line 120, in save
    return _save(obj, f, pickle_module, pickle_protocol)
  File "/export/home/utsav/.local/lib/python2.7/site-packages/torch/serialization.py", line 192, in _save
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/csrc/generic/serialization.cpp:38
THCudaCheckWarn FAIL file=/pytorch/torch/lib/THC/THCStream.cpp line=50 error=2 : out of memory

Shifting the model to the CPU before saving and moving it back to the GPU afterwards solved it.


How did you shift the model to the CPU?

Either model.cpu() or model.to('cpu') should work.
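In context, the workaround looks roughly like this. A minimal sketch, assuming a toy model; the `save_checkpoint` helper and filename are placeholders, not the original poster's code:

```python
import torch
import torch.nn as nn

# Toy stand-in for the real network.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = nn.Linear(10, 2).to(device)

def save_checkpoint(model, filename):
    # Move parameters to host memory first so torch.save does not
    # need to allocate scratch buffers on the already-full GPU.
    model.cpu()
    torch.save(model.state_dict(), filename)
    # Move the model back so training can continue on the GPU.
    model.to(device)

save_checkpoint(model, 'checkpoint.pth')
```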


Hi, I have run into the same problem. The suggested fix works for single-GPU training. However, for distributed training, the model freezes after model.cuda(). Do you have any suggestions for distributed training?

I’m not sure where you are using model.cuda(), since the original issue was seen during saving and the workaround was to push the model to the CPU first.

Thanks for your response. I save the model every epoch and need to call model.cuda() at the beginning of the next epoch, after pushing the model to the CPU.

This would indicate that other/additional tensors are stored on the GPU, so you cannot push the model back to it. You should delete the other objects on the GPU (e.g. the old model, in case it's still there).
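For example, a stale copy of the model can be freed by dropping its last Python reference and releasing the cached blocks. A sketch, with `old_model` as a hypothetical placeholder for whatever object is still holding GPU memory:

```python
import torch

if torch.cuda.is_available():
    # Hypothetical stale copy still holding GPU memory.
    old_model = torch.nn.Linear(10, 2).cuda()
    del old_model               # drop the last Python reference
    torch.cuda.empty_cache()    # return cached blocks to the CUDA driver
```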

Models are trained using DistributedDataParallel, and each process holds its own copy of the model. Only the model in the master process is pushed to the CPU and then back to CUDA. Is it possible that training freezes due to a synchronization problem? How can I delete the other objects on the GPU, create a new model, and restart distributed training?
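One alternative that sidesteps moving the live model at all is to copy the state_dict tensors to the CPU and save that copy: the GPU replica never changes device, so the DDP ranks should not fall out of sync. A sketch, using a plain module as a stand-in for the wrapped model (`model.module` in DDP):

```python
import torch
import torch.nn as nn

# Stand-in for the DDP-wrapped network (i.e. model.module).
model = nn.Linear(10, 2)

# Copy each parameter/buffer to host memory without touching the
# live model, then save the CPU copy from the master process only.
cpu_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}
torch.save(cpu_state, 'checkpoint.pth')
```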