I am training a model that takes around 16 GB of GPU memory during training and ~5 GB during validation, but when I try to save a checkpoint (after validation) the GPU runs out of memory. These numbers are for a batch size of 64; if I drop the batch size to 32, the memory required for training goes down to 9 GB, but it still runs out of memory while trying to save the model.
I am saving only the state_dict, using CUDA 8.0 with PyTorch 0.2.0.1, and running this on a 16 GB GPU.
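For context, the checkpoint is written with a `save_checkpoint` helper that calls `torch.save(state, filename)`, as in the PyTorch ImageNet example. Below is a minimal, self-contained sketch of that pattern; the dict keys and the stand-in model are assumptions for illustration, but the `save_checkpoint(..., is_best)` and `torch.save(state, filename)` calls match the traceback further down:

```python
import shutil
import torch
import torch.nn as nn

def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'):
    # state['state_dict'] still holds CUDA tensors at this point, so
    # torch.save serializes GPU storages directly; this is the call the
    # traceback points at when the OOM is raised.
    torch.save(state, filename)
    if is_best:
        shutil.copyfile(filename, 'model_best.pth.tar')

# Minimal stand-in for the real model; the actual architecture is not shown.
model = nn.Linear(10, 10)
if torch.cuda.is_available():
    model = model.cuda()

save_checkpoint({
    'epoch': 1,                        # assumed bookkeeping field
    'state_dict': model.state_dict(),  # CUDA tensors when the model is on GPU
    'best_prec1': 52.974,              # Prec@1 from the log below
}, is_best=True)
```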
Below is the stack trace for the error:

THCudaCheck FAIL file=/pytorch/torch/csrc/generic/serialization.cpp line=38 error=2 : out of memory
Testing Results: Prec@1 52.974 Prec@5 80.941 Loss 1.75623
Traceback (most recent call last):
  File "main.py", line 264, in <module>
    main()
  File "main.py", line 109, in main
    }, is_best)
  File "main.py", line 213, in save_checkpoint
    torch.save(state, filename)
  File "/export/home/utsav/.local/lib/python2.7/site-packages/torch/serialization.py", line 120, in save
    return _save(obj, f, pickle_module, pickle_protocol)
  File "/export/home/utsav/.local/lib/python2.7/site-packages/torch/serialization.py", line 192, in _save
    serialized_storages[key]._write_file(f)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/csrc/generic/serialization.cpp:38
THCudaCheckWarn FAIL file=/pytorch/torch/lib/THC/THCStream.cpp line=50 error=2 : out of memory