I was training a model for 3D semantic segmentation, which imposes very heavy memory pressure.
After fitting the data and the model into GPU memory, everything went well until I tried to save
a checkpoint with torch.save, at which point I got the following traceback:
```
THCudaCheck FAIL file=/home/zhang/src/pytorch/torch/csrc/generic/serialization.cpp line=38 error=2 : out of memory
Traceback (most recent call last):
  ...
  File "/home/zhang/pytorch/packages/torchmed/utils/trainer.py", line 152, in _snapshot
    torch.save(state_dict, filename)
  File "/home/zhang/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 120, in save
    return _save(obj, f, pickle_module, pickle_protocol)
  File "/home/zhang/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 192, in _save
    serialized_storages[key]._write_file(f)
RuntimeError: cuda runtime error (2) : out of memory at /home/zhang/src/pytorch/torch/csrc/generic/serialization.cpp:38
```
Does that mean I should reserve some memory for checkpoint saving? If so, how much should I reserve?
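To make the question concrete, here is a minimal sketch of a workaround I'm considering: copying every tensor in the state dict to the CPU before serializing, so that torch.save shouldn't need any extra GPU memory. The helper name `save_checkpoint_on_cpu` is mine, not from torchmed. Would this be the right approach?

```python
import torch

def save_checkpoint_on_cpu(state_dict, filename):
    # Copy every tensor to host memory first, so serialization
    # does not have to allocate anything on the GPU.
    cpu_state = {k: v.cpu() for k, v in state_dict.items()}
    torch.save(cpu_state, filename)
```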
BTW, by the time I make a checkpoint, the training and testing passes have already finished and the
output loss has gone out of scope, which, I think, means the GPU memory it used could be freed, so there should be enough memory for the snapshot. Am I wrong about how the memory is freed?
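Here is my mental model of the freeing mechanism, as a minimal sketch (the `torch.cuda.empty_cache()` call at the end is an assumption on my part; I'm not sure it exists in the version I built from source):

```python
import torch

x = torch.randn(1024, 1024).cuda()  # stand-in for my loss / intermediate tensors

# Dropping the last Python reference should return the memory to
# PyTorch's caching allocator...
del x

# ...but, as I understand it, the allocator keeps those blocks cached
# rather than handing them back to the CUDA driver. I assume a call
# like this would actually release them, if my version has it:
torch.cuda.empty_cache()
```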
Many thanks for any suggestions!