I have been training a very large model that occupies over 90% of GPU memory. Training works fine until `model.state_dict()` is called to save a checkpoint, at which point it throws a CUDA out-of-memory error.
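For concreteness, here is a minimal sketch of the checkpointing step that fails. The tiny `nn.Linear` is a stand-in for the actual large model, and the file name is a placeholder; in my real run the model is on the GPU when `state_dict()` raises the OOM error:

```python
import torch
import torch.nn as nn

# Stand-in for the large model; in the real run this is already on the GPU
# and GPU memory usage is above 90% during training.
model = nn.Linear(8, 8)
if torch.cuda.is_available():
    model = model.cuda()

# This is the step that raises "CUDA out of memory" in my training run.
checkpoint = {"model": model.state_dict()}
torch.save(checkpoint, "checkpoint.pt")
```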
The same problem has been reported in:
The suggested fix is to move the model to the CPU before saving the checkpoint and then move it back to the GPU. However, I use distributed training, and the training program freezes after `model.cuda()`.
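This is roughly what the suggested workaround looks like in my code (again with a tiny stand-in model and placeholder file name); the final `model.cuda()` is where the program hangs when running under distributed training:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the large model
if torch.cuda.is_available():
    model = model.cuda()

# Suggested workaround: move the model to the CPU so saving the
# state dict does not need any additional GPU memory.
model.cpu()
torch.save(model.state_dict(), "checkpoint_cpu.pt")

# Then move it back to the GPU to resume training. In my distributed
# training run, the program freezes at this call.
if torch.cuda.is_available():
    model.cuda()
```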
Any help is appreciated!