CUDA out of memory when calling model.state_dict()

I have been training a very large model that occupies over 90% of GPU memory. Training runs fine until I call model.state_dict() to save a checkpoint, at which point it throws a CUDA out of memory error.

The same problem has been reported in:

The suggested fix is to move the model to the CPU before saving the checkpoint and then move it back to the GPU afterwards. However, I use distributed training, and the training program freezes after model.cuda().
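For reference, a variant of that workaround that avoids moving the whole module between devices is to copy each tensor in the state dict to the CPU individually before serializing. This is only a sketch of what I understood the suggestion to be (the function name and checkpoint layout are my own), not code from the linked report:

```python
import torch


def save_checkpoint(model: torch.nn.Module, path: str) -> None:
    # Copy each parameter/buffer tensor to CPU one at a time, so
    # torch.save never has to allocate extra GPU memory, and the
    # model itself stays on its current device (no model.cpu()/cuda()
    # round trip that could interfere with distributed training).
    cpu_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}
    torch.save(cpu_state, path)
```

In principle this sidesteps the freeze I hit, since the model is never moved off the GPU, but I am not sure it addresses the underlying allocation during state_dict() itself.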

Any help is appreciated!