I want to train multiple models, one after another, in a single Python script. After training each model, I run the following to release memory, so that there is enough memory for training the next model:
def destruction(self):
    # Wait for all outstanding CUDA work on this rank's device to finish
    torch.cuda.synchronize(device=self._get_device())
    # Tear down the process group used by DDP
    dist.destroy_process_group(group=self.group)
    # Drop the references that hold GPU tensors
    del self.optimizer
    del self.ddp_model
    del self.train_loader
    # Return cached blocks on this device to the driver
    torch.cuda.set_device(device=self._get_device())
    torch.cuda.empty_cache()
    torch.cuda.synchronize(device=self._get_device())
However, nvidia-smi shows that after each call to destruction(), some GPU memory is still allocated, and the amount of unreleased memory increases as I train more models. For example, after training the 3rd model and calling destruction(), the memory allocation looks like this:

Then, after training the 4th model, the memory allocation looks like this:
Eventually, this leads to an OOM error during training.
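For reference, this is a minimal sketch of how I could check, from inside the script, what the PyTorch caching allocator itself reports right after destruction() returns (the cuda:0 device index is just an example, and these counters do not include the CUDA context overhead that nvidia-smi shows):

import torch

# Minimal sketch (not from the snippet above): compare what the caching allocator
# reports against what nvidia-smi shows after cleanup. cuda:0 is an assumed index.
device = torch.device("cuda:0")
torch.cuda.synchronize(device)
allocated = torch.cuda.memory_allocated(device)  # bytes held by live tensors
reserved = torch.cuda.memory_reserved(device)    # bytes the allocator keeps cached
print(f"allocated: {allocated / 2**20:.1f} MiB, reserved: {reserved / 2**20:.1f} MiB")

My understanding is that allocated counts memory held by live tensors while reserved also includes blocks the allocator keeps cached, so comparing the two against nvidia-smi should help narrow down where the unreleased memory is.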
Did I miss a step to clear unused CUDA memory, or did I forget to delete something that remains in CUDA memory? I would really appreciate any help!