CUDA memory not released by torch.cuda.empty_cache()

I want to train multiple models in turn in one Python script. After training one model, I run the following to release memory, so that there is enough memory for training the next model:

import torch
import torch.distributed as dist

def destruction(self):
    # Wait for all pending kernels on this device before tearing anything down.
    torch.cuda.synchronize(device=self._get_device())
    # Destroy the DDP process group used by this model.
    dist.destroy_process_group(group=self.group)
    # Drop the references that hold CUDA tensors.
    del self.optimizer
    del self.ddp_model
    del self.train_loader
    torch.cuda.set_device(device=self._get_device())
    # Release cached blocks back to the driver and wait for completion.
    torch.cuda.empty_cache()
    torch.cuda.synchronize(device=self._get_device())

However, nvidia-smi shows that some GPU memory is still allocated after each call to destruction(), and the unreleased memory increases as I train more models. For example, after training the 3rd model and calling destruction(), the memory allocation looks like this:

Then, after training the 4th model, the memory allocation looks like this:

Eventually, this leads to an OOM error during training.

Did I miss a step to clear unused CUDA memory? Or did I forget to delete something that remains in CUDA memory? I would really appreciate any help!

torch.cuda.empty_cache() frees the memory cached by PyTorch's allocator so that other processes can reuse it.
However, within the same Python process it won't avoid OOM issues, since PyTorch can already reuse its own cache; it will only slow down the code, because freed blocks have to be requested from the driver again.
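To illustrate the difference (a minimal standalone sketch, unrelated to your training code): memory_allocated() tracks memory used by live tensors, while memory_reserved() tracks the allocator cache, which is what nvidia-smi sees.

import torch

x = torch.randn(1024, 1024, device="cuda")    # allocates ~4 MB for a live tensor
print(torch.cuda.memory_allocated())          # memory used by live tensors
print(torch.cuda.memory_reserved())           # memory held in PyTorch's cache

del x                                         # the tensor is gone ...
print(torch.cuda.memory_allocated())          # ... so this drops back
print(torch.cuda.memory_reserved())           # ... but the cache (and nvidia-smi) stays up

torch.cuda.empty_cache()                      # return unused cached blocks to the driver
print(torch.cuda.memory_reserved())           # now reduced; other processes can reuse it

As long as memory_allocated() returns to roughly zero after your cleanup, the usage still shown by nvidia-smi is just the cache plus the CUDA context, not a leak.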

Based on the reported issue, I would assume that you haven't deleted all references to the model, activations, optimizer, etc., so some tensors are still alive.
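As a rough way to narrow this down (a sketch with an assumed helper name, not part of any PyTorch API), you could check what is still alive right after your del calls:

import gc
import torch

def report_leaked_cuda_memory():
    # Call this after deleting your own references (model, optimizer, loader,
    # and any lists of losses/outputs that were stored without .detach()).
    gc.collect()                               # break reference cycles first
    torch.cuda.empty_cache()                   # then release cached blocks to the driver
    allocated = torch.cuda.memory_allocated()  # memory still used by live tensors
    if allocated > 0:
        # Something still references CUDA tensors; the summary helps locate it.
        print(torch.cuda.memory_summary())
    return allocated

A common culprit is appending losses or outputs to a Python list without calling .detach(), which keeps the whole computation graph (and its activations) alive across iterations.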