I am training models with DDP. I started several processes manually, and each process is responsible for training on one particular GPU. While a model is being trained in these processes, I send them messages. After receiving such a message, I would like each process to stop training the current model, release the old model's GPU memory, and start training another model.
However, if the training across these processes is not "synchronized", I can never successfully release the GPU memory of the "old model".
I am releasing the memory like this:
```python
def destruction(self):
    torch.cuda.synchronize()
    del self.optimizer
    del self.ddp_model
    del self.train_loader
    torch.cuda.empty_cache()
```
For example, when process A receives the message, it is still in the i-th training iteration, but when process B receives the message, it has already entered the (i+1)-th iteration. Then both processes enter `destruction()` (because they both received the message), and they hang there.
Is there any way to make sure that the training on each process is "synchronized" before they try to release the memory of the "old model"? Thanks!
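For what it's worth, the pattern I have been experimenting with (this is my assumption, not something I have confirmed works in my exact setup) is to call `torch.distributed.barrier()` at the top of the teardown, so every rank finishes its current iteration and reaches the barrier before any rank starts freeing memory. Below is a minimal, self-contained sketch; it runs single-process on CPU with the `gloo` backend just to show the shape of the code, and `destruction_with_barrier`, the `state` dict, and the plain `nn.Linear` standing in for the DDP-wrapped model are all made up for illustration:

```python
import os
import torch
import torch.distributed as dist

def destruction_with_barrier(state):
    # Every rank blocks here until all ranks have arrived, i.e. until all
    # ranks have finished their current training iteration. Only then does
    # any rank start dropping references to the old model.
    dist.barrier()
    del state["optimizer"]
    del state["ddp_model"]
    del state["train_loader"]
    if torch.cuda.is_available():
        # Wait for outstanding kernels, then return cached blocks to the driver.
        torch.cuda.synchronize()
        torch.cuda.empty_cache()

# Single-process demo setup (real runs would use the actual rank/world_size).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29531")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)  # stand-in for the DDP-wrapped model
state = {
    "optimizer": torch.optim.SGD(model.parameters(), lr=0.1),
    "ddp_model": model,
    "train_loader": [torch.randn(4) for _ in range(2)],
}
destruction_with_barrier(state)
print(sorted(state.keys()))  # prints []
dist.destroy_process_group()
```

The key point of the sketch is only the ordering: the `barrier()` happens before any `del`, so a rank that is still mid-iteration holds up the others instead of racing them.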