I am training models with DDP. I started several processes manually, and each process is responsible for training on one particular GPU. While a model is being trained in these processes, I send them messages. After receiving such a message, I would like each process to stop training the current model, release the old model's GPU memory, and start training another model.
However, if the training across these processes is not "synchronized", I can never successfully release the GPU memory of the "old model".
I am releasing the memory like this:
```python
def destruction(self):
    torch.cuda.synchronize()
    del self.optimizer
    del self.ddp_model
    del self.train_loader
    torch.cuda.empty_cache()
```
For example, when process A receives the message, it is still in the i-th training iteration, but when process B receives the message, it has already entered the (i+1)-th iteration. Then both processes enter `destruction()` (because they both received the message), and they hang there.
Is there any way to make sure that the training on each process is "synchronized" before they try to release the memory of the "old model"? Thanks!
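For what it's worth, the pattern I have been experimenting with (this is my assumption, not something I have confirmed works in my exact setup) is to call `torch.distributed.barrier()` at the top of the teardown, so every rank finishes its current iteration and reaches the barrier before any rank starts freeing memory. Below is a minimal, self-contained sketch; it runs single-process on CPU with the `gloo` backend just to show the shape of the code, and `destruction_with_barrier`, the `state` dict, and the plain `nn.Linear` standing in for the DDP-wrapped model are all made up for illustration:

```python
import os
import torch
import torch.distributed as dist

def destruction_with_barrier(state):
    # Every rank blocks here until all ranks have arrived, i.e. until all
    # ranks have finished their current training iteration. Only then does
    # any rank start dropping references to the old model.
    dist.barrier()
    del state["optimizer"]
    del state["ddp_model"]
    del state["train_loader"]
    if torch.cuda.is_available():
        # Wait for outstanding kernels, then return cached blocks to the driver.
        torch.cuda.synchronize()
        torch.cuda.empty_cache()

# Single-process demo setup (real runs would use the actual rank/world_size).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29531")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)  # stand-in for the DDP-wrapped model
state = {
    "optimizer": torch.optim.SGD(model.parameters(), lr=0.1),
    "ddp_model": model,
    "train_loader": [torch.randn(4) for _ in range(2)],
}
destruction_with_barrier(state)
print(sorted(state.keys()))  # prints []
dist.destroy_process_group()
```

The key point of the sketch is only the ordering: the `barrier()` happens before any `del`, so a rank that is still mid-iteration holds up the others instead of racing them.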