How to sync the training in different processes?

I am training models with DDP. I started several processes manually, each responsible for training on one particular GPU. While a model is being trained, I send a message to each of these processes. On receiving such a message, each process should stop training the current model, release the GPU memory of the old model, and start training another model.
However, if training across these processes is not “synchronized”, I can never successfully release the GPU memory of the “old model”.

I am releasing the memory like this:

def destruction(self):
    del self.optimizer
    del self.ddp_model
    del self.train_loader

For example, when process A receives the message, it is still in the i-th training iteration, while process B has already entered the (i+1)-th iteration by the time it receives the message. Both processes then enter destruction() (because both received the message), and they hang there.

Is there any way to make sure that training on each process is “synchronized” before they try to release the memory of the “old model”? Thanks!

Which data parallel API are you using, DataParallel or DistributedDataParallel (aka DDP)? With DDP you can call the torch.distributed.barrier() function to explicitly synchronize your workers. Check out our docs for further info.
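To illustrate, here is a minimal sketch of using torch.distributed.barrier() to make every rank wait before tearing down the old model. The Trainer class, its attributes, and the single-process "gloo" group are hypothetical stand-ins for your setup, not your actual code; in a real DDP job each worker process would join the group with its own rank, and the barrier would block until all ranks reach destruction().

```python
import os
import torch.distributed as dist


class Trainer:
    """Hypothetical stand-in for the training object in the question."""

    def __init__(self):
        self.optimizer = object()
        self.ddp_model = object()
        self.train_loader = object()

    def destruction(self):
        # Block here until every rank arrives, so no rank frees the
        # old model while another is still mid-iteration.
        dist.barrier()
        del self.optimizer
        del self.ddp_model
        del self.train_loader


def main():
    # Single-process group for demonstration only; a real DDP job
    # launches one process per GPU with matching world_size.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=0, world_size=1)

    trainer = Trainer()
    trainer.destruction()  # barrier passes, then attributes are deleted

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Placing the barrier at the top of destruction() guarantees that the deletes on every process happen only after all processes have finished their current iteration, which avoids the hang described above.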

Thanks for such a quick reply!
