Hi everyone,
I was trying to implement distributed training for my model and I ran into an issue when a new checkpoint (best model) has to be saved to disk. Whenever I run my training with 2 processes (one process per GPU) using DistributedDataParallel, the process with rank 1 stops without any error in the output, while the master process keeps working for a while (I think it will eventually stop too, since process 1 can no longer sync).
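For context, the model is wrapped more or less like this (simplified sketch; the linear layer is just a stand-in for my actual network, and the init_method / environment variables are omitted):

import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_model(rank, world_size, gpu):
    # one process per GPU; rank 0 is the "master" that writes checkpoints
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    net = nn.Linear(10, 10).cuda(gpu)  # stand-in for my real model
    model = DDP(net, device_ids=[gpu])
    return model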
This problem arises whenever I use the following piece of code to save my checkpoint:
if rank == 0 and loss_trend['dev'][-1] < best_loss:
    best_loss = loss_trend['dev'][-1]
    if cfg.LOGGING.CHECKPOINTS:
        assert ckp_file, 'Checkpoint file not defined'
        state_dict = model.module.cpu().state_dict()
        torch.save(state_dict, ckp_file)
        model.cuda(gpu)
        logger.log('best model saved at: {}'.format(ckp_file))
The problem disappears if I use deepcopy on my model, without touching (i.e. without moving to CPU) the original one:
if rank == 0 and loss_trend['dev'][-1] < best_loss:
    best_loss = loss_trend['dev'][-1]
    if cfg.LOGGING.CHECKPOINTS:
        assert ckp_file, 'Checkpoint file not defined'
        state_dict = copy.deepcopy(model.module).cpu().state_dict()
        torch.save(state_dict, ckp_file)
        logger.log('best model saved at: {}'.format(ckp_file))
Since the deepcopy introduces some overhead, I would like to know why it works in this case and whether there are other ways to solve my problem. Here my model is wrapped using DistributedDataParallel.
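One alternative I was considering is to copy only the state_dict tensors to CPU, without ever moving the model itself (same variable names as in the snippets above); I am not sure whether this avoids the hang or has other drawbacks:

if rank == 0 and loss_trend['dev'][-1] < best_loss:
    best_loss = loss_trend['dev'][-1]
    if cfg.LOGGING.CHECKPOINTS:
        assert ckp_file, 'Checkpoint file not defined'
        # copy each parameter/buffer to CPU instead of moving the whole model
        state_dict = {k: v.detach().cpu() for k, v in model.module.state_dict().items()}
        torch.save(state_dict, ckp_file)
        logger.log('best model saved at: {}'.format(ckp_file))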
Thank you