Hi everyone,
I was trying to implement distributed training for my model and I ran into an issue when a new checkpoint (best model) has to be saved to disk. Whenever I run my training with 2 processes (one process per GPU) using DistributedDataParallel, the process with rank 1 stops without any error in the output, while the master process keeps working for a while (I think it will eventually stop too, since process 1 can no longer sync).
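For context, the model is wrapped more or less like this (simplified sketch; the linear layer is just a stand-in for my actual network, and the init_method / environment variables are omitted):

import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_model(rank, world_size, gpu):
    # one process per GPU; rank 0 is the "master" that writes checkpoints
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    net = nn.Linear(10, 10).cuda(gpu)  # stand-in for my real model
    model = DDP(net, device_ids=[gpu])
    return model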
This problem arises whenever I use the following piece of code to save my checkpoint:
if rank == 0 and loss_trend['dev'][-1] < best_loss:
    best_loss = loss_trend['dev'][-1]
    if cfg.LOGGING.CHECKPOINTS:
        assert ckp_file, 'Checkpoint file not defined'
        state_dict = model.module.cpu().state_dict()
        torch.save(state_dict, ckp_file)
        model.cuda(gpu)
        logger.log('best model saved at: {}'.format(ckp_file))
The problem disappears if I use deepcopy on my model, without touching (i.e. without moving to CPU) the original one:
if rank == 0 and loss_trend['dev'][-1] < best_loss:
    best_loss = loss_trend['dev'][-1]
    if cfg.LOGGING.CHECKPOINTS:
        assert ckp_file, 'Checkpoint file not defined'
        state_dict = copy.deepcopy(model.module).cpu().state_dict()
        torch.save(state_dict, ckp_file)
        logger.log('best model saved at: {}'.format(ckp_file))
Since the deepcopy introduces some overhead, I would like to know why it works in this case and whether there are other ways to solve my problem. Here my model is wrapped using DistributedDataParallel.
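One alternative I was considering is to copy only the state_dict tensors to CPU, without ever moving the model itself (same variable names as in the snippets above); I am not sure whether this avoids the hang or has other drawbacks:

if rank == 0 and loss_trend['dev'][-1] < best_loss:
    best_loss = loss_trend['dev'][-1]
    if cfg.LOGGING.CHECKPOINTS:
        assert ckp_file, 'Checkpoint file not defined'
        # copy each parameter/buffer to CPU instead of moving the whole model
        state_dict = {k: v.detach().cpu() for k, v in model.module.state_dict().items()}
        torch.save(state_dict, ckp_file)
        logger.log('best model saved at: {}'.format(ckp_file))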
Thank you