Saving checkpoints with DistributedDataParallel

Hi everyone,
I was trying to implement distributed training for my model and I ran into an issue when a new checkpoint (best model) has to be saved to disk. Whenever I run my training with 2 processes (one process per GPU) using DistributedDataParallel, the process with rank 1 stops without any error in the output, while the master process keeps working for a while (I assume it will eventually stop too, since process 1 can no longer sync).
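
For context, my setup roughly follows the standard spawn-based pattern sketched below (just a minimal sketch: the toy Linear model, the nccl backend and the MASTER_ADDR/MASTER_PORT values are placeholders for my real model and configuration):

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel

    def train(rank, world_size):
        # placeholder rendezvous settings
        os.environ.setdefault('MASTER_ADDR', 'localhost')
        os.environ.setdefault('MASTER_PORT', '29500')
        dist.init_process_group('nccl', rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)
        gpu = rank  # one process per GPU
        model = torch.nn.Linear(10, 10).cuda(gpu)  # toy model standing in for my real one
        model = DistributedDataParallel(model, device_ids=[gpu])
        # ... training loop, with the checkpointing code shown below ...
        dist.destroy_process_group()

    if __name__ == '__main__':
        mp.spawn(train, args=(2,), nprocs=2)
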
This problem arises whenever I use the following piece of code to save my checkpoint:

        if rank == 0 and loss_trend['dev'][-1] < best_loss:
            best_loss = loss_trend['dev'][-1]
            if cfg.LOGGING.CHECKPOINTS:
                assert ckp_file, 'Checkpoint file not defined'
                # move the wrapped module to CPU and extract its state_dict
                state_dict = model.module.cpu().state_dict()
                torch.save(state_dict, ckp_file)
            # move the model back to the GPU after saving
            model.cuda(gpu)
            logger.log('best model saved at: {}'.format(ckp_file))

The problem is fixed if I use deepcopy on my model, without touching (i.e., without moving to CPU) the original one:

        if rank == 0 and loss_trend['dev'][-1] < best_loss:
            best_loss = loss_trend['dev'][-1]
            if cfg.LOGGING.CHECKPOINTS:
                assert ckp_file, 'Checkpoint file not defined'
                # deepcopy requires `import copy`; only the copy is moved to CPU,
                # so the original model stays on the GPU
                state_dict = copy.deepcopy(model.module).cpu().state_dict()
                torch.save(state_dict, ckp_file)
            logger.log('best model saved at: {}'.format(ckp_file))

Since the deepcopy introduces some overhead, I would like to know why it works in this case and whether there are other methods available to solve my problem. Here my model is wrapped with DistributedDataParallel.
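
For example, would something like the following also work (just a sketch of the kind of alternative I have in mind: copying the individual tensors to CPU instead of moving or deep-copying the whole module)?

        if rank == 0 and loss_trend['dev'][-1] < best_loss:
            best_loss = loss_trend['dev'][-1]
            if cfg.LOGGING.CHECKPOINTS:
                assert ckp_file, 'Checkpoint file not defined'
                # copy each parameter/buffer tensor to CPU without touching the live model
                state_dict = {k: v.detach().cpu() for k, v in model.module.state_dict().items()}
                torch.save(state_dict, ckp_file)
            logger.log('best model saved at: {}'.format(ckp_file))
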
Thank you

What is the error that you encounter when you don’t deepcopy the model? Does rank 0 just get stuck while rank 1 exits successfully? If so, do you know where rank 0 is stuck?