Getting an error: Default process group has not been initialized, please make sure to call init_process_group

ptrblck · April 1, 2023, 1:26am

Why would you have to train the model again if you save baseline’s state_dict? You would be able to create the baseline model afterwards and load its state_dict, wouldn’t you?

winchest · April 1, 2023, 3:00am

Thanks for your reply again! I might not have explained it clearly.
I save the baseline through this call in train.py:

torch.save(
    {
        'model': get_inner_model(model).state_dict(),
        'optimizer': optimizer.state_dict(),
        'rng_state': torch.get_rng_state(),
        'cuda_rng_state': torch.cuda.get_rng_state_all(),
        'baseline': baseline.state_dict()  # custom state_dict() defined by the baseline class
    },
    os.path.join(opts.save_dir, 'epoch-{}.pt'.format(epoch))
)

As you can see, I save the baseline through baseline.state_dict(), but state_dict() here is defined by the baseline class itself:

    def state_dict(self):
        return {
            'model': self.model,  # the whole model object, not its parameters
            'dataset': self.dataset,
            'epoch': self.epoch
        }
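A minimal sketch of one way to keep the whole module out of the baseline entry, assuming the baseline's model is an ordinary nn.Module (possibly wrapped in DataParallel or DistributedDataParallel). `BaselineSketch` and `unwrap` are hypothetical names, not taken from the original code; `unwrap` plays the role that `get_inner_model` appears to play in the post.

```python
import torch
from torch import nn


def unwrap(model):
    """Return the underlying module if `model` is a parallel wrapper (hypothetical helper)."""
    wrappers = (nn.DataParallel, nn.parallel.DistributedDataParallel)
    return model.module if isinstance(model, wrappers) else model


class BaselineSketch:
    """Hypothetical stand-in for the baseline class described in the post."""

    def __init__(self, model, dataset, epoch):
        self.model = model
        self.dataset = dataset
        self.epoch = epoch

    def state_dict(self):
        return {
            # Store the parameter dict instead of the module object itself
            "model": unwrap(self.model).state_dict(),
            "dataset": self.dataset,
            "epoch": self.epoch,
        }

    def load_state_dict(self, state):
        # The model must already exist; only its parameters are restored
        unwrap(self.model).load_state_dict(state["model"])
        self.dataset = state["dataset"]
        self.epoch = state["epoch"]


# Round trip: save from one baseline, restore into a freshly built one
baseline = BaselineSketch(nn.Linear(4, 2), dataset=torch.zeros(3, 4), epoch=5)
saved = baseline.state_dict()

restored = BaselineSketch(nn.Linear(4, 2), dataset=None, epoch=0)
restored.load_state_dict(saved)
```

With this layout the 'model' entry holds only tensors, so reading the checkpoint back does not have to reconstruct a module object.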

Therefore the ‘pt’ file I have now includes a whole model object (the baseline part) and the parameters (the model part). When I try to evaluate the model, I load the model part through:

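def torch_load_cpu(load_path):
    # Map every saved tensor onto the CPU, whatever device it was saved from
    return torch.load(load_path, map_location=lambda storage, loc: storage)

load_data = torch_load_cpu(model_filename)
# Start from the current parameters and overwrite them with the saved 'model' entries
model.load_state_dict({**model.state_dict(), **load_data.get('model', {})})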

But when it reaches torch_load_cpu(), because of the whole model stored in the baseline part, I get the error: “RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.”
So I was wondering if there is a way to load only the model part.
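The error is consistent with how pickle-based loading works: the file is deserialized as a whole, so every stored object is rebuilt even when only one key is wanted. The sketch below illustrates this with the standard library only. `NeedsRuntime` and `RUNTIME_READY` are hypothetical stand-ins for an object that needs an initialized process group when it is reconstructed; the post does not say how the baseline model was wrapped, so this is an assumption about the cause.

```python
import pickle

RUNTIME_READY = False  # hypothetical stand-in for "process group initialized"


class NeedsRuntime:
    """Hypothetical object whose reconstruction depends on global runtime state."""

    def __init__(self):
        self.weights = [1.0, 2.0]

    def __setstate__(self, state):
        # Called by pickle while rebuilding the object
        if not RUNTIME_READY:
            raise RuntimeError("runtime has not been initialized")
        self.__dict__.update(state)


# A checkpoint with plain parameters plus one whole object, as in the post
checkpoint = {"model": {"w": [0.5, 0.25]}, "baseline": {"model": NeedsRuntime()}}
blob = pickle.dumps(checkpoint)

# Loading fails even though only the 'model' key is of interest
try:
    pickle.loads(blob)
    load_failed = False
except RuntimeError:
    load_failed = True

# One-off conversion: load once with the runtime available, keep only what is needed
RUNTIME_READY = True
full = pickle.loads(blob)
slim_blob = pickle.dumps({"model": full["model"]})

# The slimmed checkpoint loads without the runtime
RUNTIME_READY = False
slim = pickle.loads(slim_blob)
```

In PyTorch terms, the analogous one-off conversion would be to load the existing checkpoint once in a process where torch.distributed.init_process_group has been called (as the error message requests), then re-save it with the baseline's parameters instead of the module. Whether that is sufficient here depends on how the baseline model was wrapped, which the post does not state.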

ptrblck:

Why would you have to train the model again if you save baseline’s state_dict? You would be able to create the baseline model afterwards and load its state_dict, wouldn’t you?

Right now, for the baseline, the ‘pt’ file I have contains only the whole model object.