How can we run inference with a weight file trained with DDP (2 GPUs)?

I got the error below when I ran inference the same way I do with a weight file trained on a single GPU.

“RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.”

Do I still need to initialize the distributed process group for inference on a single GPU?

Please advise. Thanks.

If you have a DistributedDataParallel model ddp_model, you can try to save the model parameters via sd = ddp_model.module.state_dict(). (Note the extra .module.) This saves the model parameters as if they were from a local non-DDP-version of the model. In that case, you should be able to load the model parameters via local_model.load_state_dict(sd) when running on a single CPU process / single GPU and without initializing a process group.
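A minimal sketch of that round trip, for illustration only. The build_model helper, the tiny nn.Linear architecture, the single-process gloo group on CPU, and the temporary paths are all assumptions made so the snippet runs on one machine; in a real job the process group would come from your 2-GPU launcher.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def build_model() -> nn.Module:
    # Hypothetical stand-in for the real architecture.
    return nn.Linear(4, 2)


workdir = tempfile.mkdtemp()
weights_path = os.path.join(workdir, "weight_ddp.pth")

# --- Training side: a process group is required to construct DDP. ---
# Assumption: a one-process gloo group on CPU stands in for the 2-GPU job.
dist.init_process_group(
    backend="gloo",
    init_method="file://" + os.path.join(workdir, "rendezvous"),
    rank=0,
    world_size=1,
)
ddp_model = DDP(build_model())

# Note the extra .module: the keys carry no "module." prefix and the
# saved file contains only tensors, not the DDP wrapper itself.
sd = ddp_model.module.state_dict()
torch.save(sd, weights_path)
dist.destroy_process_group()

# --- Inference side: no process group is initialized here. ---
local_model = build_model()
local_model.load_state_dict(torch.load(weights_path, map_location="cpu"))
local_model.eval()
with torch.no_grad():
    output = local_model(torch.zeros(1, 4))
```

With map_location="cpu" the weights load on a machine without GPUs; afterwards you can move local_model to a single GPU with local_model.to("cuda") if one is available.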

Hi Andrew, noted and thanks. I'll try it and update you with the result.

@agu I revised my checkpoint to use model_ddp.module.state_dict() and saved it to 'weight_ddp.pth'. When I ran it, there was still a RuntimeError at this line:
==> checkpoint = torch.load(checkpoint_file)

The code is:
dict_model = {
    'model': model_ddp,
    'state_dict': model_ddp.module.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': epoch + 1,
}

checkpoint_file = 'weight_ddp.pth'
checkpoint = torch.load(checkpoint_file)  # RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
model = checkpoint['model']

Avoid storing the entire model and just save the state_dict instead.
In particular, remove the 'model': model_ddp key-value pair from dict_model.
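As a rough sketch of what the checkpoint could look like after that change: the dict keeps only state_dicts and plain values, and inference rebuilds the model before loading the weights instead of reading checkpoint['model']. The build_model helper, the nn.Linear architecture, the SGD optimizer, and the temporary path are assumptions for illustration; a plain module stands in for model_ddp.module so the snippet runs without a process group.

```python
import os
import tempfile

import torch
import torch.nn as nn


def build_model() -> nn.Module:
    # Hypothetical stand-in for the real architecture.
    return nn.Linear(4, 2)


# Training side. Assumption: `trained` plays the role of model_ddp.module.
trained = build_model()
optimizer = torch.optim.SGD(trained.parameters(), lr=0.1)
epoch = 0

dict_model = {
    # No 'model' entry: the DDP wrapper object is never pickled.
    "state_dict": trained.state_dict(),
    "optimizer": optimizer.state_dict(),
    "epoch": epoch + 1,
}
checkpoint_file = os.path.join(tempfile.mkdtemp(), "weight_ddp.pth")
torch.save(dict_model, checkpoint_file)

# Inference side: loads without init_process_group.
checkpoint = torch.load(checkpoint_file, map_location="cpu")
model = build_model()
model.load_state_dict(checkpoint["state_dict"])
model.eval()
```

The likely reason the original file failed is that torch.load had to unpickle the whole DDP object stored under 'model', and restoring a DDP object looks up the default process group. A checkpoint made only of state_dicts avoids that step.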

@ptrblck Noted and thanks.