How can I run inference with a weight file trained by DDP (2 GPUs)?

I got the error below when I ran inference the same way as with a weight file trained on a single GPU.

"RuntimeError: Default process group has not been initialized, please make sure to call init_process_group."

Do I still need to initialize the distributed process group for inference on a single GPU?

Please advise. Thanks.

If you have a DistributedDataParallel model ddp_model, you can try to save the model parameters via sd = ddp_model.module.state_dict(). (Note the extra .module.) This saves the model parameters as if they were from a local non-DDP-version of the model. In that case, you should be able to load the model parameters via local_model.load_state_dict(sd) when running on a single CPU process / single GPU and without initializing a process group.
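A minimal sketch of this save/load flow (the model, filename, and single-process `gloo` group here are placeholders just to make the snippet self-contained; in real training the process group would already be set up by your launcher):

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Stand-in single-process group so DDP can be constructed in this sketch;
# in a real DDP training script this is already done per rank.
init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
dist.init_process_group("gloo", init_method=f"file://{init_file}",
                        rank=0, world_size=1)

model = nn.Linear(4, 2)          # placeholder architecture
ddp_model = DDP(model)

# Save the *inner* module's parameters -- note the extra .module.
ckpt_path = os.path.join(tempfile.gettempdir(), "weight_ddp.pth")
torch.save(ddp_model.module.state_dict(), ckpt_path)
dist.destroy_process_group()

# Later, in a plain single-GPU/CPU script with NO process group:
local_model = nn.Linear(4, 2)    # same architecture as during training
local_model.load_state_dict(torch.load(ckpt_path))
local_model.eval()
```

Because only plain tensors were saved, `torch.load` never needs to reconstruct the DDP wrapper, so no process group is required at inference time.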

Hi Andrew, noted, thanks. Let me try it and I'll update you with the result :slight_smile:

@agu I revised my checkpoint to use `model_ddp.module.state_dict()` and then saved it with `torch.save(dict_model, 'weight_ddp.pth')`. When I run my inference.py, there is still a RuntimeError at the line
==> `checkpoint = torch.load(checkpoint_file)`

The code is:

```python
dict_model = {
    'model': model_ddp,
    'state_dict': model_ddp.module.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': epoch + 1,
}
```

```python
checkpoint_file = 'weight_ddp.pth'
checkpoint = torch.load(checkpoint_file)  # RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
model = checkpoint['model']
model.load_state_dict(checkpoint['state_dict'])
model.eval()
```

Avoid storing the entire model and just save the state_dict instead.
In particular, remove the `'model': model_ddp` key-value pair from `dict_model`. Pickling the whole DDP object is what forces `torch.load` to try to restore the wrapper, which in turn looks for an initialized process group.

@ptrblck Noted and thanks.