@mrshenli thanks again. I will try to answer all your questions in more detail a bit later today.
Unfortunately, I could not use your script as is, because my DDP checkpoint had already been saved by calling state_dict() on the DDP wrapper itself (i.e. without .module).
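For context, this is roughly how the checkpoint had been written. A minimal sketch of the saving side, where ddp_model and save_path are hypothetical names; note that state_dict() is taken from the DDP wrapper, not from .module:

```python
import torch
import torch.distributed as dist

# Sketch of the saving side (ddp_model and save_path are hypothetical names):
# only rank 0 writes, and state_dict() comes from the DDP wrapper itself,
# so every key keeps the "module." prefix.
if dist.get_rank() == 0:
    torch.save({'model_state_dict': ddp_model.state_dict()}, save_path)
```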
So, as for the minor changes, I did the following:
```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.nn.parallel import DistributedDataParallel as DDP

# test_dataset, get_model and test_function come from the rest of my script.

def main(args):
    torch.distributed.init_process_group(backend='nccl', init_method='env://')

    test_loader = DataLoader(
        test_dataset,
        batch_size=args.test_batch_size,
        shuffle=False,
        num_workers=args.num_workers,
        pin_memory=True)

    model = get_model()

    #############################################################
    # My changes
    torch.cuda.set_device(args.local_rank)
    model = model.to(args.local_rank)
    model = DDP(model, device_ids=[args.local_rank],
                output_device=args.local_rank)

    checkpoint = torch.load(args.load_path)  # , map_location=map_location)
    state_dict = checkpoint['model_state_dict']
    # The checkpoint was saved from the DDP wrapper, so the keys already
    # carry the "module." prefix and load directly into the wrapped model.
    model.load_state_dict(state_dict)
    ##############################################################

    dist.barrier()
    test_function(model, test_loader, args.local_rank,
                  args.load_path.with_suffix('.csv'))
```
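Since map_location is commented out above: my understanding is that without it, torch.load deserializes tensors onto the device they were saved from. A minimal sketch of the remapping, assuming the checkpoint was written from cuda:0:

```python
# Remap storages saved from cuda:0 onto this rank's own GPU, so ranks other
# than 0 don't first materialize the checkpoint on GPU 0. Assumes the
# checkpoint was saved from device cuda:0.
map_location = {'cuda:0': f'cuda:{args.local_rank}'}
checkpoint = torch.load(args.load_path, map_location=map_location)
```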
I trained resnet18 from scratch; I just copied the resnet script and used it locally.
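In case it matters, a sketch of what my get_model() boils down to, assuming the stock torchvision resnet18 with its default head:

```python
import torchvision

def get_model():
    # Sketch: equivalent to the locally copied resnet script for a model
    # trained from scratch (no pretrained weights).
    return torchvision.models.resnet18(pretrained=False)
```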
As for your last two comments: I did use only rank 0 to save the DDP model, but I saved the state_dict() of the DDP wrapper itself (without .module). That is why, when I used your script, I also had to remove the .module, similar to this:
[solved] KeyError: ‘unexpected key “module.encoder.embedding.weight” in state_dict’
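That thread's workaround goes in the opposite direction: to load a DDP-saved state_dict into a bare (non-DDP) model, you strip the prefix from the keys. A minimal sketch, where bare_model is a hypothetical name for the unwrapped network:

```python
# Strip the leading "module." that DDP's state_dict() adds to every key,
# so the checkpoint loads into an unwrapped model.
state_dict = checkpoint['model_state_dict']
stripped = {k[len('module.'):] if k.startswith('module.') else k: v
            for k, v in state_dict.items()}
bare_model.load_state_dict(stripped)
```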
Is it correct to do so?