Model evaluation after DDP training

This is maybe a more general question, but I cannot find information about this anywhere.
There are a lot of tutorials on how to train your model with DDP, and that seems to work fine for me. However, once the training is done, how do you do the evaluation?
When I train on 2 nodes with 4 GPUs each and call dist.destroy_process_group() after training, the evaluation still runs 8 times, with 8 different results. In my training loop I save the model only on rank 0. Then for the testing loop I load the model using model.load_state_dict(torch.load('./', map_location={'cuda:%d' % 0: 'cuda:%d' % device})). Additionally, for testing I set test_sampler=None.
If needed I could provide more details of my code but I only listed the things which I think might be relevant.

One thing that would help with debugging is to print the parameter values on each GPU (e.g. you could print the norm of each parameter of your model). If the values diverge across GPUs, then you may want to broadcast the value of each parameter from one rank to all GPUs before you destroy the process group.
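
A minimal sketch of that check, assuming it runs on every rank before dist.destroy_process_group(). The helper names param_norms and broadcast_from_rank0 and the nn.Linear stand-in model are hypothetical and not taken from the original code; the broadcast helper does nothing when no process group is initialized, so the snippet also runs in a single process.

```python
import torch
import torch.distributed as dist
import torch.nn as nn


def param_norms(model):
    """Return {parameter name: L2 norm} so values can be compared across ranks."""
    return {
        name: p.detach().float().norm().item()
        for name, p in model.named_parameters()
    }


def broadcast_from_rank0(model):
    """Overwrite every rank's parameters and buffers with rank 0's copy.

    Does nothing if torch.distributed is unavailable or not initialized.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return
    # state_dict() tensors share storage with the live parameters/buffers,
    # so the in-place broadcast updates the model itself.
    for tensor in model.state_dict().values():
        dist.broadcast(tensor, src=0)


# Hypothetical stand-in for the trained model.
model = nn.Linear(4, 2)

rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
norms = param_norms(model)
for name, value in norms.items():
    print(f"rank {rank} | {name}: {value:.6f}")

# Only needed if the printed norms differ between ranks.
broadcast_from_rank0(model)
```

If the norms printed by all 8 processes match, the parameters are probably not the source of the differing results, and the data loading or the model's mode (train vs. eval) is the next thing to check.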

This is somewhat odd.
From your description, I think all 8 GPUs (processes) load the same parameters and data during inference. There should be no difference in results across these processes, unless your model contains nondeterministic operations that depend on the random seed.
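
As an illustration of the seed-dependent case, here is a small sketch assuming the model contains a dropout layer (the thread does not say whether it does; the model below is a hypothetical stand-in). In train mode dropout draws from the random number generator, so each process can produce a different output; in eval mode it is disabled and repeated forward passes agree.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fix the seed so the weights and input are reproducible

# Hypothetical model with one random layer (dropout).
model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5), nn.Linear(8, 2))
x = torch.randn(4, 8)

# eval() turns dropout into a no-op, so two passes give identical outputs.
model.eval()
with torch.no_grad():
    out_a = model(x)
    out_b = model(x)

same_in_eval = torch.equal(out_a, out_b)
print("identical outputs in eval mode:", same_in_eval)
```

Calling model.eval() before the test loop on every process is therefore worth checking first. If some randomness is still required at test time, setting the same seed on every rank with torch.manual_seed is one way to keep the processes consistent.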

My suggestion is to narrow down the issue: pick one data sample that produces the difference, then print the network output layer by layer.
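
One way to do this is with forward hooks. This is a sketch only: record_layer_outputs is a hypothetical helper, and the small nn.Sequential model and random sample stand in for the real network and the offending data sample.

```python
import torch
import torch.nn as nn


def record_layer_outputs(model, sample):
    """Run one forward pass and return [(layer name, output norm)] for leaf modules."""
    records = []
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            # Only tensor outputs are summarized; other output types are skipped.
            if isinstance(output, torch.Tensor):
                records.append((name, output.detach().float().norm().item()))
        return hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            handles.append(module.register_forward_hook(make_hook(name)))

    model.eval()
    with torch.no_grad():
        model(sample)

    for handle in handles:
        handle.remove()  # leave the model without hooks afterwards
    return records


# Hypothetical stand-ins for the real model and the problematic sample.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
sample = torch.randn(1, 8)

records = record_layer_outputs(model, sample)
for name, norm in records:
    print(f"layer {name}: output norm {norm:.6f}")
```

Running this on each rank with the same sample and comparing the printed lines should show the first layer at which the outputs start to differ.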