What changes do I need to make to metrics calculation when using Distributed Data Parallel for multi-GPU training?

I am updating my training script to use Distributed Data Parallel (DDP) for multi-GPU training.
I have completed most of the steps described in the PyTorch guidelines.
But I am confused about how to handle metric calculation and visualization.
For example:
I need to calculate accuracy, and I have 4 samples in total and 2 GPUs.
When I run testing, each process will have predictions and ground truths for 2 samples.
Now, if I want to calculate accuracy, do I need to call dist.reduce, or is it not needed and I can directly calculate accuracy in the rank 0 process?

Do you need to calculate local metrics or global metrics? To calculate global metrics, you would need to communicate between the processes; see Distributed communication package - torch.distributed — PyTorch 2.1 documentation.
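
A minimal sketch of that idea, assuming the predictions and targets are tensors on the current rank's device and the process group is already initialized; the function name global_accuracy is illustrative, not part of any PyTorch API:

```python
import torch
import torch.distributed as dist

def global_accuracy(preds: torch.Tensor, targets: torch.Tensor) -> float:
    """Compute accuracy over all processes by all-reducing the local counts."""
    # Local counts on this rank (e.g. 2 samples per rank in the example above)
    correct = (preds == targets).sum()
    total = torch.tensor(preds.numel(), device=correct.device)

    # Sum the counts across ranks; every rank ends up with the global values
    dist.all_reduce(correct, op=dist.ReduceOp.SUM)
    dist.all_reduce(total, op=dist.ReduceOp.SUM)

    return (correct.float() / total.float()).item()
```

Reducing the raw counts (correct, total) rather than per-rank accuracies keeps the result exact even if ranks end up with different numbers of samples.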

Thanks a lot for the quick reply, it was really helpful. I have understood how to use the reduce functions to solve the issue, and I now get the same metrics as I was getting with a single GPU.

Could you please share whether there is any way to gather dictionaries created on multiple processes onto the rank 0 process? As far as I can tell, the reduce/gather functions only work on tensors.
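
In case it helps, torch.distributed also ships object-based collectives (all_gather_object / gather_object) that work on arbitrary picklable Python objects rather than only tensors. A minimal sketch, assuming an initialized process group; the function name gather_metric_dicts is illustrative:

```python
import torch.distributed as dist

def gather_metric_dicts(local_metrics: dict):
    """Gather one (picklable) metrics dict per rank; every rank gets the full list."""
    gathered = [None] * dist.get_world_size()   # one slot per rank
    dist.all_gather_object(gathered, local_metrics)
    return gathered   # e.g. only keep / log the result on rank 0
```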

Could I just use rank 0 to calculate everything? I know the speed will be slower (which is not noticeable in my case), but the accuracy should be the same, right?

For calculating global accuracy on rank 0, don't change the sampler of the test DataLoader to a DistributedSampler, and only use rank 0 to run the testing loop.
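
A rough sketch of that approach, assuming the process group is already initialized and that the unwrapped module (e.g. ddp_model.module) is passed in so no collective calls happen inside DDP's forward; the function name, batch size, and argmax-based accuracy are illustrative:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader

def evaluate_on_rank0(model, test_dataset, device):
    """Run the whole test set on rank 0 only; the other ranks just wait."""
    if dist.get_rank() == 0:
        # Plain DataLoader (no DistributedSampler), so rank 0 sees every sample
        loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for inputs, targets in loader:
                inputs, targets = inputs.to(device), targets.to(device)
                preds = model(inputs).argmax(dim=1)  # assumes a classification model
                correct += (preds == targets).sum().item()
                total += targets.numel()
        print(f"test accuracy: {correct / total:.4f}")
    # Keep all ranks in sync before training continues
    dist.barrier()
```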