What changes do I need to make to metrics calculation when using Distributed Data Parallel for multi-GPU training?

I am updating my training script to use Distributed Data Parallel (DDP) for multi-GPU training.
I have completed most of the steps described in the PyTorch guidelines.
But I am confused about how to handle metric calculation and visualization.
For example:
I need to calculate accuracy, and I have 4 samples in total and 2 GPUs.
When I run testing, each process will have predictions and ground truths for 2 samples.
Now, if I want to calculate accuracy, do I need to call dist.reduce, or is it not needed and I can directly calculate accuracy in the rank 0 process?

Do you need to calculate local metrics or global metrics? To calculate global metrics, you would need to communicate between the processes; see Distributed communication package - torch.distributed — PyTorch 2.1 documentation.
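
A minimal sketch of that idea, assuming the predictions and targets are tensors on the current rank's device and the process group is already initialized; the function name global_accuracy is illustrative, not part of any PyTorch API:

```python
import torch
import torch.distributed as dist

def global_accuracy(preds: torch.Tensor, targets: torch.Tensor) -> float:
    """Compute accuracy over all processes by all-reducing the local counts."""
    # Local counts on this rank (e.g. 2 samples per rank in the example above)
    correct = (preds == targets).sum()
    total = torch.tensor(preds.numel(), device=correct.device)

    # Sum the counts across ranks; every rank ends up with the global values
    dist.all_reduce(correct, op=dist.ReduceOp.SUM)
    dist.all_reduce(total, op=dist.ReduceOp.SUM)

    return (correct.float() / total.float()).item()
```

Reducing the raw counts (correct, total) rather than per-rank accuracies keeps the result exact even if ranks end up with different numbers of samples.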

Thanks a lot for the quick reply, it was really helpful. I have understood how to use the reduce functions to solve the issue, and I now get the same metrics as I was getting with a single GPU.

Could you please share whether there is any way to gather dictionaries created on multiple processes onto the rank 0 process? As far as I can tell, the reduce/gather functions only work on tensors.
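
In case it helps, torch.distributed also ships object-based collectives (all_gather_object / gather_object) that work on arbitrary picklable Python objects rather than only tensors. A minimal sketch, assuming an initialized process group; the function name gather_metric_dicts is illustrative:

```python
import torch.distributed as dist

def gather_metric_dicts(local_metrics: dict):
    """Gather one (picklable) metrics dict per rank; every rank gets the full list."""
    gathered = [None] * dist.get_world_size()   # one slot per rank
    dist.all_gather_object(gathered, local_metrics)
    return gathered   # e.g. only keep / log the result on rank 0
```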

Could I just use rank 0 to calculate everything? I know the speed will be slower (which is not noticeable in my case), but the accuracy should be the same, right?

For calculating global accuracy on rank 0, don't change the sampler of the test DataLoader to a DistributedSampler, and only use rank 0 to run the testing loop.
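
A rough sketch of that approach, assuming the process group is already initialized and that the unwrapped module (e.g. ddp_model.module) is passed in so no collective calls happen inside DDP's forward; the function name, batch size, and argmax-based accuracy are illustrative:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader

def evaluate_on_rank0(model, test_dataset, device):
    """Run the whole test set on rank 0 only; the other ranks just wait."""
    if dist.get_rank() == 0:
        # Plain DataLoader (no DistributedSampler), so rank 0 sees every sample
        loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for inputs, targets in loader:
                inputs, targets = inputs.to(device), targets.to(device)
                preds = model(inputs).argmax(dim=1)  # assumes a classification model
                correct += (preds == targets).sum().item()
                total += targets.numel()
        print(f"test accuracy: {correct / total:.4f}")
    # Keep all ranks in sync before training continues
    dist.barrier()
```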