Hi,
I’m using DistributedDataParallel to train a simple classification model. I have some experience with distributed training, but I can’t seem to wrap my head around one specific detail.
Let me refer you to an example provided by PyTorch: examples/main.py at master · pytorch/examples · GitHub
Here, you will see that the accuracy is calculated by an accuracy() function, and the average accuracy is updated using the AverageMeter in the following lines.
From my understanding, this calculates the accuracy only for the samples that each GPU receives, not the accuracy across all GPUs. So this function returns top1.avg, and at L248 the model is checkpointed if the accuracy from GPU rank 0 is larger than the best accuracy so far.
Am I going crazy, or is this intended behavior? Are we assuming that all GPUs receive the same samples, or that the accuracy on GPU 0 is somehow representative of the overall accuracy?
To show that my interpretation is correct, I wrote a short sandbox script that mimics the code linked above. The AverageMeter and accuracy() functions were copied from the linked code base. The script assumes a 2-class classification scenario with batch_size=5, and I ran it on 4 GPUs.
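First, the two helpers as I copied them, reproduced so the snippet below is self-contained (this is how I remember them from main.py; the exact revision in the repo may differ slightly):

import torch

class AverageMeter(object):
    """Computes and stores the average and current value."""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

def accuracy(output, target, topk=(1,)):
    """Computes the precision@k for the specified values of k."""
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)
        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))
        res = []
        for k in topk:
            correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res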
And the sandbox itself:

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

def run(gpu):
    acc_meter = AverageMeter()
    model = nn.Linear(10, 2).cuda(gpu)
    model = DistributedDataParallel(model, device_ids=[gpu])
    a = torch.randn(5, 10).cuda(gpu)            # batch of 5 random inputs
    gt = torch.randint(0, 2, (5, 1)).cuda(gpu)  # random binary labels
    outputs = model(a)
    acc = accuracy(outputs, gt, topk=(1,))      # per-GPU top-1 accuracy (%)
    acc_meter.update(acc[0], a.size(0))
    print("Avg: ", acc_meter.avg)
    print("Sum: ", acc_meter.sum)
    print("Count: ", acc_meter.count)
    return acc_meter.avg
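(For completeness, one way to launch this on 4 GPUs on a single node; the address and port here are arbitrary:)

import torch.distributed as dist
import torch.multiprocessing as mp

def worker(gpu, world_size):
    # One process per GPU; NCCL backend since we all-reduce GPU tensors
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:23456",
                            rank=gpu, world_size=world_size)
    run(gpu)
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)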
There are two issues:
- As suspected, returning acc_meter.avg will only return the accuracy for the current GPU. This means that saving a checkpoint or logging from rank=0 will only checkpoint or log the accuracy from rank=0.
- The accuracy calculation is wrong. The accuracy() function already divides by the batch_size, so returning acc_meter.avg divides by the batch_size again. The return value should be acc_meter.sum.
Ultimately, I would like to write code that uses DistributedDataParallel but still computes the accuracy correctly. For now, I have resorted to the following method:
- Compute num_correct on each GPU.
- All-reduce num_correct as well as num_samples, as such: dist.all_reduce(num_correct), dist.all_reduce(num_samples). For this step, num_samples must first be cast to a GPU tensor.
- Cast back to CPU, then update the average meter (see the sketch after this list).
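In code, the workaround looks roughly like this (global_accuracy is just an illustrative name):

import torch
import torch.distributed as dist

def global_accuracy(num_correct, num_samples, gpu):
    # Wrap the Python ints in CUDA tensors, since NCCL all_reduce
    # only operates on GPU tensors
    correct = torch.tensor(float(num_correct), device=f"cuda:{gpu}")
    samples = torch.tensor(float(num_samples), device=f"cuda:{gpu}")
    dist.all_reduce(correct)  # default reduction op is SUM
    dist.all_reduce(samples)
    # Back to CPU scalars, then compute the global accuracy in percent
    return 100.0 * correct.item() / samples.item()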
To me, this does not seem like an elegant solution. Perhaps the answer is an AverageMeter that can handle distributed updates, something like the sketch below? In search of a more elegant solution, I’ve looked at multiple code bases, but they all seem to do it incorrectly, as shown above. Am I completely missing something big here? If anyone has suggestions/solutions, I would love to hear from you.
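A rough sketch of what I have in mind (DistributedAverageMeter is hypothetical, not taken from any of the code bases above):

import torch
import torch.distributed as dist

class DistributedAverageMeter:
    # Accumulate locally, then synchronize across ranks on demand
    def __init__(self, device):
        self.device = device
        self.reset()

    def reset(self):
        self.sum = 0.0
        self.count = 0

    def update(self, val, n=1):
        self.sum += val * n
        self.count += n

    def all_reduce(self):
        # Sum the running statistics across all ranks in one call
        t = torch.tensor([self.sum, self.count],
                         dtype=torch.float64, device=self.device)
        dist.all_reduce(t)
        self.sum, self.count = t[0].item(), int(t[1].item())

    @property
    def avg(self):
        return self.sum / max(self.count, 1)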