# Correct accuracy calculation with DistributedDataParallel

Hi,
I’m using DistributedDataParallel to train a simple classification model. I have some experience with distributed training, but I can’t seem to wrap my head around one specific detail.

Let me refer you to an example provided by PyTorch: examples/main.py at master · pytorch/examples · GitHub
Here, you will see that the accuracy is calculated by an `accuracy()` function, and the running average is updated with an `AverageMeter` in the lines that follow.
From my understanding, this calculates the accuracy only for the samples that each GPU receives, not the accuracy over the samples across all GPUs. The function then returns `top1.avg`, and at L248 the model is saved if the accuracy from GPU rank 0 is larger than the best accuracy so far.

Am I going crazy, or is this intended behavior? Are we assuming that all GPUs receive the same samples, or that the accuracy on GPU 0 is somehow representative of the overall accuracy?

To show that my interpretation is correct, I wrote a short sandbox script that mimics the code linked above. The `AverageMeter` and `accuracy()` functions were copied from the linked code base. The script assumes a 2-class classification scenario with `batch_size=5`, and I ran it on 4 GPUs:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

# AverageMeter and accuracy() are copied verbatim from the linked example.
# Assumes dist.init_process_group() has already been called in this process.
def run(gpu):
    acc_meter = AverageMeter()
    model = nn.Linear(10, 2).cuda(gpu)
    model = DistributedDataParallel(model, device_ids=[gpu])
    a = torch.randn(5, 10).cuda(gpu)            # batch_size=5, 10 features
    gt = torch.randint(0, 2, (5, 1)).cuda(gpu)  # 2-class targets
    outputs = model(a)
    acc = accuracy(outputs, gt, topk=(1,))
    acc_meter.update(acc[0], a.size(0))
    print("Avg: ", acc_meter.avg)
    print("Sum: ", acc_meter.sum)
    print("Count: ", acc_meter.count)
    return acc_meter.avg
```
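
For completeness, this is roughly how I spawn the sandbox on 4 GPUs (the `run` name above and the localhost rendezvous are just how I set up this toy script, not something from the example):

```python
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def main_worker(gpu, world_size):
    # one process per GPU, NCCL backend, single-node rendezvous
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=gpu, world_size=world_size)
    run(gpu)  # the sandbox function above
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(main_worker, args=(4,), nprocs=4)
```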

There are two issues:

1. As suspected, returning `acc_meter.avg` will only return the accuracy for the current GPU. This means that saving a checkpoint or logging from `rank=0` will only checkpoint or log the accuracy from `rank=0`.
2. The accuracy calculation is wrong. The `accuracy()` function divides by the `batch_size`, so returning `acc_meter.avg` divides by the `batch_size` again. The return value should be `acc_meter.sum`.

Ultimately, I would like to write code that uses `DistributedDataParallel` but can compute the accuracy correctly. For now, I have resorted to the following method:

1. Compute `num_correct` on each GPU.
2. All-reduce `num_correct` as well as `num_samples`: `dist.all_reduce(num_correct); dist.all_reduce(num_samples)`. For this step, `num_samples` must first be moved to the GPU as a tensor.
3. Move the results back to the CPU, then update the average meter (a rough sketch of these steps is shown below).
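
Concretely, steps 1 to 3 look roughly like this (the `global_accuracy` name and the `meter` argument are my own, and this assumes the process group is already initialized):

```python
import torch
import torch.distributed as dist

def global_accuracy(outputs, targets, meter):
    """Top-1 accuracy over all ranks, not just the local one."""
    device = outputs.device
    # 1. per-rank counts, kept on the GPU so they can be all-reduced
    preds = outputs.argmax(dim=1)
    num_correct = (preds == targets.view(-1)).float().sum()
    num_samples = torch.tensor(float(targets.numel()), device=device)
    # 2. sum the counts across every rank (default reduce op is SUM)
    dist.all_reduce(num_correct)
    dist.all_reduce(num_samples)
    # 3. back to plain Python numbers, then update the meter once
    global_acc = 100.0 * num_correct.item() / num_samples.item()
    meter.update(global_acc)
    return meter.avg
```

Every rank ends up with the same global number, so it no longer matters which rank does the logging or checkpointing.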

To me, this does not seem like an elegant solution. Perhaps this calls for an `AverageMeter` that can handle distributed updates? In search of a more elegant solution, I’ve looked at multiple code bases, but they all seem to do it incorrectly, as shown above. Am I completely missing something big here? If anyone has suggestions or solutions, I would love to hear from you.

In the code you linked, a `DistributedSampler` is used to ensure that each GPU is fed different data, which is the typical scenario in DDP training.
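
In other words, each rank iterates over its own disjoint shard of the dataset, roughly like this (toy dataset for illustration; assumes the process group is already initialized):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# toy dataset: 100 samples, 10 features, binary labels
dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))

# each of the 4 ranks gets a different ~25-sample shard of the indices
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=5, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # so the shuffling differs between epochs
    for images, target in loader:
        pass  # forward/backward as usual
```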

> or that the accuracy on GPU 0 is somehow representative of the overall accuracy
In general, we take the model on an arbitrary GPU, in this case GPU 0, and use it to measure the accuracy. The actual model is the same across all GPUs anyway, since the gradients are synchronized.
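
That is why checkpointing from rank 0 alone is usually fine; a minimal helper along these lines (the `maybe_save_best` name and signature are just illustrative, not the example's actual code) captures the pattern:

```python
import torch
import torch.distributed as dist
import torch.nn as nn

def maybe_save_best(model: nn.Module, acc1: float, best_acc1: float,
                    path: str = "model_best.pth") -> float:
    """Track the best accuracy and save a checkpoint from rank 0 only."""
    if acc1 > best_acc1:
        best_acc1 = acc1
        if dist.get_rank() == 0:
            # every DDP replica holds identical weights, so saving one is enough;
            # .module unwraps the DistributedDataParallel container
            torch.save(model.module.state_dict(), path)
    return best_acc1
```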

> The `accuracy()` function divides by the `batch_size`, so returning `acc_meter.avg` divides by the `batch_size` again. The return value should be `acc_meter.sum`.

Doesn’t `acc_meter.avg` take the average accuracy across all the updates? Or does it actually also divide by the batch size? If so, this seems incorrect and an issue should be filed in the pytorch/examples repo.

Your approach seems correct if you would like to get the accuracy across all ranks. In general, maybe it would be good to have something like a `DistributedAverageMeter`. Although in most training scenarios I’ve come across, it’s enough to evaluate the model on only one rank during training and then on the entire dataset during eval. This of course assumes the data is shuffled properly and one rank doesn’t get all the positive/negative samples, for example.

Regarding `DistributedSampler`:

Regarding `AverageMeter`:
The `acc_meter.avg` should indeed be the average accuracy across all updates, but it effectively divides by the batch size again. If you look at L340, `images.size(0)` is passed as the second argument. In the `update()` function at L376, `n` represents the number of values (the count), and the `sum` is divided by the `count` to compute the average. Thus, we end up dividing by the batch size twice: once in the `accuracy()` function, and once in the average meter. The second argument in L340 should be left out. But again, this assumes all batch sizes are the same (sometimes the last batch is smaller).
I’ve raised an issue on the repo, and I could open a pull request as well. The accuracy issue looks like a minor fix, but adding a `DistributedAverageMeter` may need some more work.
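
For reference, the rough shape I have in mind for such a meter is something like this (purely a sketch; the class name and interface are made up, not from pytorch/examples):

```python
import torch
import torch.distributed as dist

class DistributedAverageMeter:
    """AverageMeter that can synchronize its running sum/count across ranks."""

    def __init__(self, device):
        self.device = device
        self.reset()

    def reset(self):
        self.sum = 0.0
        self.count = 0.0

    def update(self, val, n=1):
        # accumulate local totals, e.g. val=num_correct, n=batch size
        self.sum += float(val)
        self.count += n

    def synchronize(self):
        # sum the local totals over all ranks so every process sees the
        # same global average afterwards
        t = torch.tensor([self.sum, self.count],
                         dtype=torch.float64, device=self.device)
        dist.all_reduce(t)
        self.sum, self.count = t.tolist()

    @property
    def avg(self):
        return self.sum / max(self.count, 1.0)
```

Each rank would call `update()` with its local `num_correct` and batch size every iteration, then `synchronize()` once right before logging or checkpointing on rank 0.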