How to calculate train accuracy with DDP

I have read some tutorials about DistributedDataParallel, but I couldn't find out how to correctly calculate the training loss and accuracy after one epoch.

With DataParallel, we can easily calculate the loss and accuracy since there is only one process. But with DDP, every GPU runs its own process and trains on its own shard of the data. The problems are:

  1. How do we evaluate the training accuracy correctly?
  2. I follow the example here: ImageNet Example
    Does the code redundantly calculate the same test accuracy across multiple GPUs? If so, is there any way to sample the testloader just like the trainloader and avoid the repeated computation?

Finally figured it out. It cannot be done trivially: each rank only sees its own statistics, so we need to gather the values from the other processes and combine them.

Yes. I use an all-reduce function, something like this:

import torch
import torch.distributed as dist

def global_meters_all_avg(args, *meters):
    """meters: scalar values of loss/accuracy calculated in each rank"""
    tensors = [torch.tensor(meter, device=args.gpu, dtype=torch.float32) for meter in meters]
    for tensor in tensors:
        # all_reduce sums each tensor across all ranks, in place
        dist.all_reduce(tensor)

    return [(tensor / args.world_size).item() for tensor in tensors]
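As a sanity check, the same all-reduce-then-divide pattern can be exercised in a single-process group on CPU. The gloo backend and the temporary rendezvous file here are just for the demo; in a real DDP job each rank runs this with its own rank and world_size:

```python
import os
import tempfile

import torch
import torch.distributed as dist

# Single-process demo of the all-reduce averaging pattern (gloo = CPU backend).
rdzv = os.path.join(tempfile.mkdtemp(), "rdzv")  # throwaway rendezvous file
dist.init_process_group(
    backend="gloo",
    init_method=f"file://{rdzv}",
    world_size=1,
    rank=0,
)

loss, acc = 0.5, 0.9  # the per-rank scalars each process has computed
tensors = [torch.tensor(v, dtype=torch.float32) for v in (loss, acc)]
for t in tensors:
    dist.all_reduce(t)  # default op is SUM: adds the value from every rank
avg = [(t / dist.get_world_size()).item() for t in tensors]
# with world_size == 1 the averages equal the inputs

dist.destroy_process_group()
```

One caveat: this averages the per-rank averages, which is only exact when every rank processed the same number of samples. If the shards can be uneven, all-reduce the running sums and sample counts separately and divide afterwards.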

For your questions:

  1. Use the all-reduce method to communicate across processes.
  2. Yes. And if you want to run evaluation in a distributed way too, just follow how the example handles the training data, e.g. create a test_sampler to distribute the test data across GPUs.
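A minimal sketch of that sharding, using a stand-in dataset in place of the real test set. num_replicas and rank are passed explicitly here only so the snippet runs outside an initialized process group; inside a DDP job you would omit them and DistributedSampler reads them from the process group:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Stand-in test set of 100 samples (replace with your real test dataset).
test_dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

# Each rank iterates a disjoint shard; shuffle=False keeps evaluation deterministic.
test_sampler = DistributedSampler(test_dataset, num_replicas=2, rank=0, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=32, sampler=test_sampler)

# rank 0 evaluates 50 of the 100 samples; rank 1 evaluates the other 50
```

Note that DistributedSampler pads the dataset so every rank gets an equal-sized shard, which can duplicate a few samples when the dataset size is not divisible by the world size; the per-rank results can then be combined with the all-reduce averaging above.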