I have read some tutorials about Distributed Data Parallel (DDP); however, I could not figure out how to correctly calculate the training loss and accuracy after one epoch.

With DataParallel, we can easily calculate loss and accuracy since there is only one process. But with DDP, every GPU runs its own process and trains on its own shard of the data. The problem is:

How to evaluate the training accuracy correctly?

I follow the example here. ImageNet Example
Does the code redundantly calculate the same test accuracy across multiple GPUs? If so, is there any way to sample the test loader just like the train loader and avoid repeating the computation?

Yes. I use an all-reduce function, something like this:

import torch
import torch.distributed as dist

def global_meters_all_avg(args, *meters):
    """meters: scalar values of loss/accuracy calculated in each rank"""
    tensors = [torch.tensor(meter, device=args.gpu, dtype=torch.float32) for meter in meters]
    for tensor in tensors:
        # each tensor is all-reduced (summed across all ranks) in place
        dist.all_reduce(tensor)
    return [(tensor / args.world_size).item() for tensor in tensors]
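A minimal runnable sketch of how a helper like this might be called, using a single-process gloo group on CPU for illustration (the explicit device and world_size arguments replace the args object here, and the metric values are made up):

```python
import os
import torch
import torch.distributed as dist

# Single-process group on CPU just so the sketch runs standalone;
# in real DDP training, world_size > 1 and each rank passes the
# metric values it computed on its own data shard.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

def global_meters_all_avg(device, world_size, *meters):
    """meters: scalar loss/accuracy values computed on each rank"""
    tensors = [torch.tensor(m, device=device, dtype=torch.float32) for m in meters]
    for t in tensors:
        dist.all_reduce(t)  # in-place sum across all ranks
    return [(t / world_size).item() for t in tensors]

# every rank calls this after the epoch with its local averages
avg_loss, avg_acc = global_meters_all_avg("cpu", dist.get_world_size(), 0.42, 0.91)
print(avg_loss, avg_acc)

dist.destroy_process_group()
```

Note that this plain average is only exact when every rank processed the same number of samples; otherwise you would all-reduce the per-rank sums and sample counts separately and divide afterwards.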

Yes. And if you want to conduct evaluation in a distributed way, just follow how the example handles the training data, e.g., create a test_sampler to distribute the data across GPUs.
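A short sketch of that idea, assuming a stand-in dataset (the real ImageNet example would use its validation dataset instead). DistributedSampler normally infers num_replicas and rank from the initialized process group; they are passed explicitly here so the sketch runs standalone:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Hypothetical stand-in for the validation set: 100 samples.
test_dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

# Pretend we are rank 0 of 2; each rank then iterates only its shard.
test_sampler = DistributedSampler(test_dataset, num_replicas=2, rank=0, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=10, sampler=test_sampler)

# ceil(100 / 2) = 50 samples land on this rank
total = sum(x.size(0) for x, _ in test_loader)
print(total)  # 50
```

Each rank then computes metrics on its shard only, and you combine the per-rank results with an all-reduce as in the answer above. Beware that DistributedSampler pads the dataset so every rank gets the same number of samples, which can slightly skew metrics when the dataset size is not divisible by the world size.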