How to calculate meters in PyTorch 1.1 & DistributedDataParallel()?

I want to use model parallelism and data parallelism at the same time, and I have read many docs and tutorials on the official website.
One problem that confuses me is how to collect the various meter values (loss, accuracy, etc.) from each process.

Question 1: In the official tutorial, the meter values are simply recorded in each process.
But in my code, when I print the loss value in each process, the values are different, so I assume the other meters differ as well.
Is that tutorial wrong? In my opinion, the right way would be to synchronize the loss, accuracy, and other meters first, so that all processes hold the same values, and then print the meter information from a single process.

Question 2: The official tutorial says that "the DistributedDataParallel module also handles the averaging of gradients across the world, so we do not have to explicitly average the gradients in the training step".
But, given Question 1, does the API actually work as the tutorial says? Since each process has a different loss value, even though they all start from the same initial weights, will the model weights in each process be optimized in different directions?

Hi @StuChen,

  1. The losses are different because different processes see different inputs with different labels and therefore produce different losses. If you want a global loss, top1, or top5, you can use the collective primitives from torch.distributed to average them across processes (see the sketch after this list).
  2. The model in each process will be optimized in the same way, because DistributedDataParallel averages the gradients across processes during loss.backward(), so the resulting gradients are identical everywhere. In combination with the initial weights being identical, the weights after each optimizer step are also identical.
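
For point 1, here is a minimal sketch of averaging meter values with an all-reduce, assuming the default process group has already been initialized via dist.init_process_group; the helper name average_meters and the variable names in the usage comment are just illustrative, not part of any official API:

```python
import torch
import torch.distributed as dist

def average_meters(values, device):
    """Average a dict of scalar meter values across all processes.

    Every process must call this with the same keys in the same order,
    since all_reduce is a collective operation.
    """
    world_size = dist.get_world_size()
    averaged = {}
    for name, value in values.items():
        tensor = torch.tensor(float(value), device=device)
        # Sum the value from every process, then divide by the world size.
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        averaged[name] = (tensor / world_size).item()
    return averaged

# Example usage inside the training loop (hypothetical meter names):
# metrics = average_meters({"loss": loss.item(), "top1": top1.avg}, device)
# if dist.get_rank() == 0:
#     print(metrics)
```

Note that this synchronization is only needed for logging; the gradients themselves are already averaged by DistributedDataParallel, so the training step does not need any extra all_reduce calls.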