I want to use model parallelism and data parallelism at the same time, and I have read many docs and tutorials on the official website.
One thing that confuses me is how to collect the various meter values (loss, accuracy, etc.) across processes.
Question 1: The official tutorial simply records the meter values in each process.
But when I print the loss in each process in my code, the values are different, so I assume the other meters differ as well.
Is the tutorial wrong? I think the right way is to synchronize the loss, accuracy and other meters first so that all processes hold the same values, and then print the meter information from a single process.
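To make the idea concrete, here is a minimal sketch of the synchronization I have in mind. It assumes the process group has already been initialized elsewhere (for example with `dist.init_process_group`). The helper names `reduce_meters` and `is_main_process` are my own, not part of PyTorch. It packs all scalar meters into one tensor, sums them across ranks with `dist.all_reduce`, and divides by the world size.

```python
import torch
import torch.distributed as dist


def reduce_meters(meters, device="cpu"):
    """Average a dict of scalar meter values across all ranks.

    Hypothetical helper. Returns the local values unchanged when no
    process group is initialized (e.g. a single-process debug run).
    With the NCCL backend, `device` would need to be this rank's GPU.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return dict(meters)

    names = sorted(meters)  # same key order on every rank
    packed = torch.tensor([float(meters[n]) for n in names], device=device)
    dist.all_reduce(packed, op=dist.ReduceOp.SUM)  # one collective for all meters
    packed /= dist.get_world_size()  # sum -> mean
    return dict(zip(names, packed.tolist()))


def is_main_process():
    """True on rank 0, or when running without a process group."""
    if not (dist.is_available() and dist.is_initialized()):
        return True
    return dist.get_rank() == 0
```

In the training loop, every rank would call `reduce_meters({"loss": loss.item(), "acc": acc})` at the same step (it is a collective call, so no rank may skip it), and only the rank where `is_main_process()` is true would print the result. This only changes what gets logged; it does not touch the gradients.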
Question 2: The official tutorial says ‘the DistributedDataParallel module also handles the averaging of gradients across the world, so we do not have to explicitly average the gradients in the training step’.
But given Question 1, does the API actually work as the tutorial says? Each process has a different loss value, so even though they all start from the same initial weights, will the model weights in each process be optimized in different directions?
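To check my understanding of what the tutorial's claim would imply, here is a toy pure-Python simulation (no PyTorch involved). The data, the one-parameter model `y = w * x`, and the function names `local_loss_and_grad` and `simulate` are all made up for illustration. Each simulated rank computes its loss and gradient on its own shard, the gradients are averaged (a stand-in for the all-reduce that DistributedDataParallel is described as doing), and every rank applies the same averaged gradient.

```python
def local_loss_and_grad(w, shard):
    """Mean squared error of y = w * x on one shard, and d(loss)/dw."""
    n = len(shard)
    loss = sum((w * x - y) ** 2 for x, y in shard) / n
    grad = sum(2 * (w * x - y) * x for x, y in shard) / n
    return loss, grad


def simulate(steps=5, lr=0.01):
    # Hypothetical data: each simulated rank sees a different shard.
    shards = [
        [(1.0, 2.0), (2.0, 4.1)],  # rank 0
        [(3.0, 5.9), (4.0, 8.2)],  # rank 1
    ]
    weights = [0.0, 0.0]  # identical initial weight on every rank
    history = []
    for _ in range(steps):
        results = [local_loss_and_grad(w, s) for w, s in zip(weights, shards)]
        losses = [r[0] for r in results]
        # Stand-in for gradient averaging across ranks.
        avg_grad = sum(r[1] for r in results) / len(results)
        # Every rank applies the same averaged gradient (plain SGD step).
        weights = [w - lr * avg_grad for w in weights]
        history.append((losses, list(weights)))
    return history


if __name__ == "__main__":
    for step, (losses, weights) in enumerate(simulate()):
        print(step, losses, weights)
```

In this toy setup the per-rank losses differ at every step because the shards differ, yet the weights stay identical across ranks because each update uses the same averaged gradient. If DistributedDataParallel behaves as the tutorial describes, different printed losses alone would not mean the replicas are diverging.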