What are the best practices for logging in distributed training?

Did some googling and found very few discussions on this matter. Is it best to perform an all-reduce on, say, the loss values and keep track of them within the rank 0 process, as the official tutorial recommends for checkpoints?
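
Something like the sketch below is what I have in mind (the log_loss helper and names are just mine for illustration; it assumes the process group has already been initialized by the DDP launcher):

```python
import torch
import torch.distributed as dist

def log_loss(loss: torch.Tensor, step: int) -> None:
    # Average the per-process loss across all ranks.
    loss = loss.detach().clone()
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    loss /= dist.get_world_size()
    # Only rank 0 actually writes the log line.
    if dist.get_rank() == 0:
        print(f"step {step}: loss {loss.item():.4f}")
```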

That seems to be the case; check out how they do it in the Mask-RCNN reference implementation. They use reduce instead of all_reduce because, for logging, you only need the reduced and averaged values on the rank 0 process.
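
A minimal sketch of that reduce-based pattern (assuming the process group is already initialized; the helper name is illustrative, not taken from the Mask-RCNN code):

```python
import torch
import torch.distributed as dist

def log_loss_rank0(loss: torch.Tensor, step: int) -> None:
    loss = loss.detach().clone()
    # reduce delivers the summed result only to dst=0; the other ranks
    # keep their local buffer unchanged and simply skip logging.
    dist.reduce(loss, dst=0, op=dist.ReduceOp.SUM)
    if dist.get_rank() == 0:
        loss /= dist.get_world_size()
        print(f"step {step}: loss {loss.item():.4f}")
```

Compared with all_reduce, this avoids broadcasting the averaged value back to every rank, which none of the non-logging processes need.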
