What are the best practices for logging in distributed training?

Did some googling and found very few discussions on this matter. Is it best to perform an all-reduce on, say, the loss values and keep track of them within the rank 0 process, as the official tutorial recommends for checkpoints?
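
Something like the sketch below is what I have in mind (the log_loss helper and names are just mine for illustration; it assumes the process group has already been initialized by the DDP launcher):

```python
import torch
import torch.distributed as dist

def log_loss(loss: torch.Tensor, step: int) -> None:
    # Average the per-process loss across all ranks.
    loss = loss.detach().clone()
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    loss /= dist.get_world_size()
    # Only rank 0 actually writes the log line.
    if dist.get_rank() == 0:
        print(f"step {step}: loss {loss.item():.4f}")
```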

That seems to be the case; check out how they do it in the Mask-RCNN reference implementation. They use reduce instead of all_reduce because, for logging, you only need the reduced and averaged values on the rank 0 process.
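
A minimal sketch of that reduce-based pattern (assuming the process group is already initialized; the helper name is illustrative, not taken from the Mask-RCNN code):

```python
import torch
import torch.distributed as dist

def log_loss_rank0(loss: torch.Tensor, step: int) -> None:
    loss = loss.detach().clone()
    # reduce delivers the summed result only to dst=0; the other ranks
    # keep their local buffer unchanged and simply skip logging.
    dist.reduce(loss, dst=0, op=dist.ReduceOp.SUM)
    if dist.get_rank() == 0:
        loss /= dist.get_world_size()
        print(f"step {step}: loss {loss.item():.4f}")
```

Compared with all_reduce, this avoids broadcasting the averaged value back to every rank, which none of the non-logging processes need.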
