Multiprocessing/distributed question regarding loss reporting

Hello there,

I’ve been following the ImageNet example (https://github.com/pytorch/examples/blob/e9e76722dad4f4569651a8d67ca1d10607db58f9/imagenet/main.py) to learn how to use multiprocessing, and I have a question about loss/statistics reporting.

To my understanding, the way the example is structured, every GPU on every node gets its own instance of the entire model in memory: a separate process is spawned per GPU, and each process executes the main_worker() method.
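For context, the per-GPU spawning follows roughly this pattern (a simplified single-node sketch; the init address and argument names are just illustrative):

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def main_worker(gpu, ngpus_per_node):
    # Each spawned process lands here with its own GPU index, joins the
    # process group, and then builds its own copy of the model.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:23456",  # placeholder address
        world_size=ngpus_per_node,            # single-node assumption
        rank=gpu,
    )
    torch.cuda.set_device(gpu)
    # ... build the model, wrap it in DistributedDataParallel, train ...

if __name__ == "__main__":
    ngpus_per_node = torch.cuda.device_count()
    # One process per GPU; each runs main_worker(gpu_index, ngpus_per_node).
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node,))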

The ImageNet example uses custom classes like AverageMeter and ProgressMeter to report progress during training/validation. However, from what I can tell, each process will have its own meters and its own progress to report.

Thus, if I were to run the example, I would get a separate progress report for each running process.

Instead of getting multiple progress reports through console logging, I would like to use SummaryWriter and TensorBoard to monitor the progress of my training.

Going through the documentation (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#save-and-load-checkpoints), I came across a section on saving and loading checkpoints that states:

When using DDP, one optimization is to save the model in only one process and then load it to all processes, reducing write overhead. This is correct because all processes start from the same parameters and gradients are synchronized in backward passes, and hence optimizers should keep setting parameters to the same values.
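Condensed, the pattern the tutorial describes looks roughly like this (CHECKPOINT_PATH is a placeholder, and ddp_model is assumed to be an already-constructed DistributedDataParallel module):

import torch
import torch.distributed as dist

CHECKPOINT_PATH = "/tmp/ddp_checkpoint.pt"  # placeholder path

rank = dist.get_rank()
if rank == 0:
    # Only one process writes; all replicas hold identical parameters anyway.
    torch.save(ddp_model.state_dict(), CHECKPOINT_PATH)

# Make sure rank 0 has finished writing before any rank tries to read.
dist.barrier()

# Map tensors saved from rank 0's device onto this process's own device.
map_location = {"cuda:0": f"cuda:{rank}"}
ddp_model.load_state_dict(torch.load(CHECKPOINT_PATH, map_location=map_location))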

That got me thinking that perhaps I could create a SummaryWriter only on the process with rank == 0 and use that writer alone to report statistics to TensorBoard. I’ve implemented it that way and it seems to be working.
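Concretely, the guard I have in mind looks roughly like this (the helper names and log directory are just illustrative):

from torch.utils.tensorboard import SummaryWriter

def make_writer(rank, log_dir="runs/ddp_experiment"):
    # Only rank 0 gets a real SummaryWriter; every other rank gets None,
    # so TensorBoard receives a single consistent event stream.
    return SummaryWriter(log_dir=log_dir) if rank == 0 else None

def log_scalar(writer, tag, value, step):
    # No-op on every rank except 0.
    if writer is not None:
        writer.add_scalar(tag, value, step)

# Inside main_worker(), after the process group is set up:
# writer = make_writer(rank)
# ...
# log_scalar(writer, "train/loss", losses.avg, epoch)
# ...
# if writer is not None:
#     writer.close()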

However, I’d like to ask whether my approach is correct. Is there a better or different way to do this?


Hey @mobius, I believe this is correct. With DDP, every backward() pass is a global synchronization point. So all processes will run the same number of iterations and hence should have the same number of progress steps. Therefore, reporting the progress in one rank should be sufficient. (more details link1, link2)

Thank you for the quick response @mrshenli! Great job on the distributed/multiprocessing part of PyTorch; it’s really intuitive. Your tutorials have also been super helpful! I was not aware of the paper you mentioned; I will definitely check it out!
