Recommended practice for using TensorBoard in multi-GPU training

I’ve been using DDP for all my distributed training and would now like to use TensorBoard for visualization/logging. The only solution I can think of is to “gather” onto the rank 0 process each time I want to log an item to the board, since each process/GPU only sees a subset of the data and statistics. This is somewhat cumbersome, and I’m not sure whether it hurts distributed efficiency. I wonder:

  1. whether doing so indeed hurts distributed efficiency

  2. whether there are recommended practices for using TensorBoard in the DDP setting, and if so, what they are.

Thanks!
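For concreteness, here is a minimal sketch of the gather-to-rank-0 pattern described above. The `log_metrics` helper, the single-process `gloo` group, and the metric name are my own illustration, not an official API; a real job launched with `torchrun` would pass an actual `torch.utils.tensorboard.SummaryWriter` on rank 0 (it is left as `None` here so the sketch has no TensorBoard dependency):

```python
import os
import torch
import torch.distributed as dist

def log_metrics(writer, loss, step):
    # Average a per-rank scalar across all ranks, then log from rank 0 only.
    # `writer` would be a SummaryWriter on rank 0 and None on other ranks.
    t = loss.detach().clone()
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # sync point: every rank participates
    t /= dist.get_world_size()
    if dist.get_rank() == 0 and writer is not None:
        writer.add_scalar("train/loss", t.item(), step)
    return t.item()

def demo():
    # Single-process group just to make the sketch runnable; torchrun sets
    # RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT for real multi-GPU jobs.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29511")
    dist.init_process_group("gloo", rank=0, world_size=1)
    val = log_metrics(None, torch.tensor(0.5), step=0)
    dist.destroy_process_group()
    return val

if __name__ == "__main__":
    print(demo())  # 0.5 with a single rank
```

Note that every rank must call `log_metrics` at the same step, since `all_reduce` blocks until all ranks participate; calling it on rank 0 only would hang the job.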

@CDhere 1. all_gather involves a communication sync among all ranks, so it has overhead. Depending on your application’s latency, you can measure the ratio between the TensorBoard all_gather overhead and the total application latency, and use that measurement to decide how frequently to dump data to TensorBoard. 2. As far as I know, TensorBoard does not natively gather info from multiple ranks, so an all_gather done by the application is a good approach for now.
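The overhead-based frequency decision above can be made concrete with a small helper. The function name, the example timings, and the 1% budget are all my own assumptions; in practice you would measure `t_log` (one gather + `add_scalar` call) and `t_step` (one training step) with `time.perf_counter()` around the respective code (and `torch.cuda.synchronize()` before each timer on GPU):

```python
import math

def suggest_log_interval(t_log_s, t_step_s, budget=0.01):
    # Smallest logging interval k (in steps) such that the amortized
    # gather + logging overhead, t_log / (k * t_step), stays under `budget`.
    return max(1, math.ceil(t_log_s / (budget * t_step_s)))

if __name__ == "__main__":
    # Hypothetical measurements: 4 ms to gather + log, 100 ms per step.
    k = suggest_log_interval(0.004, 0.100)
    print(f"log every {k} steps")  # log every 4 steps, overhead stays under 1%
```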