Tensorboard for Distributed Training using DistributedDataParallel

Hello,

I have seen some questions related to using TensorBoard with DistributedDataParallel (DDP) on the forum, but I haven't found a definitive answer to my question.

For instance, I wish to log loss values to TensorBoard. Without DDP, my code for logging a loss value looks like this:

loss_writer.add_scalar('Overall_loss', overall_loss.item(), total_iter)

where loss_writer = torch.utils.tensorboard.SummaryWriter(loss_dir).
Now, with DDP, I followed the recommendation here and wrapped the logging call in a rank check:

if args.rank == 0:
    loss_writer.add_scalar('Overall_loss', overall_loss.item(), total_iter)

However, when doing so, I obtain something that looks like this:

[screenshot Selection_706: a jagged, non-smooth loss curve in TensorBoard]

which has also been brought up here.

Any idea how to obtain a smooth loss graph, as in single-GPU training?
Thanks!

I don’t have an immediate answer. Are you also making sure that SummaryWriter is only instantiated by rank 0 (as suggested by the link you referred to in your question)?
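
In case it is useful, here is a minimal sketch of the pattern I have in mind (assuming the process group has already been initialized, so torch.distributed.get_rank() is available; loss_dir, overall_loss, and total_iter are your names from above):

import torch.distributed as dist
from torch.utils.tensorboard import SummaryWriter

# Instantiate the writer on rank 0 only; every other rank keeps None
loss_writer = SummaryWriter(loss_dir) if dist.get_rank() == 0 else None

# ... inside the training loop ...
if loss_writer is not None:
    loss_writer.add_scalar('Overall_loss', overall_loss.item(), total_iter)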


@cbalioglu Hello, thank you for your answer!
Unfortunately, my SummaryWriter is indeed instantiated only by rank 0.