For my project I would like to calculate the prediction gain on a batch, i.e. how much the loss on that batch improves after training on it. To do so I need to get the loss in eval mode both before and after the training step. See the pseudo code below.
1. loss_1 = model(batch) (in .eval() mode)
2. model(batch) (train mode)
3. loss_2 = model(batch) (in .eval() mode)
What would be the right way to keep this loss calculation synchronized across more than one GPU when using multiprocess distributed training?
Currently I am using torch.distributed.all_reduce() to collect the losses across the GPUs. The problem is that when I log the losses to a text file, I get two different loss pairs (with 2 GPUs) for the same iteration:
{"losses": "(1057.0538330078125, 1056.30419921875)", "iepoch": 1, "iiter": 6}
{"losses": "(1082.687744140625, 1082.0916748046875)", "iepoch": 1, "iiter": 6}
where losses contains (loss_1, loss_2).
Any suggestions as to why this would happen after calling all_reduce()? Since both processes receive the reduced result, I would expect the two logged lines to be identical.
Thank you!