Distributed training: synchronized loss calculation in eval mode before and after training on a minibatch

For my project I would like to calculate the prediction gain on a batch, i.e. compare the loss before and after training on that batch. To do this I need to compute the loss in eval mode both before and after the training step. See the pseudocode below and the sketch that follows it.

1. loss_1 = model(batch) (in .eval() mode)
2. train on batch (in .train() mode, i.e. forward, backward, optimizer step)
3. loss_2 = model(batch) (in .eval() mode)
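
In code, the pattern I have in mind looks roughly like this (a minimal sketch; `criterion`, `optimizer`, `batch` and `target` are placeholders for my actual objects, and `model` is wrapped in `DistributedDataParallel`):

```python
import torch

def prediction_gain_step(model, criterion, optimizer, batch, target):
    # 1. loss before the update: eval mode, no gradients
    model.eval()
    with torch.no_grad():
        loss_1 = criterion(model(batch), target)

    # 2. one training step on the same batch
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(batch), target)
    loss.backward()   # DDP averages gradients across ranks here
    optimizer.step()

    # 3. loss after the update, again in eval mode
    model.eval()
    with torch.no_grad():
        loss_2 = criterion(model(batch), target)

    return loss_1, loss_2
```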

What is the right way to keep this loss calculation synchronized across more than one GPU when using multi-process distributed training?

Currently I am using torch.distributed.all_reduce() to collect the losses across the GPUs, roughly as in the sketch after the log excerpt below. The problem is that when I log the losses to a text file, I get two different values (with 2 GPUs) for the same iteration:

{"losses": "(1057.0538330078125, 1056.30419921875)", "iepoch": 1, "iiter": 6}
{"losses": "(1082.687744140625, 1082.0916748046875)", "iepoch": 1, "iiter": 6}

where each "losses" tuple contains loss_1 and loss_2.
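
For reference, the aggregation and logging step looks roughly like this (simplified; the function name and the exact file handling are illustrative, not my real code, but the all_reduce and the per-rank logging are the same):

```python
import json
import torch
import torch.distributed as dist

def reduce_and_log(loss_1, loss_2, iepoch, iiter, logfile):
    losses = torch.stack([loss_1.detach(), loss_2.detach()])
    # sum the per-rank losses, then average over the number of processes
    dist.all_reduce(losses, op=dist.ReduceOp.SUM)
    losses /= dist.get_world_size()

    # at the moment every rank appends its own line to the log file
    with open(logfile, "a") as f:
        f.write(json.dumps({
            "losses": str((losses[0].item(), losses[1].item())),
            "iepoch": iepoch,
            "iiter": iiter,
        }) + "\n")
```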

Any suggestions as to why this happens even after calling .all_reduce()? Since both ranks log the reduced losses, I would expect the two log lines to be identical.

Thank you!