Why do values become very large after dist.all_reduce?

>>> print(f'{rank=}, before reduce, {loss=}')
rank=0, before reduce, loss=0.004893303848803043
rank=1, before reduce, loss=0.008418125100433826
rank=5, before reduce, loss=0.022900601848959923
rank=4, before reduce, loss=0.033665977805630645
rank=6, before reduce, loss=0.05732813761557371
rank=7, before reduce, loss=0.006465559359639883
rank=2, before reduce, loss=0.01541353389620781
rank=3, before reduce, loss=0.035168059170246124

>>> dist.all_reduce(loss.div_(dist.get_world_size()))
>>> print(f'{rank=}, after reduce, {loss=}, {world_size=}')
rank=0, after reduce, loss=-8.541948720476135e+27, world_size=8
rank=4, after reduce, loss=0.011374264427650382, world_size=8
rank=5, after reduce, loss=-8.541948720476135e+27, world_size=8
rank=1, after reduce, loss=-8.541948720476135e+27, world_size=8
rank=6, after reduce, loss=0.011374264427650382, world_size=8
rank=2, after reduce, loss=-8.541948720476135e+27, world_size=8
rank=7, after reduce, loss=-8.541948720476135e+27, world_size=8
rank=3, after reduce, loss=-8.541948720476135e+27, world_size=8

That’s strange; the tensors should be identical on all ranks after all_reduce. Could you please post some more details about your setup (e.g., how you are launching torch.distributed and which backend you are using)?
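
For reference, below is a minimal self-contained sketch of the same averaging pattern. It assumes a single machine, the gloo backend, and an mp.spawn launch (all assumptions, since your setup is unknown). When things are working, every rank should print the same averaged loss after all_reduce, and dist.get_backend() shows which backend is in use:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Hypothetical single-machine rendezvous; the actual launch method is unknown.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Each rank starts from a different scalar loss, as in the logs above.
    loss = torch.rand(1)
    print(f'{rank=}, before reduce, {loss=}')

    # Divide by world_size first, then SUM across ranks: every rank
    # should end up holding the same average.
    dist.all_reduce(loss.div_(dist.get_world_size()))
    print(f'{rank=}, after reduce, {loss=}, backend={dist.get_backend()}')

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 8
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

If a repro like this prints identical values on all ranks while your actual script does not, the launch method and backend are good places to start comparing.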