Why does changing the DDP backend from NCCL to Gloo introduce a significant difference in train loss (model accuracy)?

As described in [github issue].

I’ll echo what Natalia mentioned on the GH issue: it’s not guaranteed that the collectives (specifically allreduce) in NCCL and Gloo will produce exactly the same results. In fact, there can be slight run-to-run variations even within the NCCL backend alone if anything changes in the network, topology, ring configuration, etc.
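To make the underlying floating-point effect concrete, here's a minimal single-process sketch (plain PyTorch, no distributed setup needed) showing that summing the same values in different orders, which is effectively what different allreduce implementations do, can produce results that differ in the last bits:

```python
import torch

torch.manual_seed(0)
vals = torch.randn(100_000, dtype=torch.float32)

# Three summation orders over the same values, mimicking the different
# reduction orders that ring- vs. tree-style allreduce algorithms use.
sum_forward = vals.sum()
sum_reversed = vals.flip(0).sum()
sum_chunked = torch.stack([c.sum() for c in vals.chunk(8)]).sum()

print(sum_forward.item(), sum_reversed.item(), sum_chunked.item())
print((sum_forward == sum_reversed).item(), (sum_forward == sum_chunked).item())
# The three sums will typically differ in the least-significant bits,
# because float32 addition is not associative.
```

These per-step differences are tiny, but they compound over many training iterations, which is how a backend switch can show up as a visible gap in train loss.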

Both NCCL and Gloo support several different allreduce implementations, each of which can be further tuned through environment variables. These libraries choose an implementation based on the topology and other factors, so it is entirely possible that they choose different implementations/configurations, which leads to small differences in the allreduce results.
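For example, NCCL exposes `NCCL_ALGO` and `NCCL_PROTO` (documented NCCL environment variables; the exact values honored depend on the NCCL version, topology, and message size) that let you pin the algorithm choice if you want more comparable runs. Below is a hedged sketch of how one might set them around a simple allreduce, assuming the script is launched with `torchrun` so the rendezvous variables (`RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) are already set; the `TEST_BACKEND` variable is just a hypothetical knob for this demo:

```python
import os
import torch
import torch.distributed as dist

# Pin NCCL's algorithm/protocol choice; these must be set before the
# NCCL communicator is created (i.e. before the first NCCL collective).
os.environ["NCCL_ALGO"] = "Ring"     # e.g. "Ring" or "Tree"
os.environ["NCCL_PROTO"] = "Simple"  # e.g. "LL", "LL128", or "Simple"

def allreduce_once(backend: str) -> torch.Tensor:
    # Assumes launch via `torchrun --nproc_per_node=N this_script.py`.
    dist.init_process_group(backend=backend)
    t = torch.full((4,), float(dist.get_rank()))
    if backend == "nccl":
        # NCCL operates on CUDA tensors; Gloo reduces CPU tensors.
        t = t.cuda(dist.get_rank() % torch.cuda.device_count())
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # sum across all ranks
    result = t.clone()
    dist.destroy_process_group()
    return result

if __name__ == "__main__":
    # Run once per backend and compare the printed tensors bit-for-bit.
    print(allreduce_once(os.environ.get("TEST_BACKEND", "gloo")))
```

Even with the algorithm pinned, bitwise-identical results across NCCL and Gloo aren't guaranteed; the point is only to reduce the number of variables when comparing the two backends.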