Reproducibility when using DDP in multi-node, multi-GPU settings

I have a script that is set to be deterministic using the following lines:

seed = 0

I don’t use any non-deterministic algorithms. I use 5 machines with 8 GPUs each, and I run DDP with the NCCL backend, one process per GPU. To check determinism, I always look at the loss value from the first forward pass. For the same script, the first run gave one value and every run after it gave a different one. For example, the first run gave a loss of 1.52, but runs 2, 3, 4, 5, and so on all give 1.45, and that value is now constant. I checked all my conda envs and did not see any changes in the libraries. Is this normal? Is determinism possible with DDP? What could cause the first run to give a different loss value? I have read on this forum that I might need to fix the NCCL rings, but I do not know how!

I have also checked the data stream and it hasn’t changed between runs from the beginning!

There can certainly be additional non-determinism at the NCCL level. This GitHub issue has some discussion on it: "how to avoid the precision loss(float32) caused by the gradient accumulation of Ring Allreduce in the case of ddp" (pytorch/pytorch, Issue #48576).
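To make the point about accumulation order concrete, here is a minimal pure-Python sketch (not NCCL code). It assumes three hypothetical per-rank gradient values and a made-up helper, ring_style_sum, that adds them starting from a different position in the ring. Python floats are float64, but float32 rounds the same way, only with fewer bits.

```python
def ring_style_sum(values, start):
    """Add the values in ring order, beginning at position `start`."""
    total = 0.0
    for step in range(len(values)):
        total += values[(start + step) % len(values)]
    return total


# Hypothetical per-rank gradient contributions for a single parameter.
grads = [0.1, 0.2, 0.3]

sum_from_rank0 = ring_style_sum(grads, 0)  # (0.1 + 0.2) + 0.3
sum_from_rank1 = ring_style_sum(grads, 1)  # (0.2 + 0.3) + 0.1

# float.hex() shows the exact bit pattern, so a last-bit difference is visible.
print(sum_from_rank0.hex(), sum_from_rank1.hex())
print("bitwise equal:", sum_from_rank0 == sum_from_rank1)
```

The two sums differ only in the last bit, but that is enough for a loss value to stop matching exactly if the order of the reduction changes between runs.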

Essentially, you can try using the NCCL_RINGS environment variable to fix the rings. Depending on your compute setup, if the network topology changes between runs, NCCL may also choose a different allreduce configuration. I would suggest reaching out to the NCCL team to see how it can be made completely deterministic.
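As a rough sketch of where such a variable would be set: NCCL reads its environment variables when the communicator is created, so they need to be in place in every process before the process group is initialized. The helper name pin_nccl_env and the ring_spec argument below are hypothetical, and the value format for NCCL_RINGS is not covered in this thread, so it is left as a placeholder to be filled in from the NCCL side. NCCL_DEBUG=INFO is added so that the topology NCCL reports can be compared between runs.

```python
import os


def pin_nccl_env(ring_spec=None, env=None):
    """Set NCCL-related variables; call before the process group is created.

    `ring_spec` is a placeholder for an NCCL_RINGS value (format not shown here).
    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    # Keep an existing NCCL_DEBUG setting, otherwise ask NCCL to log its setup.
    env.setdefault("NCCL_DEBUG", "INFO")
    if ring_spec is not None:
        env["NCCL_RINGS"] = ring_spec
    # Return the NCCL_* settings so they can be logged alongside the first loss.
    return {key: value for key, value in env.items() if key.startswith("NCCL_")}


# Usage (in every process, before torch.distributed.init_process_group):
#   settings = pin_nccl_env(ring_spec="<your ring layout>")
#   print(settings)
```

Logging the returned settings together with the first loss value makes it easier to tell whether two runs that disagree were actually started with the same NCCL configuration.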