I have a script that is set to be deterministic using the following lines:
import random
import numpy as np
import torch

seed = 0
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)
I don't use any non-deterministic algorithms. I run on 5 machines with 8 GPUs each, using DDP with the NCCL backend and one process per GPU. To check determinism, I always compare the loss value of the first forward pass. With the same script, the very first run gives a different value than every run after it: for example, the first run gives a loss of 1.52, but runs 2, 3, 4, 5, etc. all give a constant 1.45. I checked my conda envs and none of the libraries changed between runs. Is this normal? Is determinism possible with DDP? What could cause the first run to give a different loss value? I have read on this forum that I might need to fix the NCCL rings, but I don't know how!
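For reference, here is the fuller seeding setup I am considering switching to. This is just a sketch, not what my script currently does; the `cudnn` flags and the `CUBLAS_WORKSPACE_CONFIG` variable are my assumptions about what might matter, taken from the reproducibility notes:

```python
import os
import random

import numpy as np


def seed_everything(seed: int = 0) -> None:
    """Seed every RNG I know about; intended to be called identically on every rank."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch

        torch.manual_seed(seed)             # CPU RNG (and current CUDA device)
        torch.cuda.manual_seed_all(seed)    # all GPUs visible to this process
        # Ask PyTorch to error out if an op only has a non-deterministic kernel.
        torch.use_deterministic_algorithms(True)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        # Assumption: some cuBLAS ops need this when deterministic mode is on.
        os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    except ImportError:
        pass  # guard so this sketch also runs where torch is not installed


# Sanity check: reseeding reproduces the same draws within one process.
seed_everything(0)
first = np.random.rand(3)
seed_everything(0)
second = np.random.rand(3)
assert (first == second).all()
```

I understand this only guarantees that each process draws the same random numbers given the same seed; it says nothing about the order of collective operations across the 40 ranks.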
I have also checked the data stream, and it has been identical across all runs from the beginning!
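The only pointer I have found about pinning NCCL behaviour is via environment variables, set before launching the processes. I am not sure these are the right ones, so treat this as my guess at what "fixing the rings" might mean:

```shell
# Assumption: force ring-based collectives so NCCL cannot pick a different
# algorithm (e.g. tree) on different runs.
export NCCL_ALGO=Ring
# Assumption: pin the protocol as well, to rule out protocol-level variation.
export NCCL_PROTO=Simple
# Log topology and ring construction so the output can be diffed across runs.
export NCCL_DEBUG=INFO
```

Would comparing the `NCCL_DEBUG=INFO` output of run 1 against run 2 be a sensible way to see whether the rings actually changed?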