DataParallel yields different results across runs on the same machine

Hi, I am using nn.DataParallel for multi-GPU training on a machine with 4 A100 GPUs. I get different results (e.g., MAE, RMSE) each time I run the same code on the same machine. However, the results are exactly the same across runs when I use only one GPU (by setting the number of visible GPUs to 1). Does anyone have an idea why the results differ when using nn.DataParallel? Thank you.

Did you already set all the deterministic flags and seed the code as described in the Reproducibility docs?

Yes, I think so. Here is what I did in my code, let me know if I missed anything.

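import os
import torch

def set_seed(seed):
    os.environ['PYTHONHASHSEED'] = str(seed)   # only affects Python processes started after this point
    torch.backends.cudnn.benchmark = False     # disable the cuDNN autotuner
    torch.backends.cudnn.deterministic = True  # ask cuDNN for deterministic algorithms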
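
For comparison, here is a rough sketch of a fuller seeding helper along the lines of the Reproducibility docs. The name `set_seed_full` is made up for this example, and it is not claimed to fix the multi-GPU case; it only covers the RNG seeds and flags that the function above does not touch.

```python
import os
import random

import torch


def set_seed_full(seed: int) -> None:
    # Hypothetical helper name; not part of PyTorch.
    # Only affects Python processes launched after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # cuBLAS needs this on CUDA >= 10.2 for deterministic results;
    # it has to be set before the first cuBLAS call.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    random.seed(seed)                 # Python's own RNG
    torch.manual_seed(seed)           # PyTorch RNG
    torch.cuda.manual_seed_all(seed)  # all visible GPUs (ignored without CUDA)

    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    # Raise an error when an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True)


set_seed_full(0)
first = torch.rand(3)
set_seed_full(0)
second = torch.rand(3)
print(torch.equal(first, second))
```

If NumPy is used anywhere in the pipeline (e.g., for data augmentation), np.random.seed(seed) would need to be called as well.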


To clarify further: when I set os.environ['CUDA_VISIBLE_DEVICES'] = '0' (training on a single GPU), I can reproduce the results across runs. However, when I set os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3' (training with 4 GPUs), I get different results on different runs.
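
For reference, a minimal sketch of switching between the two setups. The variable has to be set before CUDA is initialized in the process, so it is safest to do it at the very top of the script, before importing torch. `GPU_IDS` is just a placeholder name for this example.

```python
import os

# Placeholder: "0" for the single-GPU run, "0,1,2,3" for the 4-GPU run.
GPU_IDS = "0,1,2,3"

# Must happen before CUDA is initialized in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = GPU_IDS

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(f"{len(visible)} GPU(s) requested: {visible}")
```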

DataParallel involves communication and synchronization through NCCL collectives, and as far as I know, NCCL collectives may not be deterministic. See "how to make NCCL deterministic for different size for the same thing?" (Issue #157, NVIDIA/nccl on GitHub).
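
One general reason a cross-GPU reduction can vary, offered as a possible explanation rather than a confirmed diagnosis for this case: floating-point addition is not associative, so summing the same partial results in a different order can change the last bits. The toy example below uses plain Python floats as stand-ins for per-GPU partial sums; the values and the `ordered_sum` helper are made up for illustration.

```python
def ordered_sum(values, order):
    # Accumulate the values in the given index order.
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Hypothetical stand-ins for partial results coming from three devices.
partials = [0.1, 0.2, 0.3]

forward = ordered_sum(partials, [0, 1, 2])
backward = ordered_sum(partials, [2, 1, 0])

print(forward == backward, forward, backward)
```

The two sums differ only in the last bit, but over many training steps such tiny differences can compound into visibly different metrics.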

Any suggestions on how to solve this problem? And will DDP also encounter this problem? Thank you.