Hi, I am using nn.DataParallel for multi-GPU training on a machine with 4 A100 GPUs. I get different results (e.g., MAE, RMSE) across runs of the same code on the same machine. However, my code yields exactly the same results when I use only one GPU (by restricting the number of visible GPUs to 1). Does anyone have an idea why I get different results when using nn.DataParallel? Thank you.
Did you already set all the deterministic flags and seed the code as described in the Reproducibility docs?
Yes, I think so. Here is what I do in my code; let me know if I missed anything.
import os
import random

import numpy as np
import torch

def set_seed(seed):
    # Seed every RNG the training loop might touch
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # seed all visible GPUs
    # Disable cuDNN autotuning and force deterministic kernels
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

set_seed(args.seed)
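One thing the snippet above does not cover: recent PyTorch versions expose torch.use_deterministic_algorithms, which raises an error on any op that lacks a deterministic implementation, and deterministic cuBLAS matmuls additionally require the CUBLAS_WORKSPACE_CONFIG environment variable, per the Reproducibility docs. A minimal sketch of an extended seeding helper (the seed value 42 is just an example):

```python
import os
import random

import numpy as np
import torch


def set_seed(seed):
    # Must be set before the first cuBLAS call for deterministic matmuls
    os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds all CUDA devices in recent PyTorch
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    # Error out on any nondeterministic op instead of silently
    # producing run-to-run differences
    torch.use_deterministic_algorithms(True)


set_seed(42)
```

Note that DataLoader workers keep their own RNG state, so if you use num_workers > 0 you may also need to seed them via worker_init_fn and a seeded torch.Generator.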
To clarify further: when I set os.environ['CUDA_VISIBLE_DEVICES'] = '0' (training on a single GPU), I can reproduce the results across runs. However, when I set os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3' (training with 4 GPUs), different runs produce different results.
DataParallel involves communication and synchronization across GPUs, and as far as I know the collectives involved may not be deterministic: how to make NCCL deterministic for different size for the same thing? · Issue #157 · NVIDIA/nccl · GitHub
Any suggestions on how to solve this problem? And will DDP run into the same issue? Thank you.