How to fix randomness of dataloader in DDP?

Hi! I have a question about DistributedSampler.
I’m using DDP and I want my data loader to generate exactly the same data pack for every training run (but, of course, different data on each GPU).
However, even though I have set the seed and shuffle arguments of DistributedSampler, the output data pack is not the same across runs. What’s the right way to fix the randomness of the data loader?

    train_loader = DataLoader(
        train_dset,
        batch_size=1,  
        shuffle=False,
        num_workers=8,
        pin_memory=True,
        sampler=DistributedSampler(train_dset, shuffle=True, seed=rank),
    )
    for batch_idx, data_pack in enumerate(train_loader):
        print(rank, batch_idx, data_pack)

cc @ejguan @nivek for dataloader question.

You should set the same seed for the DistributedSampler on every rank to get deterministic shuffling behavior.
See the documentation for the seed argument: torch.utils.data — PyTorch 1.13 documentation
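
For example, here is a minimal sketch, assuming the same train_dset and rank as in your snippet (the seed value 42 and num_epochs are placeholders): pass an identical seed on every rank and call set_epoch each epoch; the per-rank split comes from the rank, not from the seed.

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    sampler = DistributedSampler(train_dset, shuffle=True, seed=42)  # same seed on every rank
    train_loader = DataLoader(
        train_dset,
        batch_size=1,
        num_workers=8,
        pin_memory=True,
        sampler=sampler,
    )

    for epoch in range(num_epochs):
        # set_epoch keeps the shuffle identical across runs but different across epochs
        sampler.set_epoch(epoch)
        for batch_idx, data_pack in enumerate(train_loader):
            print(rank, batch_idx, data_pack)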


Thanks for your reply! I have set the DistributedSampler as below:

I found that the input index passed to __getitem__ is now fixed, but the output color is not. The reason is that I call np.random inside __getitem__.
I tried to fix this NumPy randomness by setting the seed of numpy.random. After multiple attempts I found that:

  1. Setting numpy.random.seed inside __getitem__: works; both the input indices and the output data are fixed (see the sketch after this list).
  2. Setting numpy.random.seed right after mp.spawn(): does not work; the input indices are fixed but the output data is not. I suspect it is because I call numpy.random to initialize my network before data sampling, but that should not break the seeding, so I don’t know the reason.
  3. Setting numpy.random.seed before mp.spawn(): does not work; the input indices are fixed but the output data is not.
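
For reference, a minimal, hypothetical sketch of attempt 1 (the dataset name and fields are placeholders, not the original code): reseeding NumPy inside __getitem__ ties the random color to the sample index, which is why both the indices and the data become fixed, at the cost of producing the same augmentation every epoch.

    import numpy as np
    from torch.utils.data import Dataset

    class RandomColorDataset(Dataset):  # hypothetical stand-in for train_dset
        def __init__(self, length=100):
            self.length = length

        def __len__(self):
            return self.length

        def __getitem__(self, index):
            # Attempt 1: reseed NumPy per sample so the sampled color depends
            # only on the index, not on which worker or run produced it.
            np.random.seed(index)
            color = np.random.uniform(0.0, 1.0, size=3)
            return {"index": index, "color": color}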

So does this mean I should never use numpy.random and only use torch.random in DDP? It seems strange. Thanks!

Check whether you need to re-seed NumPy in the worker_init_fn, as described here.
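
For example, a minimal sketch building on the loader above (train_dset and sampler are the objects from the earlier snippets): re-seed NumPy in each worker via worker_init_fn so that np.random calls inside __getitem__ also become reproducible.

    import numpy as np
    import torch
    from torch.utils.data import DataLoader

    def seed_worker(worker_id):
        # In a worker process, torch.initial_seed() is already derived from the
        # loader's base seed and the worker id, so reusing it keeps workers
        # distinct from each other while staying reproducible across runs.
        worker_seed = torch.initial_seed() % 2**32
        np.random.seed(worker_seed)

    g = torch.Generator()
    g.manual_seed(0)  # fixes the base seed used to derive per-worker seeds

    train_loader = DataLoader(
        train_dset,
        batch_size=1,
        num_workers=8,
        pin_memory=True,
        sampler=sampler,            # DistributedSampler with the same seed on every rank
        worker_init_fn=seed_worker,
        generator=g,
    )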