How to fix randomness of dataloader in DDP?

Hi! I have a question about DistributedSampler.
I’m using DDP and I want my data loader to generate exactly the same data pack for every training run (but, of course, different data on each GPU).
However, even though I have set the seed and shuffle arguments of DistributedSampler, the output data pack is not the same across runs. What’s the right way to fix the randomness of the data loader?

    train_loader = DataLoader(
        train_dset,
        batch_size=1,  
        shuffle=False,
        num_workers=8,
        pin_memory=True,
        sampler=DistributedSampler(train_dset, shuffle=True, seed=rank),
    )
    for batch_idx, data_pack in enumerate(train_loader):
        print(rank, batch_idx, data_pack)

cc @ejguan @nivek for dataloader question.

You should set the same seed for the DistributedSampler on every rank to get deterministic shuffling behavior.
See the documentation for the seed argument: torch.utils.data — PyTorch 1.13 documentation
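
For example, here is a minimal sketch, assuming the same train_dset and rank as in your snippet (the seed value 42 and num_epochs are placeholders): pass an identical seed on every rank and call set_epoch each epoch; the per-rank split comes from the rank, not from the seed.

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    sampler = DistributedSampler(train_dset, shuffle=True, seed=42)  # same seed on every rank
    train_loader = DataLoader(
        train_dset,
        batch_size=1,
        num_workers=8,
        pin_memory=True,
        sampler=sampler,
    )

    for epoch in range(num_epochs):
        # set_epoch keeps the shuffle identical across runs but different across epochs
        sampler.set_epoch(epoch)
        for batch_idx, data_pack in enumerate(train_loader):
            print(rank, batch_idx, data_pack)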


Thanks for your reply! I have set the DistributedSampler as below:

I found that the input index passed to __getitem__ is now fixed, but the output color is not. The reason is that I call np.random inside __getitem__.
I tried to fix this NumPy randomness by setting the seed of numpy.random. After multiple attempts I found that:

  1. Setting numpy.random.seed inside __getitem__: works; both the input indices and the output data are fixed (see the sketch after this list).
  2. Setting numpy.random.seed right after mp.spawn(): does not work; the input indices are fixed but the output data is not. I suspect it is because I call numpy.random to initialize my network before data sampling, but that should not break the seeding, so I don’t know the reason.
  3. Setting numpy.random.seed before mp.spawn(): does not work; the input indices are fixed but the output data is not.
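
For reference, a minimal, hypothetical sketch of attempt 1 (the dataset name and fields are placeholders, not the original code): reseeding NumPy inside __getitem__ ties the random color to the sample index, which is why both the indices and the data become fixed, at the cost of producing the same augmentation every epoch.

    import numpy as np
    from torch.utils.data import Dataset

    class RandomColorDataset(Dataset):  # hypothetical stand-in for train_dset
        def __init__(self, length=100):
            self.length = length

        def __len__(self):
            return self.length

        def __getitem__(self, index):
            # Attempt 1: reseed NumPy per sample so the sampled color depends
            # only on the index, not on which worker or run produced it.
            np.random.seed(index)
            color = np.random.uniform(0.0, 1.0, size=3)
            return {"index": index, "color": color}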

So does this mean I should never use numpy.random and only use torch.random in DDP? It seems strange. Thanks!

Check whether you need to re-seed NumPy in the worker_init_fn, as described here.
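
For example, a minimal sketch building on the loader above (train_dset and sampler are the objects from the earlier snippets): re-seed NumPy in each worker via worker_init_fn so that np.random calls inside __getitem__ also become reproducible.

    import numpy as np
    import torch
    from torch.utils.data import DataLoader

    def seed_worker(worker_id):
        # In a worker process, torch.initial_seed() is already derived from the
        # loader's base seed and the worker id, so reusing it keeps workers
        # distinct from each other while staying reproducible across runs.
        worker_seed = torch.initial_seed() % 2**32
        np.random.seed(worker_seed)

    g = torch.Generator()
    g.manual_seed(0)  # fixes the base seed used to derive per-worker seeds

    train_loader = DataLoader(
        train_dset,
        batch_size=1,
        num_workers=8,
        pin_memory=True,
        sampler=sampler,            # DistributedSampler with the same seed on every rank
        worker_init_fn=seed_worker,
        generator=g,
    )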