Hi! I have a question about DistributedSampler.
I’m using DDP, and I want my data loader to produce exactly the same data packs on every training run (but, of course, different ones on each GPU).
However, even though I set the seed and shuffle arguments of DistributedSampler, the data packs it yields are not the same across runs. What is the right way to fix the randomness of the data loader?
```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

train_loader = DataLoader(
    train_dset,
    batch_size=1,
    shuffle=False,   # shuffling is delegated to the sampler
    num_workers=8,
    pin_memory=True,
    sampler=DistributedSampler(train_dset, shuffle=True, seed=rank),
)

for batch_idx, data_pack in enumerate(train_loader):
    print(rank, batch_idx, data_pack)
```
You should pass the same seed to DistributedSampler on every rank to get deterministic shuffling behavior.
See the documentation for the seed argument: torch.utils.data — PyTorch 1.13 documentation
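For example, something like this, reusing the names from your snippet (the seed value 0 is arbitrary, and num_epochs stands in for your epoch loop):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Same seed on every rank: the shuffled order is identical everywhere,
# and the sampler partitions it so each GPU still sees a different subset.
sampler = DistributedSampler(train_dset, shuffle=True, seed=0)

train_loader = DataLoader(
    train_dset,
    batch_size=1,
    shuffle=False,      # shuffling is handled by the sampler
    num_workers=8,
    pin_memory=True,
    sampler=sampler,
)

for epoch in range(num_epochs):
    # set_epoch() changes the shuffle each epoch, but deterministically,
    # because it is combined with the fixed seed on every rank.
    sampler.set_epoch(epoch)
    for batch_idx, data_pack in enumerate(train_loader):
        print(rank, batch_idx, data_pack)
```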
With the same seed, I found that the indices passed to __getitem__ are now fixed, but the output color is not. The reason is that I call np.random inside __getitem__.
I tried to fix this randomness by setting the numpy.random seed. After several attempts I found:

1. Setting numpy.random.seed inside __getitem__: works; both the input indices and the output data are fixed (a sketch of this variant is below, after this list).
2. Setting numpy.random.seed right after mp.spawn(): does not work; the input indices are fixed but the output data is not. I suspect it is because I call numpy.random to initialize my network before data sampling, but that should not break the seeded state, so I don’t understand why.
3. Setting numpy.random.seed before mp.spawn(): does not work; the input indices are fixed but the output data is not.
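For reference, the working variant (1) looks roughly like this; MyDataset, BASE_SEED, and deriving the per-item seed from the index are simplified stand-ins for my real dataset and color augmentation:

```python
import numpy as np
from torch.utils.data import Dataset

BASE_SEED = 0  # placeholder for the seed I actually use


class MyDataset(Dataset):
    """Simplified stand-in; only the per-item seeding pattern matters."""

    def __init__(self, length=100):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # Re-seeding numpy here, once per item, is what made both the index
        # order and the sampled color deterministic. Deriving the seed from
        # the index keeps different items from drawing identical noise.
        np.random.seed(BASE_SEED + index)
        color = np.random.rand(3)  # stand-in for my random color augmentation
        return index, color
```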
So does this mean I should never use numpy.random in DDP and use torch.random instead? That seems weird. Thanks!