Strange behavior for DataLoader with random numbers

I was trying to use some randomness in a dataset I built, and got a very strange behavior.

For simplicity I reproduced this behavior in the following toy example:

import numpy as np
import torch
import torch.utils.data as D

class TestDataset(D.Dataset):

    def __init__(self, mysize=20):

        super(TestDataset, self).__init__()
        self.mysize = mysize

    def __len__(self):
        return self.mysize

    def __getitem__(self, idx):
        
        return torch.tensor(np.random.randint(0,10,size=4))

td=TestDataset(16)
dl=D.DataLoader(td,batch_size=2,num_workers=4)

for d in dl:
    print (d)

The results I get are

tensor([[6, 2, 0, 6],
        [8, 4, 5, 3]])
tensor([[6, 2, 0, 6],
        [8, 4, 5, 3]])
tensor([[6, 2, 0, 6],
        [8, 4, 5, 3]])
tensor([[6, 2, 0, 6],
        [8, 4, 5, 3]])
tensor([[7, 4, 7, 0],
        [5, 6, 0, 6]])
tensor([[7, 4, 7, 0],
        [5, 6, 0, 6]])
tensor([[7, 4, 7, 0],
        [5, 6, 0, 6]])
tensor([[7, 4, 7, 0],
        [5, 6, 0, 6]])

Instead of random output, the same batch is repeated once per worker.

A Bug? A Feature?

I could reproduce this in 1.3.0, and it seems all workers use the same seed for each spawn.
@albanD is this expected? I cannot reproduce it in 1.3.0.dev20190919.
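
For reference, this is roughly how I checked what each worker starts from: a small diagnostic sketch using worker_init_fn (the function name is my own), which prints the torch seed and the first word of the NumPy RNG state per worker. Identical NumPy values across workers mean identical random draws.

import numpy as np
import torch
import torch.utils.data as D

def show_worker_state(worker_id):
    # torch.initial_seed() differs per worker, but the NumPy state
    # can be the same copy inherited from the parent process
    print(worker_id, torch.initial_seed(), np.random.get_state()[1][0])

dl = D.DataLoader(td, batch_size=2, num_workers=4, worker_init_fn=show_worker_state)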


It does ring a bell that there was a PR fixing this.

Here is the doc section that corresponds to it.
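
The pattern that doc section describes, if you want to keep using NumPy randomness, is to re-seed NumPy in each worker. A minimal sketch (the exact seeding scheme is just one reasonable choice, not the only one):

import numpy as np
import torch
import torch.utils.data as D

def worker_init_fn(worker_id):
    # derive a distinct NumPy seed from the per-worker torch seed
    np.random.seed(torch.initial_seed() % 2**32)

dl = D.DataLoader(td, batch_size=2, num_workers=4, worker_init_fn=worker_init_fn)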


Thanks for the blazing fast response!

I understand it’s a feature and I need to use PyTorch’s random number generators to get proper per-worker randomness.
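
For example, switching __getitem__ to torch’s own RNG should already behave as expected, since PyTorch seeds each worker differently (a quick sketch, not tested exhaustively):

    def __getitem__(self, idx):
        # torch's RNG is seeded per DataLoader worker,
        # so this gives different samples across workers
        return torch.randint(0, 10, (4,))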
