Reproducibility of data obtained from DataLoader with single/multiple workers

Hi,

I am using torch.utils.data.DataLoader with a custom dataset that applies data randomization (e.g. random clipping). DataLoader seems to give different data with num_workers=0 than with other num_workers values. Example code is below:

import torch
import numpy as np
from torch.utils.data import Dataset, DataLoader

def rnd_clip(d, l=20):
    # randomly clip a window of length l out of d
    idx_st = np.random.randint(0, len(d) - l)
    print(idx_st)
    return d[idx_st:idx_st + l]

class DummyDataset(Dataset):
    def __init__(self):
        self.data = np.random.randn(4, 100)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        tmp_data = self.data[index]
        tmp_data = rnd_clip(tmp_data)
        return tmp_data

if __name__ == "__main__":
    seed = 1254
    num_workers = 16
    np.random.seed(seed)
    torch.random.manual_seed(seed)

    dataset = DummyDataset()
    dataloader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=num_workers)
    for batch_idx, data in enumerate(dataloader):
        print(data.std())

The printed std values are the same for any non-zero num_workers, but with num_workers=0 they are different. What is the cause of this, and is there any way to fix it so the data is consistent? Thanks!

You would need to seed numpy and other libs in the worker_init_fn as described here.

Thanks for your reply. I have updated the demo above following the instructions in your link, but I still get different results with num_workers=0 and other values of num_workers. Could you have a look at what's wrong with my seed settings? Thanks! The updated code is below (I changed the seed since this one leads to a larger difference):

import torch
import random
import numpy as np
from torch.utils.data import Dataset, DataLoader

def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

def rnd_clip(d, l=20):
    # randomly clip a window of length l out of d
    idx_st = np.random.randint(0, len(d) - l)
    print(idx_st)
    return d[idx_st:idx_st + l]

class DummyDataset(Dataset):
    def __init__(self):
        self.data = np.random.randn(4, 100)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        tmp_data = self.data[index]
        tmp_data = rnd_clip(tmp_data)
        return tmp_data

if __name__ == "__main__":
    seed = 1365
    num_workers = 0
    np.random.seed(seed)
    torch.random.manual_seed(seed)

    g = torch.Generator()
    g.manual_seed(seed)

    dataset = DummyDataset()
    dataloader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=num_workers, worker_init_fn=seed_worker, generator=g)
    for batch_idx, data in enumerate(dataloader):
        print(data.std())

Sorry, I might have misunderstood your actual use case.
Are you trying to get the same random NumPy values in a single process vs. multiple processes?
In your single-process code, NumPy is seeded in the main process and will thus return a deterministic sequence of values from np.random.randint(0, len(d)-l) across all sequential calls.
In your multiprocessing use case you could seed NumPy in each worker process with the same seed (which would be dangerous, since all workers would then draw identical random numbers) or with different seeds.
Creating a mapping between the two use cases might not be easily possible, since the execution order, and thus the order in which random numbers are consumed, is different.
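For completeness, one common way to sidestep the ordering issue (a sketch, not part of the reply above) is to derive the augmentation's randomness from the sample index instead of a shared global RNG stream, so each sample gets the same clip regardless of num_workers or iteration order. The names rnd_clip and DummyDataset mirror the demo above; the per-index numpy.random.default_rng generator and the base_seed parameter are assumptions for illustration:

import numpy as np
from torch.utils.data import Dataset

def rnd_clip(d, rng, l=20):
    # draw the start index from the per-sample generator, not the global stream
    idx_st = rng.integers(0, len(d) - l)
    return d[idx_st:idx_st + l]

class DummyDataset(Dataset):
    def __init__(self, base_seed=1365):
        self.base_seed = base_seed
        self.data = np.random.randn(4, 100)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # seed a fresh generator from (base_seed, index) so the clip for a given
        # sample is identical for num_workers=0 and num_workers>0
        rng = np.random.default_rng((self.base_seed, index))
        return rnd_clip(self.data[index], rng)

If the clipping should change from epoch to epoch, the epoch number could also be mixed into the seed tuple.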

I see. Thank you very much!