Reproducibility of data obtained from DataLoader with single/multiple workers

Hi,

I am using torch.utils.data.DataLoader with a custom dataset that applies data randomization (e.g. random clipping). DataLoader seems to give different data with num_workers=0 than with other num_workers values. Example code is below:

import torch
import numpy as np
from torch.utils.data import Dataset, DataLoader

def rnd_clip(d, l=20):
    # randomly clip a window of length l out of d
    idx_st = np.random.randint(0, len(d) - l)
    print(idx_st)
    return d[idx_st:idx_st + l]

class DummyDataset(Dataset):
    def __init__(self):
        self.data = np.random.randn(4, 100)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        tmp_data = self.data[index]
        tmp_data = rnd_clip(tmp_data)
        return tmp_data

if __name__ == "__main__":
    seed = 1254
    num_workers = 16
    np.random.seed(seed)
    torch.random.manual_seed(seed)

    dataset = DummyDataset()
    dataloader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=num_workers)
    for batch_idx, data in enumerate(dataloader):
        print(data.std())

The printed std values are the same for any non-zero num_workers, but with num_workers=0 they are different. What is the cause of this, and is there any way to fix it so the data is consistent? Thanks!

You would need to seed numpy and other libs in the worker_init_fn as described here.

Thanks for your reply. I have updated the demo above following the instructions in your link, but I still get different results with num_workers=0 and other values of num_workers. Could you have a look at what's wrong with my seed settings? Thanks! The updated code is below (I changed the seed since this one leads to a larger difference):

import torch
import random
import numpy as np
from torch.utils.data import Dataset, DataLoader

def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

def rnd_clip(d, l=20):
    # randomly clip a window of length l out of d
    idx_st = np.random.randint(0, len(d) - l)
    print(idx_st)
    return d[idx_st:idx_st + l]

class DummyDataset(Dataset):
    def __init__(self):
        self.data = np.random.randn(4, 100)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        tmp_data = self.data[index]
        tmp_data = rnd_clip(tmp_data)
        return tmp_data

if __name__ == "__main__":
    seed = 1365
    num_workers = 0
    np.random.seed(seed)
    torch.random.manual_seed(seed)

    g = torch.Generator()
    g.manual_seed(seed)

    dataset = DummyDataset()
    dataloader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=num_workers, worker_init_fn=seed_worker, generator=g)
    for batch_idx, data in enumerate(dataloader):
        print(data.std())

Sorry, I might have misunderstood your actual use case.
Are you trying to get the same random NumPy values in a single process vs. multiple processes?
In your single-process code, NumPy is seeded in the main process and will thus return a deterministic sequence of values from np.random.randint(0, len(d)-l) across all sequential calls.
In your multiprocessing use case you could seed NumPy in each worker process with the same seed (which would be dangerous, since all workers would then draw identical random numbers) or with different seeds.
Creating a mapping between the two use cases might not be easily possible, since the execution order, and thus the order in which random numbers are consumed, is different.
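For completeness, one common way to sidestep the ordering issue (a sketch, not part of the reply above) is to derive the augmentation's randomness from the sample index instead of a shared global RNG stream, so each sample gets the same clip regardless of num_workers or iteration order. The names rnd_clip and DummyDataset mirror the demo above; the per-index numpy.random.default_rng generator and the base_seed parameter are assumptions for illustration:

import numpy as np
from torch.utils.data import Dataset

def rnd_clip(d, rng, l=20):
    # draw the start index from the per-sample generator, not the global stream
    idx_st = rng.integers(0, len(d) - l)
    return d[idx_st:idx_st + l]

class DummyDataset(Dataset):
    def __init__(self, base_seed=1365):
        self.base_seed = base_seed
        self.data = np.random.randn(4, 100)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # seed a fresh generator from (base_seed, index) so the clip for a given
        # sample is identical for num_workers=0 and num_workers>0
        rng = np.random.default_rng((self.base_seed, index))
        return rnd_clip(self.data[index], rng)

If the clipping should change from epoch to epoch, the epoch number could also be mixed into the seed tuple.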

I see. Thank you very much!