I’m using an IterableDataset inside a DataLoader with multiple workers. Parts of my IterableDataset code call numpy.random functions. After a while I noticed that the sequence of values returned by those functions is exactly the same in every epoch! In other words, every worker is (somehow) reset to the same random seed at the beginning of each epoch (or whenever the worker is created). So if, for example, a worker does random image crops with positions drawn from numpy.random, each image gets the same crop in every epoch.
How/where is the seed set? Is this expected behavior?
WHY does this happen? I would expect the numpy.random seed in each worker to act the same as in a new process, unless numpy.random.seed is explicitly called by the user code.
(I was not doing anything explicit to set the seed, with either numpy or torch calls, nor anything to make torch deterministic. This seems to be the default behavior: torch modifying numpy to make it deterministic without the user requesting it.)
Simple code to reproduce:
import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset

class TestIterableDataset(IterableDataset):
    def __init__(self):
        super().__init__()

    def __iter__(self):
        # get_worker_info() returns None in the main process, so this
        # example assumes num_workers > 0.
        worker_info = torch.utils.data.get_worker_info()
        for n in range(10):
            # Yield the worker id together with a value from numpy's global RNG.
            yield worker_info.id, np.random.randint(1000000)

ds = TestIterableDataset()
for worker_id, number in DataLoader(ds, batch_size=4, num_workers=2):
    print(worker_id, number)
# This prints the same result every time it is run, and the same sequence from each worker:
# tensor([0, 0, 0, 0]) tensor([ 68669, 230721, 801136, 274196])
# tensor([1, 1, 1, 1]) tensor([ 68669, 230721, 801136, 274196])
# tensor([0, 0, 0, 0]) tensor([617084, 429589, 436968, 718987])
# tensor([1, 1, 1, 1]) tensor([617084, 429589, 436968, 718987])
# tensor([0, 0]) tensor([150977, 59469])
# tensor([1, 1]) tensor([150977, 59469])
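
For reference, a possible workaround is to reseed numpy explicitly in each worker through the DataLoader's worker_init_fn argument. This is only a minimal sketch: the function name seed_numpy_worker is hypothetical, and it assumes that torch.initial_seed() returns a distinct per-worker seed inside a worker process, which is what the DataLoader documentation describes.

```python
import numpy as np
import torch


def seed_numpy_worker(worker_id: int) -> None:
    """Hypothetical worker_init_fn: reseed numpy's global RNG in each worker."""
    # Inside a DataLoader worker, torch.initial_seed() returns the seed that
    # PyTorch assigned to that worker, so it should differ between workers.
    # np.random.seed only accepts values in [0, 2**32 - 1], hence the modulo.
    np.random.seed(torch.initial_seed() % 2**32)


# Usage with the dataset above (not executed here):
# DataLoader(ds, batch_size=4, num_workers=2, worker_init_fn=seed_numpy_worker)
```

If this assumption holds, the two workers in the example above should no longer print identical sequences; whether the values also change between epochs depends on how the workers are re-created and reseeded.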