Does __getitem__ of dataloader reset random seed?

If I add the following code to __getitem__ of cifar.py in torchvision,

    def __getitem__(self, index):
        ...
        # doing this so that it is consistent with all other datasets
        # to return a PIL Image
        img = Image.fromarray(img)

        if index == 0:    # outputs a random number for debugging
            print(np.random.uniform(-1, 1))

        if self.transform is not None:
            img = self.transform(img)
        ...

The line print(np.random.uniform(-1, 1)) always outputs the same value for every epoch.

I tried to make my own dataloader without torchvision, but I observed the same issue.
I never called numpy.random.seed().

Interestingly, when I tried

kwargs = {'num_workers': 1, 'pin_memory': True} if args.cuda else {}
train_loader = torch.utils.data.DataLoader(
    dataset('../data', train=True, download=True, transform=transform),
    batch_size=args.batch_size, shuffle=True, **kwargs)

for data in enumerate(train_loader):
    numpy.random.uniform(-1, 1)    # This dummy line was intentionally added
    pass

With the dummy line above (the first line in the for loop), __getitem__ generates non-repeating values.
Without the line, I get the same value.

Thank you.


The workers are re-created on every epoch. Hence I think you are seeing the RNG start again with the default seed at the beginning of every epoch.
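One way to picture this, assuming each new worker starts from a copy of the parent's NumPy RNG state (which is what the replies further down describe for forked processes). The sketch below only simulates that copy with `get_state`/`set_state`; it does not start real workers, and `simulated_worker_draws` is a made-up helper, not part of PyTorch.

```python
import numpy as np


def simulated_worker_draws(parent_state, n=3):
    """Draw n values from a fresh RNG that starts at a copy of parent_state."""
    rng = np.random.RandomState()
    rng.set_state(parent_state)
    return rng.uniform(-1, 1, size=n)


np.random.seed(0)  # fixed only so that the sketch itself is repeatable

# Two "epochs" whose workers start from the same, untouched parent state.
state = np.random.get_state()
epoch_1 = simulated_worker_draws(state)
epoch_2 = simulated_worker_draws(state)
print(np.array_equal(epoch_1, epoch_2))  # True

# The parent draws one number between epochs (like the dummy line in the
# question), so the next worker starts from a different state.
np.random.uniform(-1, 1)
epoch_3 = simulated_worker_draws(np.random.get_state())
print(np.array_equal(epoch_1, epoch_3))  # False
```

If this picture is right, it would also explain why the dummy `numpy.random.uniform` call in the main loop changes what the workers produce: it advances the parent's state before the next set of workers copies it.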


This is the small (hopefully easy to follow) function:

On each __iter__ the DataLoader returns a new one of these iterators: https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataloader.py#L302

The iterator creates a new self.workers every time it’s created: https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataloader.py#L151-L155


Thank you!

I also found that I should not rely on numpy.random.seed() with multiprocessing:

‘each subprocess will inherit the RNG state from its parent, meaning that you will get identical random variates in each subprocess’

(in https://stackoverflow.com/questions/31057197/should-i-use-random-seed-or-numpy-random-seed-to-control-random-number-gener)


Does this mean that the same data augmentation is applied to data in different workers? If the batch size equals num_workers, there is a very high chance that the whole batch has been transformed with exactly the same operations (rotation, translation, flip, etc.).

If this is the case, would calling np.random.seed() and random.seed() before each transformation make the transforms in a batch different?

Well, the number of workers was 1.
I mean that the i-th iteration of the for loop in the code below generates the same random value in every epoch.
I created the dataset once and created the data loader at every epoch.
When I used random.*, there was no problem, but I had the issue when I used numpy.random.*.

for epoch in range(100):
    kwargs = {'num_workers': 1, 'pin_memory': True} if args.cuda else {}
    train_loader = torch.utils.data.DataLoader(
        dataset('../data', train=True, download=True, transform=transform),
        batch_size=args.batch_size, shuffle=True, **kwargs)
    #
    for data in enumerate(train_loader):
        do_something()

@thnkim it is because numpy’s RNG is not forkable.

Instead, add this line to the top of your main script (and you need to use python 3)

import torch
import torch.multiprocessing as mp

if __name__ == '__main__':
    # Call this once, before any DataLoader workers are started.
    mp.set_start_method('spawn')

Hi, my fix was to pass a random seed to each worker in the following way:

In DataLoaderIter:

self.workers = [
    multiprocessing.Process(
        target=_worker_loop,
        args=(self.dataset, self.index_queue, self.data_queue, self.collate_fn,
              np.random.randint(0, 4294967296, dtype='uint32')))
    for _ in range(self.num_workers)]

In the worker loop:

def _worker_loop(dataset, index_queue, data_queue, collate_fn, rng_seed):
    global _use_shared_memory
    _use_shared_memory = True

    np.random.seed(rng_seed)  # local random seed for each worker
    torch.set_num_threads(1)
    while True:
        r = index_queue.get()
        if r is None:
            data_queue.put(None)
            break
        idx, batch_indices = r
        try:
            samples = collate_fn([dataset[i] for i in batch_indices])
        except Exception:
            data_queue.put((idx, ExceptionWrapper(sys.exc_info())))
        else:
            data_queue.put((idx, samples))

Then, setting a global seed with numpy.random.seed makes the code reproducible, while keeping the random numbers diverse across workers.
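A rough simulation of that idea, without real processes: the parent draws one 32-bit seed per worker from the globally seeded NumPy RNG, and each "worker" is just a separate `np.random.RandomState`. The function name `simulate_workers` and its parameters are made up for this sketch; only the `np.random.randint(0, 4294967296, dtype='uint32')` call is taken from the patch above.

```python
import numpy as np


def simulate_workers(global_seed, num_workers=4, draws=3):
    """Return one array of random draws per simulated worker."""
    np.random.seed(global_seed)  # the single seed the user controls
    # One seed per worker, drawn in the parent as in the patch above.
    worker_seeds = np.random.randint(0, 4294967296, size=num_workers,
                                     dtype='uint32')
    # Each simulated worker gets its own RNG, seeded with its own value.
    return [np.random.RandomState(int(s)).uniform(-1, 1, size=draws)
            for s in worker_seeds]


run_a = simulate_workers(global_seed=123)
run_b = simulate_workers(global_seed=123)

# Same global seed: every worker repeats its own sequence across runs.
print(all(np.array_equal(a, b) for a, b in zip(run_a, run_b)))

# Within one run, different workers see different sequences.
print(np.array_equal(run_a[0], run_a[1]))
```

The first check should come out true and the second false, which is the "reproducible but diverse across workers" behaviour described above.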


My solution:

Somewhere else:
import datetime

In the dataset class:
def __call__(self, *args):
    now = datetime.datetime.now()
    numpy.random.seed(now.second + now.microsecond)
    ...

Seems to be working OK.


New user question.
I use shuffle in the Data Loader with number of workers set to 0.
I set at the beginning of the script torch.manual_seed(seedNumber).

I was expecting that from now on results with the same parameters would be reproducible, yet they are not.
Is there a way to set the seed number of the random number generator used by the shuffle property?

I also use a Dropout layer; will the seed apply to it as well?

Thank You.

You may also need to set torch.backends.cudnn.deterministic = True.
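As a minimal sketch of what such a setup might look like, the helper below seeds the three RNGs mentioned in this thread and sets the cuDNN flag. `seed_everything` and `dropout_mask` are names invented here, and setting `cudnn.benchmark = False` is an extra assumption that is not mentioned above. This is not a guarantee of identical results on every system; the PyTorch documentation itself does not promise reproducibility across versions or platforms.

```python
import random

import numpy as np
import torch


def seed_everything(seed):
    """Seed Python, NumPy and PyTorch, and ask cuDNN for deterministic kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # assumption: avoid autotuned kernels


def dropout_mask(seed):
    """Apply dropout to a vector of ones right after reseeding (CPU)."""
    seed_everything(seed)
    return torch.nn.functional.dropout(torch.ones(16), p=0.5, training=True)


# Dropout draws from the PyTorch RNG, so the same seed gives the same mask.
print(torch.equal(dropout_mask(0), dropout_mask(0)))  # True
```

With num_workers set to 0 and no explicit generator, the shuffle order should come from the same PyTorch RNG, so calling a helper like this once at the top of the script is a reasonable first thing to try.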

Is there a place that describes exactly what’s needed to be able to reproduce results on any system?

I hit the same issue today. In my dataset’s __getitem__ I used torch.multinomial or np.random.choice to draw some samples, and in the DataLoader I specified 8 workers. In every 8 consecutive batches, the drawn samples are the same.

@ZhengRui NumPy’s RNG state is duplicated when the worker processes are forked. See the DataLoader’s worker_init_fn option for how to solve it.

@SimonW, thanks for the quick reply. I just read the documentation and am now creating a virtualenv for pytorch3.1 (I was using pytorch2). It seems I don’t need the worker_init_fn option: each worker already generates a different sequence, because base_seed + worker_id is used to seed each worker internally.

Yeah, if you are using torch.multinomial then it is good already :) But you need to set the seed if you use numpy.

Hi, I’m new to pytorch. Do you mean we do not need to reset the seed for every worker if we use pytorch Dataloader?

Hi, if we use random rather than np.random, do we need to reset the random seed for every worker when we use multiple processes in the DataLoader?

You shouldn’t set the random seed in __getitem__, and should only set the NumPy one in worker_init_fn if you use NumPy. See https://pytorch.org/docs/master/notes/faq.html#my-data-loader-workers-return-identical-random-numbers
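A sketch of what that could look like, assuming a PyTorch version whose DataLoader accepts the `worker_init_fn` and `generator` arguments. `NoisyDataset`, `seed_worker` and `make_loader` are hypothetical names made up for illustration; the dataset only exists to have a `__getitem__` that draws from NumPy's global RNG.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class NoisyDataset(Dataset):
    """Toy dataset: returns the index plus a draw from NumPy's global RNG."""

    def __len__(self):
        return 8

    def __getitem__(self, index):
        return torch.tensor([float(index), np.random.uniform(-1, 1)])


def seed_worker(worker_id):
    # Inside a worker, torch.initial_seed() is that worker's own seed
    # (base seed + worker id), so NumPy and random get a distinct seed too.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def make_loader(num_workers=2, seed=0):
    # A seeded generator makes the shuffle order and base seed repeatable.
    g = torch.Generator()
    g.manual_seed(seed)
    return DataLoader(NoisyDataset(), batch_size=2, shuffle=True,
                      num_workers=num_workers,
                      worker_init_fn=seed_worker, generator=g)
```

Iterating over `make_loader()` in a training script should then give each worker its own NumPy stream. Note that `worker_init_fn` only runs in worker processes, so with `num_workers=0` the NumPy draws come from the main process RNG instead.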

Thank you for your help! Your hint is useful. After some googling, I found that if we use Python’s random package (not numpy.random), we don’t need to reset the random seed any more.