Does getitem of dataloader reset random seed?

thnkim · September 29, 2017, 11:47pm

Suppose I add the following code to __getitem__ in torchvision’s cifar.py:

    def __getitem__(self, index):
        ...
        # doing this so that it is consistent with all other datasets
        # to return a PIL Image
        img = Image.fromarray(img)

        if index == 0:    # outputs a random number for debugging
            print(np.random.uniform(-1, 1))

        if self.transform is not None:
            img = self.transform(img)
        ...

The line print(np.random.uniform(-1, 1)) always prints the same value in every epoch.

I tried to make my own dataloader without torchvision, but I observed the same issue.
I never called numpy.random.seed().

Interestingly, when I tried the following:

kwargs = {'num_workers': 1, 'pin_memory': True} if args.cuda else {}
train_loader = torch.utils.data.DataLoader(
    dataset('../data', train=True, download=True, transform=transform),
    batch_size=args.batch_size, shuffle=True, **kwargs)

for data in enumerate(train_loader):
    numpy.random.uniform(-1, 1)    # This dummy line was intentionally added
    pass

With the dummy line above (the first line in the for loop), __getitem__ generates non-repeating values.
Without that line, I get the same value every epoch.

Thank you.

smth · September 30, 2017, 2:58pm

The workers are re-created on every epoch. Hence I think you are seeing the RNG start again with the default seed at the beginning of every epoch.

smth · September 30, 2017, 3:01pm

This is the small (hopefully easy to follow) function:

https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataloader.py#L129


        if re.search('[SaUO]', elem.dtype.str) is not None:
            raise TypeError(error_msg.format(elem.dtype))


        return torch.stack([torch.from_numpy(b) for b in batch], 0)
    if elem.shape == ():  # scalars
        py_type = float if elem.dtype.name.startswith('float') else int
        return numpy_type_map[elem.dtype.name](list(map(py_type, batch)))
elif isinstance(batch[0], int_classes):
    return torch.LongTensor(batch)
elif isinstance(batch[0], float):
    return torch.DoubleTensor(batch)
elif isinstance(batch[0], string_classes):
    return batch
elif isinstance(batch[0], collections.Mapping):
    return {key: default_collate([d[key] for d in batch]) for key in batch[0]}
elif isinstance(batch[0], collections.Sequence):
    transposed = zip(*batch)
    return [default_collate(samples) for samples in transposed]


raise TypeError((error_msg.format(type(batch[0]))))

On each __iter__ the DataLoader returns a new one of these iterators: https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataloader.py#L302

The iterator creates a new self.workers every time it’s created: https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataloader.py#L151-L155

thnkim · September 30, 2017, 4:28pm

Thank you!

I also found that I should not rely on numpy.random.seed() with multiprocessing:

‘each subprocess will inherit the RNG state from its parent, meaning that you will get identical random variates in each subprocess’

(from the Stack Overflow question “Should I use `random.seed` or `numpy.random.seed` to control random number generation in `scikit-learn`?”)
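
The inheritance described in that quote can be reproduced outside PyTorch. The following is a minimal sketch, assuming a POSIX system where the "fork" start method is available; the helper names `_child` and `draws_from_forked_children` are made up for illustration and are not part of PyTorch or NumPy. It also shows why a dummy draw in the parent loop appears to "fix" the problem: it moves the parent's state before the next fork.

```python
import multiprocessing as mp

import numpy as np


def _child(conn):
    # Runs in the forked child: draw once from NumPy's global RNG and report it.
    conn.send(float(np.random.uniform(-1, 1)))
    conn.close()


def draws_from_forked_children(n_children=3, advance_parent=False):
    """Fork n_children processes one after another and collect one draw from each."""
    ctx = mp.get_context("fork")  # POSIX only; not available on Windows
    values = []
    for _ in range(n_children):
        if advance_parent:
            # Plays the role of the "dummy line": advances the parent's RNG state,
            # so the next child inherits a different state.
            np.random.uniform(-1, 1)
        parent_conn, child_conn = ctx.Pipe()
        proc = ctx.Process(target=_child, args=(child_conn,))
        proc.start()
        values.append(parent_conn.recv())
        proc.join()
    return values
```

Calling `draws_from_forked_children(3)` should return the same number three times, because the parent never touches its RNG between forks, whereas `draws_from_forked_children(3, advance_parent=True)` should return three different numbers.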

Marvin · October 9, 2017, 5:08pm

Does this mean that the same data augmentation is applied to data in different workers? If the batch size equals num_workers, there is a very high chance that the whole batch has been transformed with exactly the same operations (rotation, translation, flip, etc.).

If this is the case, would calling np.random.seed() and random.seed() before each transformation make those transforms in a batch different?

thnkim · October 13, 2017, 4:00am

Well, the number of workers was 1.
I mean that the i-th iteration of the for loop in the code below generates the same random value in every epoch.
I created the dataset once and created the data loader at every epoch.
With random.* there was no problem, but I had the issue with numpy.random.*.

for epoch in range(100):
    kwargs = {'num_workers': 1, 'pin_memory': True} if args.cuda else {}
    train_loader = torch.utils.data.DataLoader(
        dataset('../data', train=True, download=True, transform=transform),
        batch_size=args.batch_size, shuffle=True, **kwargs)
    #
    for data in enumerate(train_loader):
        do_something()

smth · October 14, 2017, 8:40am

@thnkim It is because NumPy’s RNG is not forkable.

Instead, add these lines to the top of your main script (you also need to use Python 3):

import torch
import torch.multiprocessing as mp
mp.set_start_method('spawn')

ignacio-rocco · October 19, 2017, 6:59am

Hi, my fix was to pass a random seed to each worker in the following way:

In DataLoaderIter:

self.workers = [
    multiprocessing.Process(
        target=_worker_loop,
        args=(self.dataset, self.index_queue, self.data_queue, self.collate_fn,
              np.random.randint(0, 4294967296, dtype='uint32')))
    for _ in range(self.num_workers)]

In the worker loop:

def _worker_loop(dataset, index_queue, data_queue, collate_fn, rng_seed):
    global _use_shared_memory
    _use_shared_memory = True

    np.random.seed(rng_seed)  # local random seed for each worker
    torch.set_num_threads(1)
    while True:
        r = index_queue.get()
        if r is None:
            data_queue.put(None)
            break
        idx, batch_indices = r
        try:
            samples = collate_fn([dataset[i] for i in batch_indices])
        except Exception:
            data_queue.put((idx, ExceptionWrapper(sys.exc_info())))
        else:
            data_queue.put((idx, samples))

Then, setting a global seed with numpy.random.seed makes the code reproducible, while keeping the random numbers diverse across workers.

Rishi_Rawat · November 4, 2017, 8:45am

My solution:

Somewhere else:
import datetime
import numpy

In the dataset class:
def __call__(self, ...):
    # datetime objects expose .microsecond (there is no .millisecond attribute)
    now = datetime.datetime.now()
    numpy.random.seed(now.second + now.microsecond)
    …

Seems to be working OK.

Royi · November 10, 2017, 5:54pm

New user question.
I use shuffle in the DataLoader with the number of workers set to 0.
I set torch.manual_seed(seedNumber) at the beginning of the script.

I was expecting results with the same parameters to be reproducible from then on, yet they are not.
Is there a way to set the seed of the random number generator used by the shuffle option?

I also use a Dropout layer; will the seed apply to it as well?

Thank You.

SimonW · January 19, 2018, 6:46am

You may also need to set torch.backends.cudnn.deterministic = True.
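
A minimal sketch of the seeding discussed here, assuming a recent PyTorch release and num_workers=0 so that shuffling happens in the main process. The helpers `seed_everything` and `shuffled_order` are illustrative names, not PyTorch APIs. Dropout masks are drawn from PyTorch's own RNG, so torch.manual_seed should cover them as well; full determinism on GPU may need more than this.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_everything(seed):
    """Seed the Python, NumPy, and PyTorch RNGs in the current process."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Ask cuDNN for deterministic algorithms; this is only a flag when no GPU is used.
    torch.backends.cudnn.deterministic = True


def shuffled_order(seed):
    """Return the order in which a shuffling DataLoader visits ten toy samples."""
    seed_everything(seed)
    dataset = TensorDataset(torch.arange(10))
    loader = DataLoader(dataset, batch_size=1, shuffle=True, num_workers=0)
    # Each batch is a one-element list holding a tensor of shape (1,).
    return [int(batch[0].item()) for batch in loader]
```

With the same seed, two calls to `shuffled_order` should visit the samples in the same order, which is one quick way to check that the shuffle is actually controlled by the seed.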

Royi · February 10, 2018, 8:56pm

Is there a place that describes exactly what’s needed to be able to reproduce results on any system?

ZhengRui · February 23, 2018, 4:19pm

I ran into the same issue today. In my dataset’s __getitem__, I used torch.multinomial or np.random.choice to draw some samples, and in the dataloader I specified 8 workers. In every 8 consecutive batches, the drawn samples are the same.

SimonW · February 23, 2018, 5:30pm

@ZhengRui NumPy’s RNG state is duplicated when the worker processes are forked. See the DataLoader’s worker_init_fn option for how to solve it.

ZhengRui · February 23, 2018, 6:10pm

@SimonW, thanks for the quick reply. I just read the documentation and am now creating a virtualenv for pytorch3.1; I was using pytorch2. It seems I don’t need to use the worker_init_fn option: each worker already generates a different sequence because base_id+worker_id is used to seed each worker internally.

SimonW · February 23, 2018, 7:11pm

Yeah, if you are using torch.multinomial then it is good already. But you need to set the seed if you use NumPy.

jia_lee · February 22, 2019, 7:22am

Hi, I’m new to PyTorch. Do you mean we do not need to reset the seed for every worker if we use the PyTorch DataLoader?

jia_lee · February 22, 2019, 8:00am

Hi, if we use random rather than np.random, do we need to reset the random seed for every worker when we use multiple processes in the DataLoader?

SimonW · February 22, 2019, 4:51pm

You shouldn’t set the random seed in __getitem__, and should only set the NumPy one in worker_init_fn if you use NumPy. See https://pytorch.org/docs/master/notes/faq.html#my-data-loader-workers-return-identical-random-numbers
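
A minimal sketch of the worker_init_fn approach, assuming a recent PyTorch release. `NoisyDataset`, `seed_worker`, and `make_loader` are hypothetical names used only for illustration. The idea is to derive the NumPy seed from the per-worker seed that PyTorch already assigns, which differs between workers and, as far as I know, changes each time a new iterator is created.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class NoisyDataset(Dataset):
    """Hypothetical dataset whose items depend on NumPy's global RNG."""

    def __len__(self):
        return 8

    def __getitem__(self, index):
        return torch.tensor(np.random.uniform(-1, 1))


def seed_worker(worker_id):
    # Inside a worker process, torch.initial_seed() returns that worker's own seed.
    # np.random.seed only accepts values below 2**32, hence the modulo.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def make_loader(num_workers=2):
    """Build a loader whose workers reseed NumPy and random on startup."""
    return DataLoader(NoisyDataset(), batch_size=2, num_workers=num_workers,
                      worker_init_fn=seed_worker)
```

A training script would then iterate over `make_loader()` once per epoch. On platforms that use the "spawn" start method, that loop should sit under an `if __name__ == "__main__":` guard.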

jia_lee · February 23, 2019, 1:28am

Thank you for your help! Your hint is useful. After some googling, I found that if we use Python’s random package (not numpy.random), we don’t need to reset the random seed any more.
