Dataloaders multiprocess with torch.manual_seed

Hi,
I am currently fixing the seed values globally in my script using the following snippet:

import random

import numpy as np
import torch

seed_value = 123457
random.seed(seed_value)                  # Python RNG
np.random.seed(seed_value)               # NumPy RNG (CPU)
torch.manual_seed(seed_value)            # PyTorch CPU RNG
torch.cuda.manual_seed(seed_value)       # current GPU
torch.cuda.manual_seed_all(seed_value)   # all GPUs
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Further, I am loading frames for a video using multiple worker processes in my DataLoader (num_workers > 0). Does the above snippet assign a fixed seed of worker_id + seed_value to every worker in an epoch?
I am also using random augmentations as part of my data loading pipeline, such as RandomResizedCrop and RandomHorizontalFlip from torchvision transforms. If the worker seed is fixed at worker_id + seed_value, does that mean the data will go through the same set of augmentations at each epoch?

If someone can clarify this, it would be of great help. Thanks!

From the documentation:

By default, each worker will have its PyTorch seed set to base_seed + worker_id, where base_seed is a long generated by main process using its RNG (thereby, consuming a RNG state mandatorily). However, seeds for other libraries may be duplicated upon initializing workers (e.g., NumPy), causing each worker to return identical random numbers. (See this section in FAQ.)

So unless you reset the RNG between two epochs, drawing the base seed from the RNG (whether that is a good idea for perfect randomness is not completely obvious to me) ensures that you get a new random seed every time.
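
For illustration, a rough sketch of what happens each time a new DataLoader iterator is created (this only mimics the base-seed draw; the exact internals may differ between versions, but the point is that each iterator consumes parent RNG state, so the base seed changes between epochs):

```
import torch

torch.manual_seed(123457)

# Each new iterator draws a fresh base seed from the parent's RNG,
# so two "epochs" get different base seeds even with a fixed manual_seed.
base_seed_epoch_1 = torch.empty((), dtype=torch.int64).random_().item()
base_seed_epoch_2 = torch.empty((), dtype=torch.int64).random_().item()
print(base_seed_epoch_1 != base_seed_epoch_2)  # True
```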

Best regards

Thomas

P.S.: I would always recommend experimentally verifying things like this, even just b1 = next(iter(dl)); b2 = next(iter(dl)) and inspecting the results.
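
Something along these lines (a sketch; it assumes train_dataset exists in your script and returns (input, target) pairs):

```
import torch
from torch.utils.data import DataLoader

dl = DataLoader(train_dataset, batch_size=4, num_workers=2, shuffle=True)
b1 = next(iter(dl))  # first batch from a fresh iterator ("epoch 1")
b2 = next(iter(dl))  # first batch from another fresh iterator ("epoch 2")
print(torch.equal(b1[0], b2[0]))  # False when shuffling/augmentation differ
```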

Many thanks for the reply!

I cross-checked the generated batches using consecutive calls to next(iter(dl)), and the batches are indeed different.
So the base_seed generated by the main process is different from the seed fixed manually using torch.manual_seed? Also, if I interpret it correctly, for each epoch the workers will be assigned a new seed base_seed + worker_id, and the data augmentations will be based on that new worker seed?

Yes, the base_seed is generated by drawing a random number, so it is dependent on / defined by the RNG state (and thus the manual_seed) but not identical to it.

One thing to keep in mind is that, if enabled, the shuffling of the dataset (i.e. which samples end up in a minibatch) is done in the parent process.
The randomness inside the dataset (and potentially the collate function), such as augmentations, does indeed come from the worker, based on the new worker seed. All this assumes that you use PyTorch's random functions (which you should, as you need to be super careful to properly initialize your RNG if you don't).

All this said, there is a good argument for doing the random augmentation after batching, on the GPU, if possible. Not only does that avoid any confusion around randomness, it is likely much more efficient, too.
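
For illustration, a minimal sketch of a batched random horizontal flip done directly on the GPU (the helper name and shapes here are made up for the example, not from any library):

```
import torch

def random_hflip_batch(images, p=0.5):
    # images: [N, C, H, W], already on the target device
    flip_mask = torch.rand(images.shape[0], device=images.device) < p
    flipped = torch.flip(images, dims=[-1])  # flip along the width dimension
    return torch.where(flip_mask[:, None, None, None], flipped, images)

device = "cuda" if torch.cuda.is_available() else "cpu"
batch = torch.rand(8, 3, 224, 224, device=device)
augmented = random_hflip_batch(batch)
```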

Best regards

Thomas


Thank you @tom for clarifying my doubts.

In the torch.utils.data.DataLoader arguments, we can pass a function to worker_init_fn. Is it advisable to use worker_init_fn to access the worker's current seed and seed the other accompanying libraries like random based on the same seed?

```
import random

import numpy
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    numpy.random.seed(worker_seed)
    random.seed(worker_seed)

DataLoader(
    train_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    worker_init_fn=seed_worker
)
```

(As mentioned in https://pytorch.org/docs/stable/notes/randomness.html#dataloader)

Well, so it depends on

  • Will you use other libraries’ random functions?
  • Is there a chance you might be using them inadvertently?
  • Will others grab your code and do funny things with it and then claim it’s your fault they didn’t get proper randomness?

If any of these is yes or at least might be yes, it is a good idea to initialize randomness. If it is all firmly no, it just adds uninteresting boilerplate to your code.

Personally, I think that mixing RNGs from different libraries is not a great idea (and in fact, trying to be clever around RNGs is usually a good way to shoot yourself in the foot unless you know exactly what you are doing).
Imagine you do this with two libraries that have identical RNGs. Now you seed them to identical states. By some accident you draw random integers in some range from both of them in sync. They will always be identical. Now you combine them, say, by taking the difference. Instead of getting a random number with a symmetric triangular distribution on [-1, 1], which you would get from independent random variables, you now get all 0s.
Now this is an obvious and improbable example, but more subtle interactions do exist and happen where people do not expect them.
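
A toy version of this, just to illustrate (using two identically seeded random.Random instances as stand-ins for the two libraries):

```
import random

rng_a = random.Random(42)
rng_b = random.Random(42)

# Drawn in sync from identical, identically seeded RNGs:
diffs = [rng_a.random() - rng_b.random() for _ in range(5)]
print(diffs)  # all 0.0 instead of the spread you'd expect from independent draws
```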

(For another variation of the "don't try to be clever" theme: here is a link explaining why, while understanding the motivation, I am skeptical of the "seed with a random number" approach: Random number generator seed mistakes & how to seed an RNG.)

Best regards

Thomas