Reproducibility with all the bells and whistles

I had to dig through a lot of docs and discussions to finally figure out the right set of code to make a PyTorch pipeline completely reproducible.

There are two parts to this. First, you want a seed that seeds PyTorch, NumPy, and Python's random module right at the start of the main process, whether you are training or running inference.

You would want to use this function:

import random

import numpy
import torch


def seed_all(seed):
    if not seed:
        seed = 10

    print("[ Using Seed : ", seed, " ]")

    torch.manual_seed(seed)           # seeds the CPU RNG
    torch.cuda.manual_seed_all(seed)  # seeds the RNG on every CUDA device
    numpy.random.seed(seed)
    random.seed(seed)
    # Force cuDNN to choose deterministic algorithms and disable the
    # benchmark autotuner, which can pick different kernels across runs.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

Just call it right at the start and you are good to go.
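For example, a minimal sketch of how the call might sit at the top of a script (train() here is just a placeholder for your own entry point):

if __name__ == "__main__":
    seed_all(42)  # seed everything before any model or data code runs
    train()       # placeholder: your training or inference entry point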

Now comes the part where you use worker processes in the DataLoader to speed up training. Each worker process is a separate realm and has no relation to the seed you used in the main process, except that PyTorch internally seeds each worker with "base_seed + worker_id", where base_seed is generated from the RNG you seeded in the main process. So everything is still predictable up to this point. Keep in mind that PyTorch seeds the workers this way whether or not you define a worker_init_fn.
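You can see this for yourself with a throwaway dataset (ToyDataset and show_worker_seed below are only an illustration, not part of the recipe):

import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    def __len__(self):
        return 4
    def __getitem__(self, idx):
        return idx

def show_worker_seed(worker_id):
    # Inside a worker, torch.initial_seed() returns base_seed + worker_id.
    print(f"worker {worker_id}: torch.initial_seed() = {torch.initial_seed()}")

loader = DataLoader(ToyDataset(), num_workers=2, worker_init_fn=show_worker_seed)
for _ in loader:
    pass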

The reason you want to implement a worker_init_fn is to seed the other libraries that PyTorch does not seed for you by default, for example NumPy and Python's random module. The Albumentations library, for instance, uses Python's random module, so making sure each worker has a unique yet predictable seed matters for its randomness. Use the worker seed to seed everything except PyTorch, which has already been seeded for you:

def seed_worker(worker_id):
    # torch.initial_seed() inside a worker returns that worker's seed
    # (base_seed + worker_id); take it modulo 2**32 because NumPy only
    # accepts seeds in the range [0, 2**32).
    worker_seed = torch.initial_seed() % 2**32
    numpy.random.seed(worker_seed)
    random.seed(worker_seed)

You would define this function in the global scope and use it like so:

DataLoader(
    train_dataset,
    batch_size=batch_size,
    num_workers=4,
    worker_init_fn=seed_worker,
    shuffle=True,
)

I am training on a single GPU and this ensures end-to-end reproducibility. The only catch is that you will get different results if you change the number of workers, for the obvious reason that the per-worker seeds, and hence the augmentations applied to each sample, depend on the worker count.
I hope this post reaches people in time so that they don't waste precious GPU compute trying to figure out where on earth the randomness comes from.
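(For completeness: the PyTorch docs on reproducibility also show passing an explicit torch.Generator to the DataLoader, so that the shuffle order and the workers' base_seed come from a dedicated RNG rather than the global one:)

g = torch.Generator()
g.manual_seed(0)  # any fixed seed

DataLoader(
    train_dataset,
    batch_size=batch_size,
    num_workers=4,
    worker_init_fn=seed_worker,
    generator=g,
    shuffle=True,
)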

Thanks for sharing this post!
Would you be interested in adding some of this information to the Randomness docs? :slight_smile:

@ptrblck I would love to. Thanks for the encouragement.

Sounds good! Please create a feature request here, explain your use case a bit (or link to your initial post) and propose some changes. Once it’s reviewed, you should be good to go to create the PR! :slight_smile:

Thank you a lot, it works for me!
I have one follow-up question:

I get the same train/eval results when I do:

set_seed(10)  # does everything you mention above
train_eval_loop()
set_seed(10)
train_eval_loop()

but not when I do

set_seed(10)
train_eval_loop()
train_eval_loop()

I don’t really get why, since the settings from set_seed should still be in effect for the second call.
Can anyone explain this to me?

In your first approach you are resetting the seeds before calling train_eval_loop, which makes sure that the random operations in train_eval_loop will sample the same random numbers and thus would yield the same results.
In the second approach, you are seeding the code only once, so that multiple calls into train_eval_loop will use the pseudorandom number generator in its current state and could yield different results. However, rerunning the second approach in multiple sessions should again yield the same results.

Usually you would stick to the second approach only, since you don’t want to sample the same “random” numbers in each training iteration (they would not be random anymore).
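A minimal illustration of the state argument (just a sketch, independent of the code above):

import torch

torch.manual_seed(10)
a = torch.rand(2)  # tensor A
b = torch.rand(2)  # differs from A: the RNG state has advanced

torch.manual_seed(10)
c = torch.rand(2)  # equal to A again: reseeding reset the state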

A bit late to the party, but how would you do this given that NumPy's recommendation is to drop the legacy global np.random.seed() in favor of rng = np.random.default_rng(seed) and then using rng?

I get that you can do

import random

import numpy as np
import torch


def seed_all(seed):
    if not seed:
        seed = 10

    print("[ Using Seed : ", seed, " ]")

    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Create a seeded Generator instead of seeding NumPy's legacy global state.
    rng = np.random.default_rng(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    return rng

But how would you go about seed_worker()?

Numpy: Random Generator
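One pattern that might work (just a sketch, assuming the dataset reads a module-level rng; workers are separate processes, so each one gets its own copy):

rng = None  # module-level handle for dataset code to use

def seed_worker(worker_id):
    global rng
    worker_seed = torch.initial_seed() % 2**32
    # Rebuild a per-worker Generator instead of seeding the legacy global state.
    rng = np.random.default_rng(worker_seed)
    random.seed(worker_seed)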

Could you explain why we use torch.initial_seed() % 2**32 to set the seed of the worker?
If I print the worker_seed in seed_worker(), I get a different value each time the workers are initialized at the start of each epoch.
The worker_seed is consistent between runs of my training script, but it differs between epochs.

Why is this?

The seeds should change between epochs to avoid applying the same “random” transformations in each epoch. Each time you start iterating the DataLoader (i.e. at the start of each epoch), it draws a fresh base_seed from the main-process RNG, which is why the worker seeds differ per epoch. Since the main-process RNG was itself seeded, the whole sequence of base_seeds is still reproducible across runs, which is why you see consistent values between script executions.