Reproducibility not possible despite following PyTorch guidelines

papoo13 · July 4, 2022, 4:03pm

I cannot reproduce my results even when I set the seeds following PyTorch guidelines. This is my seeding configuration:

def configure_random_seed(args):
    with logger.LoggingBlock("Random Seeds", emph=True):
        # python
        seed = args.seed
        random.seed(seed)
        logging.info("Python seed: %i" % seed)
        # numpy
        seed += 1
        np.random.seed(seed)
        logging.info("Numpy seed: %i" % seed)
        # torch
        seed += 1
        torch.manual_seed(seed)
        logging.info("Torch CPU seed: %i" % seed)
        # torch cuda
        seed += 1
        torch.cuda.manual_seed_all(seed)
        torch.cuda.manual_seed(seed)
        logging.info("Torch CUDA seed: %i" % seed)

        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

I also set:

def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

gpuargs = {
    "num_workers": args.num_workers,
    "pin_memory": True,
    "worker_init_fn": seed_worker} if 'cuda' in args.device else {}

train_loader = DataLoader(
                train_dataset,
                batch_size=args.batch_size,
                shuffle=True,
                drop_last=False,
                **gpuargs)
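One thing the loader setup above does not do is pass a seeded `generator` to the `DataLoader`, which the PyTorch reproducibility notes suggest alongside `worker_init_fn`. The following is a minimal sketch under that assumption, not the poster's code: `make_loader`, the toy `TensorDataset`, and the seed value are hypothetical, and `num_workers` is left at its default so the example runs anywhere.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, seed, batch_size=4):
    """Build a shuffling DataLoader whose order is driven by its own seeded generator."""
    g = torch.Generator()
    g.manual_seed(seed)
    # In the setup from the post, worker_init_fn=seed_worker would be passed here as well.
    return DataLoader(dataset, batch_size=batch_size, shuffle=True, generator=g)


# Toy dataset standing in for train_dataset (hypothetical).
dataset = TensorDataset(torch.arange(10))

# Two loaders built with the same seed should yield the same shuffle order.
order_a = [batch[0].tolist() for batch in make_loader(dataset, seed=0)]
order_b = [batch[0].tolist() for batch in make_loader(dataset, seed=0)]
print(order_a == order_b)
```

With a dedicated generator, the shuffle order no longer depends on how many random numbers other parts of the program drew from the global torch RNG before the loader was iterated.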

Despite this, I still get widely different results (about a 5% difference in accuracy). While reading the PyTorch guidelines, I came across this passage, which I am not sure I fully understand:

“However, some applications and libraries may use NumPy Random Generator objects, not the global RNG (Random Generator — NumPy v1.26 Manual), and those will need to be seeded consistently as well.”

Maybe I need to set something else? Does anyone know how to seed NumPy Random Generator objects? I would appreciate your guidance.

P.S. I’m working with a Bayesian network (Variational BNN) in a continual learning setting.
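On the NumPy Random Generator question: a `Generator` created with `np.random.default_rng()` carries its own state and is not affected by `np.random.seed`, so it has to be given a seed when it is constructed. A minimal sketch, assuming the code (or a library it uses) builds its generators through a helper like the hypothetical `make_rng` below; the seed value is also just an example.

```python
import numpy as np


def make_rng(seed):
    """Return a NumPy Generator seeded explicitly, independent of the global RNG."""
    return np.random.default_rng(seed)


seed = 1234  # hypothetical seed value

# Same seed, same stream: these two draws are identical.
a = make_rng(seed).normal(size=3)
b = make_rng(seed).normal(size=3)

# Reseeding the legacy global RNG does not change what a seeded Generator produces.
np.random.seed(0)
c = make_rng(seed).normal(size=3)

print(np.array_equal(a, b), np.array_equal(a, c))
```

If a third-party library creates its generators internally without a seed, this does not help; in that case the library would need to expose a seed or generator argument of its own.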

ptrblck · July 4, 2022, 10:07pm

Did you also set torch.use_deterministic_algorithms(True) as mentioned in the Reproducibility docs?
Seeding might not be enough to get deterministic and reproducible outputs if the algorithms themselves produce the non-deterministic results.

papoo13 · July 5, 2022, 7:36am

I’m setting torch.backends.cudnn.deterministic = True. Is this different from what you mentioned?

papoo13 · July 5, 2022, 7:40am

One more question regarding torch.use_deterministic_algorithms(True): where should I set it? I have many modules in my code that import torch. Should I add torch.use_deterministic_algorithms(True) after each import torch?

ptrblck · July 5, 2022, 7:44am

Yes, it’s a different setting which picks deterministic algorithms for native kernels, checks that the cublas workspace size was properly set etc.

Set it right after the import.
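As a rough illustration of the advice above: `torch.use_deterministic_algorithms` sets a process-wide flag, so calling it once near the top of the entry-point script, before any model code runs, should be enough, and repeating it in every module is harmless but redundant. A minimal sketch (where it sits in a real project is an assumption):

```python
import torch

# Enable deterministic algorithms once, right after the import in the entry point.
torch.use_deterministic_algorithms(True)

# The setting is global to the process and can be queried from any module.
print(torch.are_deterministic_algorithms_enabled())
```

Operations without a deterministic implementation raise a RuntimeError while this flag is on, which makes the remaining sources of non-determinism visible instead of silent.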

papoo13 · July 6, 2022, 12:09pm

Thank you for your reply. I added torch.use_deterministic_algorithms(True) after import torch in every module of my code that imports torch. I get the following error:

RuntimeError: Deterministic behavior was enabled with either torch.use_deterministic_algorithms(True) or at::Context::setDeterministicAlgorithms(true), but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to cuBLAS

The line of code that raises the error is: F.linear(input_means, self.weight, self.bias)
I tried setting the first environment variable before running python main.py, but it hasn’t worked so far.
Any ideas? Does this mean that there is no deterministic equivalent of F.linear()?

I need to add that my CUDA driver version is 510.73.05 and CUDA_Version is 11.6.

ptrblck · July 7, 2022, 12:00am

If you are still seeing the error after trying to set the env variable, it might have been too late in the script.
Set it as an external env variable or during the launch:

CUBLAS_WORKSPACE_CONFIG=:4096:8 python script.py
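When the training script is started from another Python program instead of a shell, the same effect can be had by passing the variable in the child's environment, so it is in place before the child imports torch. This is a sketch under that assumption; `run_with_cublas_config` is a hypothetical helper, and the child command here is a stand-in that only echoes the variable, where a real launch would use something like `[sys.executable, "script.py"]`.

```python
import os
import subprocess
import sys


def run_with_cublas_config(cmd, config=":4096:8"):
    """Run cmd with CUBLAS_WORKSPACE_CONFIG set in the child's environment."""
    env = dict(os.environ)  # copy, so the parent's environment is left untouched
    env["CUBLAS_WORKSPACE_CONFIG"] = config
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# Stand-in child process that prints the variable it received.
child = [sys.executable, "-c", "import os; print(os.environ['CUBLAS_WORKSPACE_CONFIG'])"]
result = run_with_cublas_config(child)
print(result.stdout.strip())
```

Because the variable is part of the process environment from the start, there is no question of it being set too late in the script.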

papoo13 · July 8, 2022, 10:56am

Thank you very much. I set the environment variable before python script.py and it worked. I then got an error about non-deterministic functions, so I switched to torch.use_deterministic_algorithms(True, warn_only=True).
Now I get this warning instead:

UserWarning: nll_loss2d_forward_out_cuda_template does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True, warn_only=True)'. You can file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation. (Triggered internally at /opt/conda/conda-bld/pytorch_1646755897462/work/aten/src/ATen/Context.cpp:79.)

and my results are still not identical. I am running a continual learning application (prior-based, using variational inference), and two runs give the following accuracies for task 1 and task 2:
Task 1: Run 1: 94.80%, Run 2: 96.40%
Task 2: Run 1: 83.80%, Run 2: 81.00%
This gap persists for the subsequent tasks.

Is a gap like this normal when one non-deterministic function remains, here nll_loss2d_forward_out_cuda?

ptrblck · July 8, 2022, 4:34pm

It’s hard to tell as it would depend on the overall stability of your training.
For example, if your results were deterministic, you could rerun the script with different seeds and compare how the model performs in the end.
Since you are already using non-deterministic methods, you wouldn’t have a baseline but the test might still be interesting.
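One way to read the suggested test is to collect the final accuracy of several runs and look at their spread per task. The sketch below uses only the two runs reported earlier in the thread, so the standard deviation is a very rough estimate; `summarize` is a hypothetical helper, and a meaningful comparison would need more runs with different seeds.

```python
from statistics import mean, stdev


def summarize(accuracies):
    """Return the mean and sample standard deviation of per-run accuracies (in %)."""
    return mean(accuracies), stdev(accuracies)


# Accuracies from the two runs reported above; more runs would be appended here.
runs = {
    "Task 1": [94.8, 96.40],
    "Task 2": [83.80, 81.00],
}

for task, accs in runs.items():
    m, s = summarize(accs)
    print(f"{task}: mean={m:.2f}% std={s:.2f} (n={len(accs)})")
```

If the spread across different seeds turns out to be of the same order as the gap between two runs with the same seed, the remaining non-determinism is probably not the dominant source of variation.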

Also, in case you are not on the latest PyTorch version, install 1.12.0 or the nightly release to check if this operation has a deterministic version now.