Random seed with external GPU

Hi all,

I bought a new Palit GeForce RTX 3070 GPU to speed up my deep learning projects. My laptop is a Dell Latitude 5491 with an Nvidia GeForce MX130 and Intel UHD Graphics 630. I am using the GeForce RTX 3070 in a Razer Core X enclosure via Thunderbolt 3.

I would like to make my PyTorch training reproducible, so I am using:
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Symptom: When device="cuda:0", it addresses the MX130 and the seeds work: I get the same result every time. When device="cuda:1", it addresses the RTX 3070 and I don’t get the same results, so it seems the random seed is not working with the external GPU. When device="cuda", it automatically uses the RTX 3070 and there is no reproducibility.
I am working with num_workers=0 and worker_init_fn=np.random.seed(1) in the DataLoader. So in practice, changing the GPU that runs the training affects the random seed. I am not using both GPUs in parallel, and I don’t want to.
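
A note on the worker_init_fn argument: written as worker_init_fn=np.random.seed(1), the seeding call runs once when the DataLoader is constructed, and its return value, None, is what gets passed in. With num_workers=0 no worker processes are started, so the argument is not used either way. The sketch below shows the pattern with the standard-library random module so that it runs without extra packages; seed_worker and BASE_SEED are hypothetical names, and the same function could also call np.random.seed.

```python
import random

BASE_SEED = 1  # hypothetical base seed, matching the value used in the post


def seed_worker(worker_id):
    """Seed the stdlib RNG for one DataLoader worker and return the seed used."""
    worker_seed = BASE_SEED + worker_id
    random.seed(worker_seed)
    return worker_seed


# A seeding call runs immediately and returns None, so passing its result
# as worker_init_fn hands the DataLoader None instead of a function.
called_result = random.seed(BASE_SEED)

# Pass the function object itself instead (assumes num_workers > 0):
#   DataLoader(dataset, num_workers=2, worker_init_fn=seed_worker)
```

This only matters once num_workers is greater than 0; it would not explain a difference between the two GPUs with num_workers=0.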

How can I make training on the external GPU reproducible? I would greatly appreciate any help. Thanks in advance!

PyTorch version: 1.7.0
CUDA toolkit: 11.0.221
Anaconda version: 2020.07

According to NVIDIA-SMI:
CUDA version: 11.1
Driver version: 457.09

In addition to the arguments you are already using, you could also set torch.set_deterministic(True), as described in the Reproducibility docs, so that an error is raised if a non-deterministic method is used.

Thank you for the tip, it solved my problem! Yes, with torch.set_deterministic(True) I got the following error:
RuntimeError: Deterministic behavior was enabled with either torch.set_deterministic(True) or at::Context::setDeterministic(true), but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility

So, as it turned out, setting the environment variable CUBLAS_WORKSPACE_CONFIG=:16:8 or CUBLAS_WORKSPACE_CONFIG=:4096:8 solves the problem and makes the training reproducible if you have CUDA >= 10.2.

My internal MX130 probably used an older CUDA version that was not shown in nvidia-smi, which is probably why it worked there but not with the eGPU.

PS: Some PyTorch layers cannot run deterministically at all, for example nn.AdaptiveAvgPool1d(). In my case, I just had to remove the torch.set_deterministic(True) setting; after that there was no error, and that particular layer had no effect on the reproducibility of the results.


Great advice, but where/how do you define the environment variable CUBLAS_WORKSPACE_CONFIG?
Thanks in advance :wink:

You can set it as an environment variable in your terminal, either via export or by prepending it to the command that runs your application:

CUBLAS_WORKSPACE_CONFIG=:16:8 python script.py args

Hi @ptrblck,

I am a Windows user and got this error in my terminal: 'CUBLAS_WORKSPACE_CONFIG' is not recognized as an internal or external command, operable program, or batch file. Is there a way to run such a command in a Windows terminal? Thanks in advance!

I’m not sure exactly how the "Windows terminal" works, but I think you can set environment variables in Windows under "Advanced system settings".
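
Another option that does not depend on the shell is to set the variable from inside the script itself. The following is a minimal sketch, assuming the variable has to be in place before any CUDA work starts, which is why it sits above the PyTorch import; WORKSPACE_CONFIG is a hypothetical name, and the value is one of the two listed in the error message quoted earlier in the thread.

```python
import os

# Value taken from the error message quoted earlier in the thread;
# ":16:8" is the other option that message lists.
WORKSPACE_CONFIG = ":4096:8"

# Set the variable for this process only. Do this before any CUDA work
# starts; placing it above "import torch" is the safe choice.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = WORKSPACE_CONFIG

# import torch  # import PyTorch only after the variable is set

print(os.environ["CUBLAS_WORKSPACE_CONFIG"])
```

If you would rather set it in the shell on Windows, the Command Prompt form is set CUBLAS_WORKSPACE_CONFIG=:16:8 on its own line before running python script.py, and the PowerShell form is $env:CUBLAS_WORKSPACE_CONFIG = ":16:8". The VAR=value prefix syntax shown above only works in Unix-style shells.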

Hi there,
I want to perform Class Activation Mapping on my model.
Setting CUBLAS_WORKSPACE_CONFIG=:16:8 does not work for me.
I am using an AdaptiveConcatPool1d layer, which does not seem to work deterministically at all. Am I right?
Is there a way to get reproducible results for a Class Activation Map and still use that layer?
I need it for my bachelor thesis…
Thanks in advance!

AdaptiveConcatPool1d doesn’t seem to be a (core) PyTorch layer, so I don’t know which operations are used internally and if they have a deterministic mode.
In any case, did you set torch.use_deterministic_algorithms(True) and if so, did you receive an error?

Yes, I did. The error said to set CUBLAS_WORKSPACE_CONFIG=:4096:8, which didn’t change anything.
However, I found a workaround by setting torch.manual_seed(0) before getting the feature maps and weights for the class activation map.

I had exactly the same case as you.