I bought a new Palit GeForce RTX 3070 GPU to speed up my deep learning projects. My laptop is a Dell Latitude 5491 with an NVIDIA GeForce MX130 and Intel UHD Graphics 630. I am using the GeForce RTX 3070 in a Razer Core X via Thunderbolt 3.
I would like to make my PyTorch training reproducible, so I am using:
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
Symptom: when device="cuda:0", it addresses the MX130 and the seeds work; I get the same result every time. When device="cuda:1", it addresses the RTX 3070 and I don't get the same results. It seems that with the external GPU the random seed is not working. When device="cuda", it automatically uses the RTX 3070 and there is no reproducibility.
I am working with num_workers=0 and worker_init_fn=np.random.seed(1) in the DataLoader. So, practically, changing the executor GPU has an effect on the random seed. I don't want to use both GPUs in parallel, and I am not doing so.
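As a side note, passing np.random.seed(1) directly as worker_init_fn actually calls seed() once at DataLoader construction time and hands its return value (None) to the loader, rather than seeding each worker. A small sketch of the distinction (seed_worker is a hypothetical name, not part of the original code):

```python
import numpy as np

# Pitfall: this expression runs seed() immediately and evaluates to None,
# so the DataLoader receives None as its worker_init_fn.
result = np.random.seed(1)
assert result is None

# What the DataLoader expects is a callable taking the worker id;
# offsetting by worker_id keeps workers from drawing identical streams.
def seed_worker(worker_id: int) -> None:
    np.random.seed(1 + worker_id)
```

With num_workers=0 this makes no practical difference, but it avoids surprises once multiple workers are enabled.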
How can I make training with the external GPU reproducible? I would appreciate any help. Thanks in advance!
PyTorch version: 1.7.0
CUDA toolkit: 11.0.221
Anaconda version: 2020.07
According to nvidia-smi:
CUDA version: 11.1
Driver version: 457.09
In addition to the flags you are already using, you could also set
torch.set_deterministic(True), as described in the Reproducibility docs, so that an error is raised if a non-deterministic method is used.
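A minimal sketch of the combined setup. Note that torch.set_deterministic was the PyTorch 1.7 spelling; later releases renamed it, so the sketch below uses the newer name:

```python
import torch

# Seed the global RNG and pin cuDNN to deterministic kernels.
torch.manual_seed(1)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# In PyTorch 1.7 this was spelled torch.set_deterministic(True);
# it raises an error as soon as a non-deterministic op is executed.
torch.use_deterministic_algorithms(True)
```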
Thank you for the tip, it solved my problem! Yes, with
torch.set_deterministic(True) I got the following error:
RuntimeError: Deterministic behavior was enabled with either torch.set_deterministic(True) or at::Context::setDeterministic(true), but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
So, as it turned out, setting the
CUBLAS_WORKSPACE_CONFIG=:4096:8 environment variable can solve the problem and make training reproducible if you have CUDA >= 10.2.
My internal MX130 probably used an older CUDA version, which was not shown in nvidia-smi; that is probably why it worked in that case but not with the eGPU.
PS: some PyTorch layers cannot run deterministically at all, for example
nn.AdaptiveAvgPool1d(). But in my case I just had to remove the
torch.set_deterministic(True) setting; then there was no error, and that particular layer had no effect on the deterministic results.
Great advice, but where/how do you define the environment variable CUBLAS_WORKSPACE_CONFIG?
Thanks in advance!
You can set it as an env variable in your terminal, either via
export or by prepending it to your application invocation:
CUBLAS_WORKSPACE_CONFIG=:16:8 python script.py args
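The export form sets it once for the whole session, so every subsequent command inherits it (assuming a POSIX shell):

```shell
# Set once for the current shell session:
export CUBLAS_WORKSPACE_CONFIG=:16:8
# Every later invocation now inherits it, e.g.:
#   python script.py args
```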
I am a Windows user, and my terminal gives this error: 'CUBLAS_WORKSPACE_CONFIG' is not recognized as an internal or external command, operable program, or batch file. Is there a way to run such a command in a Windows terminal? Thanks in advance!
I’m not sure how exactly the “Windows terminal” works, but I think you can set env variables in Windows in the “Advanced system settings”.
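For reference, these are the Windows command equivalents (cmd.exe syntax; this is not a POSIX shell snippet):

```
:: cmd.exe - current session only:
set CUBLAS_WORKSPACE_CONFIG=:4096:8
:: cmd.exe - persist for future sessions:
setx CUBLAS_WORKSPACE_CONFIG :4096:8
```

In PowerShell the per-session equivalent is `$env:CUBLAS_WORKSPACE_CONFIG = ":4096:8"`.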
I want to perform Class Activation Mapping on my model.
CUBLAS_WORKSPACE_CONFIG=:16:8 does not work for me.
I am using an AdaptiveConcatPool1d layer, which seems not to work deterministically at all. Am I right?
Is there a way to get reproducible results for a Class Activation Map and still use that layer?
I need it for my bachelor thesis…
Thanks in advance!
AdaptiveConcatPool1d doesn’t seem to be a (core) PyTorch layer, so I don’t know which operations are used internally and if they have a deterministic mode.
In any case, did you set
torch.use_deterministic_algorithms(True) and if so, did you receive an error?
Yes, I did. The error said to set CUBLAS_WORKSPACE_CONFIG=:4096:8, which didn’t change anything.
However, I found a workaround by setting torch.manual_seed(0) before getting the feature maps and weights for the class activation map.
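The workaround can be sketched roughly like this (the input tensor and layer below are stand-ins for illustration, not the actual model):

```python
import torch

torch.manual_seed(0)            # pin the RNG state right before extraction
x = torch.randn(1, 8)           # stand-in for the real input
layer = torch.nn.Linear(8, 4)   # stand-in for the model's feature extractor
feature_maps = layer(x)         # stand-in for getting maps/weights for the CAM

# Re-seeding restores the identical RNG state, so repeated runs match:
torch.manual_seed(0)
assert torch.equal(x, torch.randn(1, 8))
```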
I had exactly the same case as you.
Is there an equivalent way to write this when I am using Visual Studio Code? I need to somehow add this to my environment in the launch.json file, but so far I haven't been able to get it to work.
I'm not familiar enough with Visual Studio Code and how the
launch.json file is used. You could try to set it in your Python script right at the beginning, before importing torch and before creating any CUDA context. Note, however, that users often run into issues when trying to set env variables inside the script, as it can be too late and the context might already have been initialized.
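A sketch of the in-script variant; the torch import is shown commented out here purely to emphasize the ordering constraint:

```python
import os

# This must run before `import torch` (and before any CUDA context is
# created), otherwise cuBLAS may never see the variable.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# import torch  # import torch only AFTER the variable is set
```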
If you want to always enable it, you might just want to export it from your shell's startup file.
Thank you very much. I'll try the first option and then check the results with different seeds to see if it has taken effect.
I am not sure what you mean by exporting it. Could you please explain it a bit more?
export CUBLAS_WORKSPACE_CONFIG=:16:8 will set this env variable in your terminal, so you don't need to add it to each command (similar to how e.g.
LD_LIBRARY_PATH is set).
One more question:
I am running my code in a certain conda environment. Should I export the variable while my conda environment is activated?
Export it in your terminal before executing the Python script; it doesn't matter whether the conda environment is already activated or not. Note that
export only sets this env variable in the current terminal. If you want to save it for all terminals (and also have it active after a reboot), you can write it to your shell's startup file (e.g. ~/.bashrc).
Yes, there is:
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"