Hello,
I get the following error (this error has been referenced many times on the internet, but putting it here anyways) -
Traceback (most recent call last):
File "main.py", line 79, in <module>
main()
File "main.py", line 69, in main
distillation = Distillation(train_loader, train_dataset, model_wt_path, config_dict, args.seed) # passing the mean and stddev of the dataset for the images, this is more specific to images when done in this context
File "/x0/megh98/projects/ddist/distillation.py", line 44, in __init__
train_acc = self.run_validation(self.train_loader)
File "/x0/megh98/projects/ddist/distillation.py", line 66, in run_validation
logits = self.net(img)
File "/x0/megh98/anaconda3/envs/ddist/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/x0/megh98/projects/ddist/models/networks.py", line 67, in forward
out = self.classifier(out)
File "/x0/megh98/anaconda3/envs/ddist/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/x0/megh98/anaconda3/envs/ddist/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
when I set - torch.use_deterministic_algorithms(True)
. This is no surprise. When I went to the docs here - torch.use_deterministic_algorithms — PyTorch 1.12 documentation , it mentions that torch.mm
is a non deterministic operation. So, I have the following 2 questions -
- Why would a simple matrix multiplication such as
torch.nn
be a non-deterministic operation? - From the error traceback it seems like the non-determinism is coming from the
F.linear
function, which after looking at the pytorch GitHub seems to use torch.matmul which inturn uses torch.mm atleast according to my understanding - and this is what is leading to the non determinism. Am I right in my understanding here?
Thanks!