RuntimeError: Function 'TBackward' returned nan values in its 0th output


Sorry to bother you, but could I get some hints on how to solve this problem?

  1. I ran main.py from the MNIST example provided by PyTorch with autograd anomaly detection set to True, and got this error.

  2. I actually first observed this error in other code of mine and thought it was triggered by taking the sqrt of a very small number (see the toy sketch after this list). I tested my code (and this example) on CPU and on another machine, and they run just fine.

  3. I think something might be wrong with my CUDA installation, but I cannot find a solution, so I am here looking for any hints that might help. Do I need to install a local CUDA and cuDNN that match the cudatoolkit version shipped with the PyTorch conda install?
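For reference, here is a toy sketch (my own minimal example, not code from the MNIST script) of the kind of thing I suspected in my other code, where a sqrt at or near zero makes a backward function return NaNs:

```python
import torch

# Toy example of my suspicion (not the MNIST code): sqrt has derivative
# 1 / (2 * sqrt(x)), so at x = 0 the backward pass computes 0 * inf = nan.
torch.autograd.set_detect_anomaly(True)

x = torch.zeros(3, requires_grad=True)
y = (torch.sqrt(x) * x).sum()
y.backward()  # with anomaly detection on, this raises a RuntimeError that
              # names the backward function which produced the NaN values;
              # without it, x.grad would just silently contain NaNs
```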

PyTorch version

pytorch 1.5.0 py3.8_cuda10.1.243_cudnn7.6.3_0 pytorch

Here is the error report:

Warning: Error detected in TBackward. Traceback of forward call that caused the error:
File "main.py", line 139, in <module>
main()
File "main.py", line 130, in main
train(args, model, device, train_loader, optimizer, epoch)
File "main.py", line 42, in train
output = model(data)
File "/home/yanglei/anaconda3/envs/pt150cu101/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "main.py", line 29, in forward
x = self.fc1(x)
File "/home/yanglei/anaconda3/envs/pt150cu101/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/yanglei/anaconda3/envs/pt150cu101/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 87, in forward
return F.linear(input, self.weight, self.bias)
File "/home/yanglei/anaconda3/envs/pt150cu101/lib/python3.8/site-packages/torch/nn/functional.py", line 1610, in linear
ret = torch.addmm(bias, input, weight.t())
(print_stack at /opt/conda/conda-bld/pytorch_1587428094786/work/torch/csrc/autograd/python_anomaly_mode.cpp:60)
Traceback (most recent call last):
File "main.py", line 139, in <module>
main()
File "main.py", line 130, in main
train(args, model, device, train_loader, optimizer, epoch)
File "main.py", line 44, in train
loss.backward()
File "/home/yanglei/anaconda3/envs/pt150cu101/lib/python3.8/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/yanglei/anaconda3/envs/pt150cu101/lib/python3.8/site-packages/torch/autograd/__init__.py", line 98, in backward
Variable._execution_engine.run_backward(
RuntimeError: Function 'TBackward' returned nan values in its 0th output.

You don't need a local CUDA installation if you don't plan to build PyTorch from source or any custom CUDA extension (or, of course, any other CUDA application).
That being said, could you update to the latest stable release and check if you are still facing this issue?
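As a quick sanity check (these are just standard PyTorch calls and don't depend on a local CUDA toolkit), you can print the versions the binaries actually ship with and whether the GPU is visible:

```python
import torch

# Versions bundled with the PyTorch binaries (no local CUDA toolkit needed)
print(torch.__version__)               # e.g. 1.5.0
print(torch.version.cuda)              # CUDA runtime shipped with the binaries
print(torch.backends.cudnn.version())  # bundled cuDNN version

# Whether the NVIDIA driver can be used and which GPU is visible
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```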

Hi ptrblck,

Thanks for your reply. After testing multiple combinations of PyTorch and CUDA, both from conda installs and from Docker (using the anibali/pytorch image), I finally swapped the GPU card and the errors disappeared. I also mounted the suspect GPU card in another machine, on which everything normally runs fine, and reproduced the error there. So I think it is the GPU card's problem, not PyTorch, other packages, or a contaminated environment.

BTW, I was pointed to the gpu-burn tool (https://github.com/wilicc/gpu-burn) for testing whether a GPU card is faulty. I tried it and it works well, so I'm leaving the link here for those who encounter similar errors.


I just tested it again with two PyTorch setups (one being the latest):

  1. pytorch 1.7.1 py38_cuda11.0.221_cudnn8.0.5_0
  2. pytorch 1.7.0 py38_cuda11.0.221_cudnn8.0.3_0

Both of them hit the same error.

One situation is a bit strange to me: when I set the anomaly detection mode to False, I can run the main.py script in the mnist folder of the PyTorch examples repo without problems.
However, with the same setting, an error occurs if I run the main.py script in the word_language_model folder.

The error is as follows (main.py refers to the one under word_language_model):

Traceback (most recent call last):
File "main.py", line 217, in <module>
train()
File "main.py", line 181, in train
torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
File "/home/yangle/anaconda3/envs/pytorch_latest/lib/python3.8/site-packages/torch/nn/utils/clip_grad.py", line 38, in clip_grad_norm_
if clip_coef < 1:
RuntimeError: CUDA error: an illegal memory access was encountered

Are you seeing the illegal memory access using the “bad” GPU or another one?
Which GPU are you using at the moment? I assume you haven’t changed anything in the tutorial and are just running the script as it is?
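As a general debugging note (not specific to your script): CUDA errors are reported asynchronously, so the `if clip_coef < 1:` line is most likely just where the error surfaces, not where it was caused. You could force synchronous kernel launches to get a more accurate stack trace; a minimal sketch of how to set it (the important part is doing it before CUDA is initialized, e.g. before importing torch):

```python
import os

# Force synchronous CUDA kernel launches so the illegal memory access is
# reported at the operation that actually triggered it. Set this before
# torch initializes CUDA (easiest: before importing torch at all).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import after setting the environment variable
```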

All errors occur ONLY on the “bad” GPU. Above, I just wanted to clarify that even with the latest stable releases of PyTorch, the errors still occur on the “bad” GPU.

  1. I just ran the scripts as they are. In this situation, even with the “bad” GPU, main.py in pytorch/mnist runs fine; no error occurs. But main.py in pytorch/word_language_model (using the LSTM model) hits the error “RuntimeError: CUDA error: an illegal memory access was encountered”.

  2. In the second situation, I made a minimal change to the scripts for testing purposes: I added the line torch.autograd.set_detect_anomaly(True) before the training for loop (see the sketch after this list). Then, in BOTH the mnist and word_language_model scripts, the error “RuntimeError: Function ‘TBackward’ returned nan values in its 0th output” appears (not necessarily TBackward; it can be other functions).
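Concretely, the change I made looks roughly like this (a self-contained sketch with a dummy model, not the actual example code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The only change for testing: turn on anomaly detection once, before the
# training loop. Everything below just stands in for the example scripts.
torch.autograd.set_detect_anomaly(True)

model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(3):
    data = torch.randn(4, 10, device="cuda")
    target = torch.randn(4, 1, device="cuda")
    optimizer.zero_grad()
    loss = F.mse_loss(model(data), target)
    loss.backward()   # on the "bad" GPU this is where the NaN error is raised
    optimizer.step()
```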

If these issues are specific to a particular GPU, I would assume it might be a hardware defect.
Are you able to reproduce the illegal memory access in every run on this device and are other GPUs working fine?

Regarding this, I did two experiments.

  1. On my own desktop, using the “bad” GPU and another GPU.
    On the “bad” GPU, the two example scripts report errors (when autograd anomaly detection is activated). On the other GPU, no error is reported.

  2. On another desktop, the same errors can be reproduced with the “bad” GPU. Using a different GPU on that desktop does not produce any errors.

The “bad” GPU is an RTX 2080 Ti, while the other GPU is a 1080 Ti (because there was no other 2080 Ti at hand). I don’t think it’s PyTorch’s problem; it must be a hardware issue.

I don’t know if this is asking too much of PyTorch: would it be possible to have a small test script in PyTorch (or maybe one already exists and I am just not aware of it) to check whether a GPU is faulty? That would make it easier to tell whether a problem is hardware or software. Anyway, if it’s too much for PyTorch, just forget it.
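For what it's worth, the kind of ad-hoc check I had in mind would look roughly like this (just my own sketch, not a validated diagnostic, and the tolerance is a loose guess):

```python
import torch

# Rough self-made sanity check (not a validated diagnostic): run the same
# matrix multiplication on CPU and GPU repeatedly, and make sure the GPU
# result stays finite and close to the CPU reference.
def quick_gpu_check(n=2048, iterations=50, device="cuda", tol=1e-2):
    a = torch.randn(n, n)
    b = torch.randn(n, n)
    ref = a @ b                              # CPU reference result
    a_dev, b_dev = a.to(device), b.to(device)
    for i in range(iterations):
        out = (a_dev @ b_dev).cpu()
        if not torch.isfinite(out).all():
            return f"non-finite values on iteration {i}"
        # tol is a loose heuristic; newer GPUs using TF32 may need a larger one
        if (out - ref).abs().max() > tol:
            return f"result differs from CPU reference on iteration {i}"
    return "no obvious problem detected"

if torch.cuda.is_available():
    print(quick_gpu_check())
```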

I don’t think it’s a good approach to add hardware checks to a framework, as these kinds of issues are so variable that the majority of them might be blamed on the hardware even though we might actually be shipping a faulty kernel. I’m a bit afraid we could claim a hardware issue too early and ship bad code.

There is one thing you could check using the bad GPU:
could you rerun the code, let it error out, check dmesg for any Xid entries, and report the error code here, please?

Hi ptrblck. Sorry for my late reply. Unfortunately, I have already returned the “bad” GPU card to the vendor to exchange it for a new one, so I cannot do what you’ve suggested.

I hope the new device is working properly! :)