PyTorch 1.6 + NVIDIA driver 418 + 2080 Ti results in: RuntimeError: CUDA error: an illegal memory access was encountered

Hello everyone, I keep running into this illegal memory access bug no matter which project I train.

Bug description

I’ll use one project as an example to describe the bug:

 1.   git clone git@github.com:weiaicunzai/pytorch-cifar100.git
 2.   cd pytorch-cifar100
 3.   python train.py -net vgg16 -gpu

Then I get:

Files already downloaded and verified
Files already downloaded and verified
/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/jit/__init__.py:1119: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
With rtol=1e-05 and atol=1e-05, found 3 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1717986918400.0 (8.985408308668006e+16 vs. 8.985236509976166e+16), which occurred at index (0, 67).
  check_tolerance, strict, _force_outplace, True, _module_class)
/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Training Epoch: 1 [128/50000]   Loss: 4.7142    LR: 0.000256
Training Epoch: 1 [256/50000]   Loss: 4.7673    LR: 0.000512
Training Epoch: 1 [384/50000]   Loss: 4.6759    LR: 0.000767
Training Epoch: 1 [512/50000]   Loss: 4.6476    LR: 0.001023
Training Epoch: 1 [640/50000]   Loss: 4.7125    LR: 0.001279
Training Epoch: 1 [768/50000]   Loss: 4.7280    LR: 0.001535
Training Epoch: 1 [896/50000]   Loss: 4.7359    LR: 0.001790
Training Epoch: 1 [1024/50000]  Loss: 4.7307    LR: 0.002046
Training Epoch: 1 [1152/50000]  Loss: 4.6513    LR: 0.002302
Training Epoch: 1 [1280/50000]  Loss: 4.6054    LR: 0.002558
Training Epoch: 1 [1408/50000]  Loss: 4.6228    LR: 0.002813
Training Epoch: 1 [1536/50000]  Loss: 4.6426    LR: 0.003069
Traceback (most recent call last):
  File "train.py", line 209, in <module>
    train(epoch)
  File "train.py", line 52, in train
    writer.add_scalar('LastLayerGradients/grad_norm2_weights', para.grad.norm(), n_iter)
  File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/utils/tensorboard/writer.py", line 344, in add_scalar
    scalar(tag, scalar_value), global_step, walltime)
  File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/utils/tensorboard/summary.py", line 181, in scalar
    scalar = make_np(scalar)
  File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/utils/tensorboard/_convert_np.py", line 28, in make_np
    return _prepare_pytorch(x)
  File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/utils/tensorboard/_convert_np.py", line 36, in _prepare_pytorch
    x = x.cpu().numpy()
RuntimeError: CUDA error: an illegal memory access was encountered
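
As far as I understand, CUDA kernels are launched asynchronously, so the traceback above only shows where the error was reported (the x.cpu().numpy() synchronization inside the TensorBoard writer), not where the illegal access actually happened. That is why I re-ran with launch blocking enabled; as far as I know, setting the variable from inside the script, before the CUDA context is created, is equivalent to the command-line prefix I use below (just a sketch of my own):

import os

# Make every kernel launch synchronous so the Python traceback points at the
# op that actually failed, instead of at the next synchronization point.
# (As I understand it, this must be set before PyTorch creates its CUDA context.)
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the variable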

But the same code works fine on Google Colab. Then I tried to run CUDA_LAUNCH_BLOCKING=1 python train.py -net resnet18 -gpu, which gives me this:

Files already downloaded and verified
Files already downloaded and verified
/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/jit/__init__.py:1119: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
With rtol=1e-05 and atol=1e-05, found 1 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 2779565395017728.0 (-2.68501056162248e+20 vs. -2.6850383572764302e+20), which occurred at index (0, 48).
  check_tolerance, strict, _force_outplace, True, _module_class)
/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Training Epoch: 1 [128/50000]   Loss: 4.8053    LR: 0.000256
Training Epoch: 1 [256/50000]   Loss: 4.7550    LR: 0.000512
Training Epoch: 1 [384/50000]   Loss: 4.7402    LR: 0.000767
Training Epoch: 1 [512/50000]   Loss: 4.7109    LR: 0.001023
Training Epoch: 1 [640/50000]   Loss: 4.6995    LR: 0.001279
Training Epoch: 1 [768/50000]   Loss: 4.7558    LR: 0.001535
Training Epoch: 1 [896/50000]   Loss: 4.6715    LR: 0.001790
Training Epoch: 1 [1024/50000]  Loss: 4.7487    LR: 0.002046
Training Epoch: 1 [1152/50000]  Loss: 4.6628    LR: 0.002302
Training Epoch: 1 [1280/50000]  Loss: 4.6444    LR: 0.002558
Training Epoch: 1 [1408/50000]  Loss: 4.6350    LR: 0.002813
Training Epoch: 1 [1536/50000]  Loss: 4.5711    LR: 0.003069
Training Epoch: 1 [1664/50000]  Loss: 4.6021    LR: 0.003325
Traceback (most recent call last):
  File "train.py", line 209, in <module>
    train(epoch)
  File "train.py", line 44, in train
    loss.backward()
  File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Exception raised from operator() at /opt/conda/conda-bld/pytorch_1595629416375/work/aten/src/ATen/native/cudnn/Conv.cpp:980 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f790a3ec77d in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xcad302 (0x7f790b4ff302 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xcaf235 (0x7f790b501235 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xcaa48e (0x7f790b4fc48e in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xcac07b (0x7f790b4fe07b in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::cudnn_convolution_backward_input(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) + 0xb2 (0x7f790b4fe5d2 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xd117db (0x7f790b5637db in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd415f8 (0x7f790b5935f8 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::cudnn_convolution_backward_input(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) + 0x1ad (0x7f793d7b8ced in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::native::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, std::array<bool, 2ul>) + 0x223 (0x7f790b4fcca3 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0xd118c5 (0x7f790b5638c5 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0xd41654 (0x7f790b593654 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #12: at::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, std::array<bool, 2ul>) + 0x1e2 (0x7f793d7c76a2 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x2c250c2 (0x7f793f48b0c2 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x2c39684 (0x7f793f49f684 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #15: at::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, std::array<bool, 2ul>) + 0x1e2 (0x7f793d7c76a2 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::generated::CudnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x258 (0x7f793f312098 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x30d1017 (0x7f793f937017 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x7f793f932860 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #19: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7f793f933401 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f793f92b579 in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #21: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x7f7943c5a13a in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #22: <unknown function> + 0xc819d (0x7f794679019d in /home/by/miniconda3/envs/baiyu_py36/lib/python3.6/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #23: <unknown function> + 0x76db (0x7f796988a6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #24: clone + 0x3f (0x7f79695b3a3f in /lib/x86_64-linux-gnu/libc.so.6)
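
With launch blocking on, the failure now points at the cuDNN convolution backward, so one way I can think of to narrow it down is to rule cuDNN out and fall back to PyTorch's native convolution kernels. A minimal sketch of what I would add at the top of train.py (my own change, not part of the repo):

import torch

# Disable cuDNN so convolutions use PyTorch's native CUDA kernels; if the
# illegal memory access disappears, the problem is probably in the cuDNN path
# (or the driver/library combination) rather than in the model code.
torch.backends.cudnn.enabled = False

# Alternatively, keep cuDNN but turn off algorithm auto-tuning, which can pick
# different kernels between runs:
# torch.backends.cudnn.benchmark = False
# torch.backends.cudnn.deterministic = True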

Environment

Here is my current env:

Collecting environment information...
PyTorch version: 1.6.0
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti

Nvidia driver version: 418.152.00
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.6.0
[pip3] torchvision==0.7.0
[conda] blas                      1.0                         mkl    defaults
[conda] cudatoolkit               10.1.243             h6bb024c_0    defaults
[conda] mkl                       2020.2                      256    defaults
[conda] mkl-service               2.3.0            py36he904b0f_0    defaults
[conda] mkl_fft                   1.2.0            py36h23d657b_0    defaults
[conda] mkl_random                1.1.1            py36h0573a6f_0    defaults
[conda] numpy                     1.19.2           py36h54aff64_0    defaults
[conda] numpy-base                1.19.2           py36hfa32c7d_0    defaults
[conda] pytorch                   1.6.0           py3.6_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] torchvision               0.7.0                py36_cu101    pytorch
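
For what it's worth, here is a quick sanity check (my own snippet) that I would run in this environment to confirm which CUDA/cuDNN versions PyTorch actually sees and that a basic CUDA op works at all:

import torch

# Report the CUDA toolkit / cuDNN versions bundled with this PyTorch build and
# the visible device, then run a tiny matmul on the GPU as a smoke test.
print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())
print(torch.cuda.get_device_name(0))

a = torch.randn(512, 512, device="cuda")
b = torch.randn(512, 512, device="cuda")
torch.cuda.synchronize()
print((a @ b).sum().item())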

I’ve also posted an issue in the GitHub repository, but it didn’t seem to get a solution, so could someone tell me where I might possibly be going wrong? Thanks.