An error when using GPU

I guess that using cudnn.benchmark=True with PyTorch 1.0 + RTX 2080 + Dockerfile cuda10.0-cudnn7-devel-ubuntu18.04 causes this error.

With CUDA_LAUNCH_BLOCKING=1 the behavior is the same as without it: the problem still exists, but no other error appears.
Thank you so much for the reply!

The CUDA_LAUNCH_BLOCKING=1 env variable just makes sure to call all CUDA operations synchronously, so that an error message should point to the right line of code in the stack trace.
Did you get any errors? If so, could you post the stack trace?
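
For anyone who wants to try the CUDA_LAUNCH_BLOCKING=1 suggestion without touching their shell, here is a minimal sketch that launches a script in a child process with the variable set. The helper names (blocking_env, run_with_blocking) and the script path are hypothetical, not part of PyTorch; the only assumption is that the variable is read when CUDA is first initialized, so it has to be in the environment before the script starts.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # With this set, CUDA operations run synchronously, so an error should
    # surface at the Python line that triggered it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking(script):
    """Run `script` (a hypothetical path such as "test.py") in a child process."""
    return subprocess.run([sys.executable, script], env=blocking_env())
```

Calling run_with_blocking("test.py") should behave like prefixing the command with CUDA_LAUNCH_BLOCKING=1 on the command line.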

The error message is “THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument”. But I can’t find the THCGeneral.cpp.

Is your code running fine on the CPU? Could you post the whole stack trace?

I am also having the same issue with 2080Ti.

If I enable cudnn.benchmark, my code gives the error and crashes. If I disable cudnn.benchmark, my code still gives the same error but keeps running. Adding CUDA_LAUNCH_BLOCKING=1 doesn’t give any more details. It still shows that the code crashes when it reaches the first convolution.

The code was running fine on 1080Ti with cudnn.benchmark enabled.

Same here: I get this error message on my RTX 2080 Ti but not on the 1080 Ti, with the same PyTorch (1.0.0), CUDA (10.0.130), and Python (3.5.2).

Code to produce the warning/error:

import os
import torch

# force torch to use my RTX 2080TI GPU, modify or remove accordingly
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
torch.backends.cudnn.benchmark = True

from torchvision.models import vgg16
model = vgg16().cuda()
x = torch.zeros((32, 3, 227, 227)).cuda()
model(x)

Prints the error (THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument) and doesn’t return anything.

I have an RTX 2070 with CUDA 10, PyTorch 1.0, Python 3.6 on Ubuntu 18, and I get this error when running this project: https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix

With torch.backends.cudnn.benchmark = True I get the below stack trace and the program exits.

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
Traceback (most recent call last):
  File "test.py", line 60, in <module>
    model.test()           # run inference
  File "/home/jwickens/dev/face-translation/pytorch-CycleGAN-and-pix2pix/models/base_model.py", line 105, in test
    self.forward()
  File "/home/jwickens/dev/face-translation/pytorch-CycleGAN-and-pix2pix/models/test_model.py", line 65, in forward
    self.fake_B = self.netG(self.real_A)  # G(A)
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jwickens/dev/face-translation/pytorch-CycleGAN-and-pix2pix/models/networks.py", line 399, in forward
    return self.model(input)
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:663

Without that line I get a silent CUDA error once at the beginning, though the script still works: THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument

I also have the same silent error with this tutorial https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html

There are quite a few issues out there for this error message; some users say it's CUDA 9.2 and others say it's RTX cards.

RTX 2080 Ti with CUDA 10.0. I got the same problem. I followed the advice of others and changed torch.backends.cudnn.benchmark = True to False, and things started working again.
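
The workaround above can be made conditional, so that cards which ran fine with benchmarking keep it enabled. This is only a sketch based on the reports in this thread: the helper names are hypothetical, and matching on the device name is an assumption, since the crash was reported on RTX 2070/2080/2080 Ti cards but not on the 1080 Ti or Titan V. Other posts here suggest the real cause is the PyTorch/CUDA build, so treat this as a stopgap rather than a fix.

```python
def benchmark_is_safe(device_name):
    """Heuristic from this thread: the crash was reported on RTX 20-series
    cards (2070, 2080, 2080 Ti) but not on the 1080 Ti or Titan V."""
    name = device_name.upper().replace(" ", "")
    return "RTX20" not in name


def configure_cudnn():
    """Enable cudnn.benchmark only when the heuristic says it is safe."""
    import torch  # imported here so the helper above works without PyTorch

    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        torch.backends.cudnn.benchmark = benchmark_is_safe(name)
    return torch.backends.cudnn.benchmark
```

Call configure_cudnn() once at startup, before building the model.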

After installing PyTorch this way: pip install -U https://download.pytorch.org/whl/cu100/torch-1.0.0-cp36-cp36m-linux_x86_64.whl the errors disappear, even when you are using torch.backends.cudnn.benchmark = True.

Thanks! But I want to know how to solve this problem on PyTorch 1.0.0, CUDA 9.0, RTX 2080. Do I have to switch to CUDA 10.0?

I don’t have RTX 2080 cards and chances are that the driver shipped with CUDA 9.0 is not fully compatible with RTX 2080. I installed CUDA 10.1 at first. After that, I downgraded the CUDA version to 10.0 while not changing the driver. Hope this can help you.
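
To check which CUDA version a PyTorch install was actually built against, something like the following sketch may help. The function names are hypothetical. The check relies on the fact that Turing GPUs (RTX 20-series, compute capability 7.5) are supported from CUDA 10.0 onward; whether a CUDA 9.x build can still run on them is not something this snippet can tell you.

```python
def cuda_major(version):
    """Parse the major number from a CUDA version string such as "10.0.130"."""
    return int(version.split(".")[0])


def build_supports_turing(cuda_version):
    """True if the build targets CUDA 10.0 or newer.
    None means a CPU-only PyTorch build."""
    return cuda_version is not None and cuda_major(cuda_version) >= 10


def describe_install():
    """Collect the versions that matter for this error (requires PyTorch)."""
    import torch

    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,
        "cudnn": torch.backends.cudnn.version(),
    }
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
        info["capability"] = torch.cuda.get_device_capability(0)
    info["turing_ok"] = build_supports_turing(torch.version.cuda)
    return info
```

Running print(describe_install()) and posting the result alongside the stack trace makes it easier to compare setups.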

Hi,

I see the same issue, with pytorch 1.0.1.post2, CUDA10.0, RTX2080Ti. I can run on another GPU (tried TitanV and 1080Ti), but if running on the 2080Ti, with benchmark=True, I get this error message:

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
Traceback (most recent call last):
  File "", line 330, in <module>
    train(epoch)
  File "", line 173, in train
    stereo_out, theta, right_transformed = model(left,right)
  File "/home/yotamg/PycharmProjects/PSMNet/venv3/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "", line 139, in forward
    right_img_transformed, theta = self.stn(right_img)
  File "", line 127, in stn
    x,theta1 = stn(x, self.theta(x), mode=self.stn_mode)
  File "", line 131, in theta
    xs = self.localization(x)
  File "/home/yotamg/PycharmProjects/PSMNet/venv3/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yotamg/PycharmProjects/PSMNet/venv3/local/lib/python2.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/yotamg/PycharmProjects/PSMNet/venv3/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yotamg/PycharmProjects/PSMNet/venv3/local/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405

Is there already a solution for that?

Thanks, Yotam

I’ve got the same hardware (RTX 2080ti) and this fixed it for me. I had to update pytorch to use CUDA 10.

Thanks a lot. It works. But why does it work?

Thanks very much, this works for me! Phew!

I got the same error, even after I updated CUDA to 10.
I happened to find a way to remove it, and now my training code works.

  • cuda: 10.0
  • python: 3.7
  • pytorch: 1.0
  • cudnn: 7
  • GPU: 2080ti

However, another problem came up: when running under 'with torch.no_grad()', the outputs are all NaNs.
Does anyone know why?

Hi, I still have the problem with CUDA 10 and a 2080 Ti. Could you share your solution, please? @janehu

Setting torch.backends.cudnn.benchmark = True worked for me.

Thanks. I hit the same error after updating PyTorch 1.0 to 1.1 with an RTX 2080 Ti. Setting cudnn.benchmark = False avoids the error, but in PyTorch 1.0 cudnn.benchmark = True was no problem.