An error when using GPU

zhangying1230 · December 21, 2018, 2:05pm

The error is
“THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument”. It doesn’t affect training or testing, but I would like to know what causes it.
My CUDA version is 9.0 and my Python version is 3.6.

Thank you for your help.

ptrblck · December 22, 2018, 8:58am

Does your code just run with this error at the beginning?
If so, are you using multiple GPUs?
Could you try to run your code with CUDA_LAUNCH_BLOCKING=1 python script.py args?

I’ve never seen a “silent” CUDA error. Usually my code just explodes with error messages when I mess up some CUDA calls.

yang_lee · December 23, 2018, 9:30am

Hello, has your problem been solved? I have run into the same problem.

zhangying1230 · December 24, 2018, 2:27pm

I tried running “CUDA_LAUNCH_BLOCKING=1 python all_resnet.py”, but it doesn’t work. :(
Thank you very much for your reply! Merry Christmas :smile:

zhangying1230 · December 24, 2018, 2:29pm

No, I haven’t solved it yet :(

Merry Christmas!

ptrblck · December 25, 2018, 12:29pm

Does your code crash with some error message or does it just get stuck, if you use CUDA_LAUNCH_BLOCKING=1?

Merry Christmas!

Huang_Wade · December 28, 2018, 9:28am

Hi all,
I ran into the same problem. My CUDA version is 10.0 and my Python version is 3.6.
The error message I got is “RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405”

After I ran CUDA_LAUNCH_BLOCKING=1 python script.py args, it only used one GPU.

Does anyone have a solution?
Merry Christmas!

ptrblck · December 28, 2018, 4:40pm

Could you post the complete stack trace with the lines of code this error is pointing to?
Also, is your code working on the CPU?

andys0975 · January 17, 2019, 4:09am

It seems that a conflict occurs when using PyTorch 1.0 with the RTX 2080 (Ti) series.
I ran the same code with the same Docker image, and this error only happened on the RTX 2080 Ti server. (My Docker image is built from the cuda10-cudnn7.4 image and the driver version is 410.79.)

andys0975 · January 17, 2019, 11:28am

I found the source of this problem: cudnn.benchmark

ptrblck · January 18, 2019, 3:22am

Is it working if you disable cudnn.benchmark, or what do you mean?

andys0975 · January 18, 2019, 3:57am

I guess that using cudnn.benchmark=True with PyTorch 1.0 + RTX 2080 + Dockerfile cuda10.0-cudnn7-devel-ubuntu18.04 causes this error.

zhangying1230 · January 23, 2019, 12:54pm

It behaves the same as without CUDA_LAUNCH_BLOCKING=1: the problem still exists, but there is no other error.
Thank you so much for your reply!

ptrblck · January 23, 2019, 6:32pm

The CUDA_LAUNCH_BLOCKING=1 env variable just makes sure to call all CUDA operations synchronously, so that an error message should point to the right line of code in the stack trace.
Did you get any errors? If so, could you post the stack trace?
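
For reference, the shell form `CUDA_LAUNCH_BLOCKING=1 python script.py args` can also be reproduced from Python. This is only a minimal sketch and not something from the posts above: `run_blocking` is a hypothetical helper name, and the child command here merely echoes the variable to show that it reaches the child process. You would substitute your own script and arguments.

```python
import os
import subprocess
import sys


def run_blocking(cmd):
    """Run `cmd` in a child process with CUDA_LAUNCH_BLOCKING=1 set."""
    # Copy the current environment and add the variable for the child only.
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Placeholder child command: prints the variable as seen by the child.
# Replace with e.g. [sys.executable, "script.py", "args"].
result = run_blocking(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"]
)
print(result.stdout.strip())
```

The variable is set on the child process because it should be in the environment before CUDA is initialized, which is the safest assumption when the script's internals are unknown.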

zhangying1230 · January 25, 2019, 2:22pm

The error message is “THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument”. But I can’t find THCGeneral.cpp on my machine.

ptrblck · January 25, 2019, 3:52pm

Is your code running fine on the CPU? Could you post the whole stack trace?
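
A CPU check along these lines could look like the sketch below. It is an assumption-laden stand-in, not the poster's model: the single `Conv2d` layer and the input shape are placeholders chosen only because the reports in this thread point at convolutions, and `run_on_cpu` is a hypothetical name. The import guard just lets the snippet run on machines without PyTorch.

```python
import importlib.util


def run_on_cpu():
    """Run a small convolution on the CPU and return the output shape."""
    import torch

    # Placeholder layer and input; swap in your own model and data.
    model = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)
    x = torch.zeros(2, 3, 32, 32)
    with torch.no_grad():
        out = model(x)  # nothing is moved to the GPU here
    return tuple(out.shape)


# Skip the check entirely if PyTorch is not installed.
cpu_shape = run_on_cpu() if importlib.util.find_spec("torch") is not None else None
print(cpu_shape)
```

If the same code runs cleanly on the CPU, that suggests the problem lies in the GPU code path rather than in the model definition.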

heilaw · January 25, 2019, 11:31pm

I am also having the same issue with 2080Ti.

If I enable cudnn.benchmark, my code gives the error and crashes. If I disable cudnn.benchmark, my code still gives the same error but it can still run. Adding CUDA_LAUNCH_BLOCKING=1 doesn’t give any more details. It still shows that the code crashes when it reaches the first convolution.

The code was running fine on 1080Ti with cudnn.benchmark enabled.

jclevesque · January 30, 2019, 3:45pm

Same here. I get this error message on my RTX 2080 Ti but not on the 1080 Ti, with the same PyTorch (1.0.0), CUDA (10.0.130), and Python (3.5.2).

Code to produce the warning/error:

```python
import os
import torch

# force torch to use my RTX 2080TI GPU, modify or remove accordingly
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
torch.backends.cudnn.benchmark = True

from torchvision.models import vgg16
model = vgg16().cuda()
x = torch.zeros((32, 3, 227, 227)).cuda()
model(x)
```

Prints the error (THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument) and doesn’t return anything.
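
When comparing two machines like this, it may help to print the exact versions on each one and attach them to the report. The following is a minimal sketch: `describe_environment` is a hypothetical helper name, and the import guard only exists so that the snippet also runs where PyTorch is not installed.

```python
import importlib.util
import platform


def describe_environment():
    """Collect version details worth including in a bug report."""
    info = {"python": platform.python_version()}
    if importlib.util.find_spec("torch") is None:
        return info  # PyTorch not installed; report Python only

    import torch

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # CUDA version PyTorch was built with
    info["cudnn"] = torch.backends.cudnn.version()
    if torch.cuda.is_available():
        info["gpus"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return info


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```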

jwickens · February 11, 2019, 6:11am

I have an RTX 2070 with CUDA 10, PyTorch 1.0, and Python 3.6 on Ubuntu 18, and I get this error when running this project: https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix

With torch.backends.cudnn.benchmark = True I get the below stack trace and the program exits.

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
Traceback (most recent call last):
  File "test.py", line 60, in <module>
    model.test()           # run inference
  File "/home/jwickens/dev/face-translation/pytorch-CycleGAN-and-pix2pix/models/base_model.py", line 105, in test
    self.forward()
  File "/home/jwickens/dev/face-translation/pytorch-CycleGAN-and-pix2pix/models/test_model.py", line 65, in forward
    self.fake_B = self.netG(self.real_A)  # G(A)
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jwickens/dev/face-translation/pytorch-CycleGAN-and-pix2pix/models/networks.py", line 399, in forward
    return self.model(input)
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:663

Without that line I get a silent CUDA error once at the beginning. The script works though. THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument

I also have the same silent error with this tutorial https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html

There are quite a few issues out there for this error message; some users say the cause is CUDA 9.2 and others say it is RTX cards.

ggeor · February 14, 2019, 7:59pm

RTX 2080 Ti with CUDA 10.0. I got the same problem. I followed the advice of others and changed `torch.backends.cudnn.benchmark = True` to `False`, and things started working again.
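
The workaround reported in this thread can be sketched as follows. This is a sketch under stated assumptions, not a confirmed fix: `AFFECTED_GPUS` and `benchmark_allowed` are hypothetical names, and the list only reflects cards that posters here mentioned, not a verified list of affected hardware. The import guard lets the snippet run without PyTorch installed.

```python
import importlib.util

# Device-name fragments taken from reports in this thread (assumption, not exhaustive).
AFFECTED_GPUS = ("RTX 2070", "RTX 2080")


def benchmark_allowed(device_name):
    """Return False for GPUs that posters in this thread reported as affected."""
    return not any(tag in device_name for tag in AFFECTED_GPUS)


if importlib.util.find_spec("torch") is not None:
    import torch

    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        # Keep the cuDNN autotuner off on the reported cards, on elsewhere.
        torch.backends.cudnn.benchmark = benchmark_allowed(name)
        print(name, "-> cudnn.benchmark =", torch.backends.cudnn.benchmark)
```

Simply setting `torch.backends.cudnn.benchmark = False` unconditionally is the shorter form of the same workaround; the name check only preserves the autotuner on cards where it was reported to work, such as the 1080 Ti.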