An error when using GPU


(张颖) #1

The error is
“THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument”. It doesn’t affect training or testing, but I would like to know the reason for this error.
My CUDA version is 9.0 and my Python version is 3.6.


Thank you for your help :smile:


#2

Does your code just run with this error at the beginning?
If so, are you using multiple GPUs?
Could you try to run your code with CUDA_LAUNCH_BLOCKING=1 python script.py args?

I’ve never seen a “silent” CUDA error. Usually my code just explodes with error messages when I mess up some CUDA calls. :smiley:


(Yang Lee) #3

Hello, has your problem been solved? I have run into the same problem.


(张颖) #4

I tried “CUDA_LAUNCH_BLOCKING=1 python all_resnet.py”, but it didn’t help. :(
Thank you very much for the reply! Merry Christmas :smile:


(张颖) #5

No, I haven’t solved it yet :(

Merry Christmas!


#6

Does your code crash with some error message or does it just get stuck, if you use CUDA_LAUNCH_BLOCKING=1?

Merry Christmas! :wink:


(Huang Wade) #7

Hi all,
I ran into the same problem. My CUDA version is 10.0 and my Python version is 3.6.
The error message I get is “RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405”

When I ran CUDA_LAUNCH_BLOCKING=1 python script.py args, it only used one GPU.

Does anyone have a solution?
Merry Christmas!


#8

Could you post the complete stack trace with the lines of code this error is pointing to?
Also, is your code working on the CPU?


(鄭仕群) #9

There seems to be a conflict between PyTorch 1.0 and the RTX 2080 (Ti) series.
I ran the same code with the same Docker image, and this error only occurred on the RTX 2080 Ti server. (My Docker image is built from the cuda10-cudnn7.4 image, and the driver version is 410.79.)


(鄭仕群) #10

I found the source of this problem: cudnn.benchmark


#11

Is it working if you disable cudnn.benchmark, or what do you mean?


(鄭仕群) #12

I guess that using cudnn.benchmark=True with PyTorch 1.0 + RTX 2080 + Dockerfile cuda10.0-cudnn7-devel-ubuntu18.04 causes this error.


(张颖) #13

The behavior is the same as without CUDA_LAUNCH_BLOCKING=1: the problem still exists, but there is no other error.
Thank you so much for the reply!


#14

The CUDA_LAUNCH_BLOCKING=1 env variable just makes sure to call all CUDA operations synchronously, so that an error message should point to the right line of code in the stack trace.
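If it is more convenient to set the variable from Python than from the shell, one option is to start the script in a child process with a modified environment. This is only a sketch: `blocking_env` and `run_blocking` are hypothetical helper names, and `script.py` stands in for your own script.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set."""
    env = dict(os.environ if base is None else base)
    # Synchronous kernel launches make a failure surface at the call
    # that caused it instead of at some later, unrelated call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(script, *args):
    """Run `script` with the current interpreter and synchronous CUDA launches."""
    cmd = [sys.executable, script, *args]
    return subprocess.run(cmd, env=blocking_env()).returncode


# Hypothetical usage, equivalent to `CUDA_LAUNCH_BLOCKING=1 python script.py args`:
# run_blocking("script.py", "args")
```

Using a child process avoids any question of whether the variable was set early enough in the current process.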
Did you get any errors? If so, could you post the stack trace?


(张颖) #15

The error message is “THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument”. But I can’t find THCGeneral.cpp on my machine.
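The path in the message most likely refers to the PyTorch source tree on the machine that built the binary, which would explain why the file is not present locally. To pull the file, line, and error code out of a log, a small parser like the following may help. It is a sketch: `parse_thc_error` is a hypothetical helper, and the pattern assumes the exact message layout quoted in this thread.

```python
import re

# Matches lines such as:
# THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
THC_PATTERN = re.compile(
    r"THCudaCheck FAIL file=(?P<file>\S+) line=(?P<line>\d+) "
    r"error=(?P<code>\d+) : (?P<message>.+)"
)


def parse_thc_error(text):
    """Return a dict for the first THCudaCheck failure in `text`, or None."""
    match = THC_PATTERN.search(text)
    if match is None:
        return None
    return {
        "file": match.group("file"),
        "line": int(match.group("line")),
        "code": int(match.group("code")),
        "message": match.group("message").strip(),
    }
```

Comparing the parsed `line` and `code` across machines makes it easier to tell whether two reports describe the same failure.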


#16

Is your code running fine on the CPU? Could you post the whole stack trace?


(Hei Law) #17

I am also having the same issue with 2080Ti.

If I enable cudnn.benchmark, my code gives the error and crashes. If I disable cudnn.benchmark, my code still gives the same error but keeps running. Adding CUDA_LAUNCH_BLOCKING=1 doesn’t give any more details. It still shows that the code crashes when it reaches the first convolution.

The code was running fine on 1080Ti with cudnn.benchmark enabled.


(JCL) #18

Same here. I get this error message on my RTX 2080 Ti but not on the 1080 Ti, with the same PyTorch (1.0.0), CUDA (10.0.130), and Python (3.5.2).

Code to produce the warning/error:

import os
import torch

# force torch to use my RTX 2080TI GPU, modify or remove accordingly
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
torch.backends.cudnn.benchmark = True

from torchvision.models import vgg16
model = vgg16().cuda()
x = torch.zeros((32, 3, 227, 227)).cuda()
model(x)

Prints the error (THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument) and doesn’t return anything.
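Since the reports in this thread differ mainly in GPU model and library versions, it can help to attach that information to a bug report. Below is a minimal sketch; `collect_env_info` is a hypothetical helper, and the guard around the import is only there so the snippet also runs on a machine without PyTorch.

```python
import importlib.util
import platform


def collect_env_info():
    """Gather version details that are useful in a bug report."""
    info = {"python": platform.python_version()}
    # Guard so the sketch also runs where PyTorch is not installed.
    if importlib.util.find_spec("torch") is None:
        return info
    import torch

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None for CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()
    info["cudnn_benchmark"] = torch.backends.cudnn.benchmark
    if torch.cuda.is_available():
        info["gpus"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return info


for key, value in collect_env_info().items():
    print(f"{key}: {value}")
```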


#19

I have an RTX 2070 with CUDA 10, PyTorch 1.0, and Python 3.6 on Ubuntu 18, and I get this error when running this project: https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix

With torch.backends.cudnn.benchmark = True I get the stack trace below and the program exits.

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
Traceback (most recent call last):
  File "test.py", line 60, in <module>
    model.test()           # run inference
  File "/home/jwickens/dev/face-translation/pytorch-CycleGAN-and-pix2pix/models/base_model.py", line 105, in test
    self.forward()
  File "/home/jwickens/dev/face-translation/pytorch-CycleGAN-and-pix2pix/models/test_model.py", line 65, in forward
    self.fake_B = self.netG(self.real_A)  # G(A)
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jwickens/dev/face-translation/pytorch-CycleGAN-and-pix2pix/models/networks.py", line 399, in forward
    return self.model(input)
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jwickens/.local/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:663

Without that line, I get a silent CUDA error once at the beginning, though the script still works: THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument

I also get the same silent error with this tutorial: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html

There are quite a few issues out there for this error message; some users say it’s CUDA 9.2 and others say it’s RTX cards.


(Georgios Georgiadis) #20

I got the same problem on an RTX 2080 Ti with CUDA 10.0. I followed the advice of others and changed `torch.backends.cudnn.benchmark = True` to `False`, and things started working again.
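For anyone landing here, the workaround reported in this thread can be sketched as follows. `disable_cudnn_benchmark` is a hypothetical wrapper, and the import guard exists only so the snippet runs where PyTorch is missing. Note that other posters above still saw the message printed once with the flag off, so this may avoid the crash without removing the message.

```python
import importlib.util


def disable_cudnn_benchmark():
    """Switch off the cuDNN autotuner; return the new flag value (None without PyTorch)."""
    # Guard so the sketch also runs where PyTorch is not installed.
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    # With benchmark=True, cuDNN times several convolution algorithms for
    # each new input shape and keeps the fastest one.
    torch.backends.cudnn.benchmark = False
    return torch.backends.cudnn.benchmark


# Call this before the first forward pass of the model.
disable_cudnn_benchmark()
```

In an existing script, the single assignment `torch.backends.cudnn.benchmark = False` near the top is enough; the wrapper is only for illustration.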