Invalid argument error with 2080Ti and CUDA 10; the error goes away, but another one appears

janehu · April 26, 2019, 1:51am

When I train my network, it raises this error: THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument.

Environment:

docker: yes
cuda: 10.0
python: 3.7
pytorch: 1.0
cudnn: 7
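
To double-check the versions listed above, it may help to print what PyTorch itself reports from inside the container. This is only a rough sketch: `describe_environment` is a made-up helper name, not something from my training code. As far as I know the 2080 Ti should report compute capability (7, 5).

```python
import torch

def describe_environment():
    """Collect the version details PyTorch reports (hypothetical helper)."""
    info = {
        "torch": torch.__version__,
        # CUDA version this PyTorch build was compiled against (None on CPU-only builds)
        "cuda_build": torch.version.cuda,
        # cuDNN version as an integer, or None if cuDNN is not available
        "cudnn": torch.backends.cudnn.version(),
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info

for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If `cuda_build` differs from the CUDA version installed in the image, that would be worth mentioning alongside the list above.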

A simple example:
import torch
from torchvision.models import vgg16
model = vgg16().cuda()
x = torch.zeros((32, 3, 227, 227)).cuda()
model(x)

Output:
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0',
grad_fn=)

I found that after this error, if I run it a second time, no error is raised and the output is correct.

model(x)
tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0',
grad_fn=)

So if I add a simple example like this at the beginning of my code, the error shows up there and I can ignore it. After that, my code trains normally.
Crazy…
But it works!!!
Does anyone know the reason?
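
The workaround described above could look roughly like this. It is a sketch under assumptions: `warm_up` is a hypothetical name, and I use a tiny conv layer as a stand-in for the vgg16 pass. Whether a layer this small triggers the same first-call message on the affected GPU is untested.

```python
import torch
import torch.nn as nn

def warm_up(device, size=8):
    """Run one throwaway forward pass so any first-call CUDA error shows up here."""
    # Tiny stand-in layer; the original workaround used a full vgg16 forward pass.
    layer = nn.Conv2d(3, 4, kernel_size=3, padding=1).to(device)
    dummy = torch.zeros(1, 3, size, size, device=device)
    with torch.no_grad():
        out = layer(dummy)
    if device.type == "cuda":
        # Wait for the GPU so the message appears before real training starts.
        torch.cuda.synchronize()
    return out.shape

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(warm_up(device))
```

This only hides the message; it does not explain why the first call fails.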

But then comes the evil…
I found that training works normally, but when I run under 'with torch.no_grad():' with the same net, same weights, and same dataset, the outputs of the conv layers are all NaN.
Actually, I don't know whether these issues are related… it's also possible that I hit two separate problems.
Has anyone come across the same issue?
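
One way to narrow down where the NaN first appears is to attach forward hooks and run a single batch under `torch.no_grad()`. This is a minimal sketch, not my actual code: `find_nan_layers` is a hypothetical helper, and the small `net` below is only a stand-in for the real network.

```python
import torch
import torch.nn as nn

def find_nan_layers(model, x):
    """Return the names of leaf modules whose output contains NaN for input x."""
    bad = []
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            if torch.is_tensor(output) and torch.isnan(output).any():
                bad.append(name)
        return hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # hook leaf modules only
            handles.append(module.register_forward_hook(make_hook(name)))

    try:
        with torch.no_grad():
            model(x)
    finally:
        for handle in handles:
            handle.remove()  # always detach the hooks again
    return bad

# Stand-in network; replace with the real model and a real batch.
net = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU())
print(find_nan_layers(net, torch.zeros(1, 3, 8, 8)))
```

The first name in the returned list is the earliest layer that produced NaN. Re-running with `torch.backends.cudnn.enabled = False` might show whether cuDNN is involved, but that is a guess on my side, not a confirmed cause.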

janehu · April 28, 2019, 3:05am

I tested on a different GPU with the same Docker image and the same code, and everything works fine.