When I train my network, it raises this error: THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument.
Environment:
- docker: yes
- cuda: 10.0
- python: 3.7
- pytorch: 1.0
- cudnn: 7
A minimal example that reproduces it:
import torch
from torchvision.models import vgg16
model = vgg16().cuda()
x = torch.zeros((32, 3, 227, 227)).cuda()
model(x)
Output:
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
tensor([[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.],
…,
[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.]], device='cuda:0',
grad_fn=)
I found that after this error, if I run it a second time, no error is raised and the output is correct.
model(x)
tensor([[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.],
…,
[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 0.]], device='cuda:0',
grad_fn=)
So if I add this simple example to the beginning of my code, the error is raised there and I can ignore it. After that, my code trains normally.
crazy…
but it works!!!
Does anyone know the reason?
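A minimal sketch of that workaround, in case it helps anyone reproduce it. `warm_up` is a hypothetical helper name of my own, not part of PyTorch. It assumes the failure only affects the first forward pass, and it also catches a RuntimeError in case the error surfaces as a Python exception rather than just a printed message.

```python
def warm_up(forward, dummy_input):
    """Run one throwaway forward pass and discard its result.

    `forward` is any callable, e.g. a model already moved to the GPU.
    `dummy_input` is an input with the same shape used in training.
    Returns True if the call completed, False if it raised RuntimeError.
    """
    try:
        forward(dummy_input)
    except RuntimeError as err:
        # Assumption: a failure here is the one-off first-call error,
        # so it is reported and then ignored.
        print("warm-up pass raised (ignored):", err)
        return False
    return True

# Intended use before the real training loop (not run here):
#   model = vgg16().cuda()
#   warm_up(model, torch.zeros((32, 3, 227, 227)).cuda())
```

This only hides the symptom. It does not explain why the first call fails.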
But then it gets worse…
I found that training works normally, but when I run under 'with torch.no_grad():', with the same net, same weights, and same dataset, the outputs of the conv layers are all NaN.
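One way to narrow this down is to hook every conv layer and record which ones produce NaN under torch.no_grad(). This is only a rough sketch: `find_nan_layers` is a hypothetical helper name, and the small CPU network at the bottom is a stand-in for the real model, used only to show the call.

```python
import torch
import torch.nn as nn

def find_nan_layers(model, x):
    """Run model(x) under torch.no_grad() and return the names of
    Conv2d layers whose output contains at least one NaN."""
    bad = []
    handles = []
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            # Bind `name` as a default argument so each hook keeps its own layer name.
            def hook(mod, inputs, output, name=name):
                if torch.isnan(output).any():
                    bad.append(name)
            handles.append(module.register_forward_hook(hook))
    try:
        with torch.no_grad():
            model(x)
    finally:
        # Always remove the hooks so the model is left unchanged.
        for handle in handles:
            handle.remove()
    return bad

# Stand-in network on the CPU, only to illustrate usage.
net = nn.Sequential(nn.Conv2d(3, 4, 3), nn.ReLU(), nn.Conv2d(4, 2, 3))
print(find_nan_layers(net, torch.zeros(1, 3, 8, 8)))
```

On the real setup the call would be find_nan_layers(model, x) with the CUDA model and input from the example above. The first name in the returned list is the earliest conv layer that goes bad.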
Actually, I don't know if these issues are related… it's also possible that I hit two separate problems.
Has anyone come across the same issue?