How to get rid of the “CUDA unspecified launch failure” error that happens after a few epochs?

I am getting a "CUDA error: unspecified launch failure" error when training. The run fails with this error 5 times in a row, then works without error on the 6th attempt. But this varies a lot: sometimes it fails 8 times and the 9th attempt works.

for i, (inputs, target, _) in enumerate(train_loader):
    print(torch.cuda.is_available())
    print(len(inputs))
    input_var = [input.cuda() for input in inputs]

Output:

True
136
Traceback (most recent call last):

  File "....\train.py", line 273, in <module>
    train(train_loader, model, criterion, optimizer, epoch)

  File ".....\train.py", line 75, in train
    input_var = [input.cuda() for input in inputs]

  File "......\train.py", line 75, in <listcomp>
    input_var = [input.cuda() for input in inputs]

RuntimeError: CUDA error: unspecified launch failure

Do you have any idea how I can fix the error? Thanks.

  • Windows 10
  • NVIDIA GeForce GTX 1060
  • PyTorch 1.6
  • CUDA 10.1

Could you update to the latest stable release and check whether this error is still raised after some epochs?
Also, could you check dmesg for any Xid errors?
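
In case it helps with the checks above, here is a minimal sketch for collecting the version details worth posting after an update. It also sets CUDA_LAUNCH_BLOCKING=1, which is a general CUDA debugging aid and not something suggested earlier in this thread: CUDA errors are reported asynchronously, so the traceback may point at a later line (such as the .cuda() call) rather than the operation that actually failed. The helper name collect_env_report is hypothetical, and the script assumes it runs at the very top of the training script, before CUDA is initialized.

```python
import os

# Must be set before CUDA is initialized, so do it before importing torch.
# With this set, kernel launches block until they finish, so an error is
# raised at the failing operation (at the cost of slower training).
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def collect_env_report():
    """Return version details to post alongside the error (hypothetical helper)."""
    report = {"CUDA_LAUNCH_BLOCKING": os.environ.get("CUDA_LAUNCH_BLOCKING")}
    try:
        import torch
    except ImportError:
        report["torch"] = None  # PyTorch is not installed in this environment
        return report
    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # CUDA version PyTorch was built with
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["device"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

If the traceback moves to a different line with this setting enabled, that line is the more likely source of the failure. Remove the setting again once debugging is done, since it slows training down.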