CUDA error: device-side assert triggered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:569)

Hi all, when using PyTorch for training, I ran into the following problem:


PyTorch version: 1.2.0.dev20190715
CUDA version: 9.0
GPU: GTX 1080 Ti

I’ve spent some time debugging this problem, but unfortunately failed. I hope to get some help here!

Do you see the same error if you run the code on the CPU?
This might yield a clearer error message than the current CUDA one.
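A minimal sketch of such a CPU-only debugging run, using hypothetical stand-ins for the model and data (nothing here comes from the original script):

```python
import torch
import torch.nn as nn

# Keep everything on the CPU so a failing op raises an ordinary Python
# exception with a readable stack trace instead of an asynchronous
# device-side assert.
device = torch.device("cpu")          # instead of torch.device("cuda")
model = nn.Linear(128, 10).to(device)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(32, 128, device=device)
targets = torch.randint(0, 10, (32,), device=device)

loss = criterion(model(inputs), targets)
loss.backward()  # errors here now surface with a full Python trace
```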

If it’s working fine on the CPU, could you rerun the code using

CUDA_LAUNCH_BLOCKING=1 python script.py args

and post the stack trace again?

PS: You can post code directly by wrapping it in three backticks ``` :wink:

See ptrblck’s post for ways to debug the issue. But is there any reason you’re using an old nightly of 1.2.0 when 1.2.0 has now been released, and also running against CUDA 9.0 when PyTorch is built against 9.2? Either of those may well be the cause of the issue.

I’ve tried to rerun the code with “CUDA_LAUNCH_BLOCKING=1” before python. Unfortunately, the code couldn’t reach the training loop with this flag set.
I’ve added “torch.autograd.set_detect_anomaly(True)” to the code, and the error became:

 File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 539, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 1365, in linear
    ret = torch.addmm(bias, input, weight.t())

Traceback (most recent call last):
  File "/workspace/pyface/engine/trainer.py", line 287, in <module>
    main(args)
  File "/workspace/pyface/engine/trainer.py", line 69, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "/workspace/pyface/engine/trainer.py", line 137, in main_worker
    do_train(train_loader, model, criterion, optimizer, epoch, args, architect, valid_loader)
  File "/workspace/pyface/engine/trainer.py", line 270, in do_train
    loss.backward()
  File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'AddmmBackward' returned nan values in its 2th output. 

It seems there are some NaNs in the gradients?
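For reference, a minimal sketch of how one might confirm this right after `loss.backward()`; the helper below is an assumption and not part of the original code:

```python
import torch

def report_nan_grads(model):
    # Print every parameter whose gradient contains a NaN after backward().
    for name, param in model.named_parameters():
        if param.grad is not None and torch.isnan(param.grad).any():
            print(f"NaN gradient in {name}")
```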

Thanks for your advice. I’ve already changed the PyTorch version to 1.2 (stable), with CUDA 10.0 and cuDNN 7.6.
The error remains the same.

I found that when I run training with batch size 1024 on 8 GTX 1080 Ti cards, it reports the error. When I run the code with batch size 512 on 8 cards, the error disappears, and with a smaller batch size on a single card the error can’t be reproduced at all. So it might be hard to reproduce the error on the CPU.

Yes, it looks like a bug in PyTorch’s AddmmBackward (or something in autograd) causing it to output a NaN. Unless someone else can spot something I’m missing, you should probably submit an issue; it all seems to be PyTorch code.
To get a reliable reproduction, you could try adding code to your training loop that stores a reference to the current inputs (making sure to overwrite each batch rather than keeping old ones around). Then, when the error happens, you’ll have the inputs that caused it. You’ll also want to save the model weights, since the failure will depend on them. Hopefully that gives you a set of inputs and weights that reliably triggers it, as in the sketch below.
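A rough sketch of that approach, assuming a training loop shaped like the one in `trainer.py`; the file names and function signature here are made up for illustration:

```python
import torch

def do_train_with_capture(train_loader, model, criterion, optimizer, device):
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)

        # Overwrite the same files every iteration so only the most recent
        # (i.e. the failing) batch and weights remain on disk after a crash.
        torch.save({"inputs": inputs.cpu(), "targets": targets.cpu()}, "last_batch.pt")
        torch.save(model.state_dict(), "last_weights.pt")

        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()   # if this asserts, last_batch.pt and last_weights.pt
        optimizer.step()  # together form a candidate repro case
```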

The real issue is that an invariant condition failed inside topk. It would be very helpful if you could

  1. run with CUDA_LAUNCH_BLOCKING=1
  2. save every input to topk to a file (you can overwrite it on each iteration; see the sketch after this list)
  3. upload the input that triggers this assert.
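For step 2, one possible way to capture the topk inputs is a thin wrapper around `torch.topk`; the wrapper name and file path below are hypothetical:

```python
import torch

def topk_with_capture(x, k, path="last_topk_input.pt", **kwargs):
    # Save the exact tensor handed to topk before the call, overwriting the
    # previous dump, so the file left behind after the assert is the offender.
    torch.save({"input": x.detach().cpu(), "k": k}, path)
    return torch.topk(x, k, **kwargs)
```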

Thanks!

Really, thanks for the reply. I tried to reproduce the error on another GPU machine using the same Docker container setup and found that no error occurs there. The error may be highly related to a broken 1080 Ti card, although the regular checks don’t show any error.