RuntimeError: cuda runtime error (59) : device-side assert triggered , 2 different places please help me understand

Hi guys, I’m getting this kind of error in two places: "RuntimeError: cuda runtime error (59) : device-side assert triggered".

  1. when calculating accuracy:
      _, pred = outputs.topk(1, 1, True)
      pred = pred.t()
      correct = pred.eq(targets.view(1, -1))
      n_correct_elems = correct.float().sum().data[0]

Tried to do this:

       n_correct_elems = correct.float().sum().item()

It still didn’t help. The exact error is:

/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generated/…/THCReduceAll.cuh line=317 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "main.py", line 141, in <module>
    train_logger, train_batch_logger)
  File "/path/train.py", line 37, in train_epoch
    acc = calculate_accuracy(outputs, targets)
  File "/path/utils.py", line 58, in calculate_accuracy
    n_correct_elems = correct.float().sum().item()
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generated/…/THCReduceAll.cuh:317

I wanted to move forward, so I skipped this function (just returned a fixed number), and then it turned out it is failing in csv.writer (I checked the file path).

The exact error log is:

/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generic/THCTensorCopy.cpp line=70 error=59 : device-side assert triggered
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f80c09ac748>>
Traceback (most recent call last):
  File "path/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
    self._shutdown_workers()
  File "path/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
    self.worker_result_queue.get()
  File "path/anaconda3/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "path/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
    fd = df.detach()
  File "path/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "path/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "path/anaconda3/lib/python3.6/multiprocessing/connection.py", line 493, in Client
    answer_challenge(c, authkey)
  File "path/anaconda3/lib/python3.6/multiprocessing/connection.py", line 737, in answer_challenge
    response = connection.recv_bytes(256)        # reject large message
  File "path/anaconda3/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "path/anaconda3/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "path/anaconda3/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "main.py", line 141, in <module>
    train_logger, train_batch_logger)
  File "path/train.py", line 55, in train_epoch
    'lr': optimizer.param_groups[0]['lr']
  File "path/utils.py", line 41, in log
    self.logger.writerow(write_values)
  File "path/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 57, in __repr__
    return torch._tensor_str._str(self)
  File "path/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 256, in _str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "path/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 82, in __init__
    copy = torch.empty(tensor.size(), dtype=torch.float64).copy_(tensor).view(tensor.nelement())
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generic/THCTensorCopy.cpp:70

I must say that I debugged on Windows and the program ran; now I’m running it on a remote machine and it’s killing me, every step brings a different error. I’m not far from breaking down, please help.

Thanks.


The assertion `t >= 0 && t < n_classes` is the actual error message: the target tensor you passed to nll_loss contains an out-of-bounds class index.

You can’t recover from a CUDA device-side assert failure; this is a CUDA limitation. All you can do is fix the bug and restart your process.

Can you elaborate just a bit: what does “out-of-bounds class” mean, and what is nll_loss (the last one just out of curiosity)? The sizes of each tensor were checked carefully. I’d be happy for your help, because it seems I’m getting this error at every step, also in:

 targets = targets.to('cuda')

Edit: I checked again and it works on the CPU with batch size = 1; when the batch size is > 1, I get a similar error:
“cur_target >= 0 && cur_target < n_classes”
One more thing: it crashes only after a few steps… is this “hinting” at the problem?

Thanks.

The out-of-bounds error is thrown if you pass class indices that are negative (t < 0) or greater than or equal to the number of classes (t >= n_classes).
E.g. if you have 5 classes, the class indices should be in [0, 4].

Could you check your target tensor for these out of bounds values?

nn.NLLLoss is the negative log likelihood loss, which is used in the usual classification use case.
nn.CrossEntropyLoss internally applies nn.LogSoftmax followed by nn.NLLLoss to calculate the loss.
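Here is a small, self-contained sketch of that relationship (the batch size and the class count of 5 are made up, not taken from your code); it also shows how an out-of-bounds target fails:

    import torch
    import torch.nn as nn

    n_classes = 5
    logits = torch.randn(4, n_classes)        # raw model outputs for a batch of 4

    # nn.CrossEntropyLoss is LogSoftmax followed by NLLLoss
    ce = nn.CrossEntropyLoss()
    nll = nn.NLLLoss()
    log_probs = nn.LogSoftmax(dim=1)(logits)

    targets_ok = torch.tensor([0, 2, 4, 1])   # valid: every index is in [0, n_classes - 1]
    print(ce(logits, targets_ok))             # same value as the line below
    print(nll(log_probs, targets_ok))

    targets_bad = torch.tensor([0, 2, 5, 1])  # 5 is out of bounds for 5 classes
    # On the CPU this fails right away with an out-of-bounds target error;
    # on the GPU the same mistake surfaces as the device-side assert
    # `t >= 0 && t < n_classes` from your logs.
    # ce(logits, targets_bad)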


Thanks for your answer, but I still have some issues with that.
I’m debugging on my Windows machine (CPU only) and I don’t have this problem, but when I take exactly the same files to my other machine (Unix with 1 GPU) I do get this error. Do you have any idea why that is happening?
Maybe nn.CrossEntropyLoss.cuda() is acting differently?

Thanks.

That’s a bit strange indeed!
Could you load all your target values and check for the value range?
torch.unique might be helpful, or just a simple comparison (a quick check is sketched below).

If you don’t find any out-of-bounds values, would it be possible to provide the target tensor, i.e. upload it somewhere? I would like to have a look at it, to exclude a possible silent bug on the CPU side.
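Something along these lines would do (train_loader here is a dummy stand-in for your own DataLoader, and n_classes for the number of output units of your model):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    n_classes = 32   # placeholder: the number of outputs of your model's last layer

    # Dummy stand-in for your own dataset / DataLoader
    dummy_data = torch.randn(100, 10)
    dummy_labels = torch.randint(0, 40, (100,), dtype=torch.long)  # deliberately contains labels >= 32
    train_loader = DataLoader(TensorDataset(dummy_data, dummy_labels), batch_size=16)

    # Gather every target the loader will ever yield
    all_targets = torch.cat([targets.view(-1) for _, targets in train_loader])

    print(torch.unique(all_targets))           # every distinct label value

    # Any True here means nll_loss / nn.CrossEntropyLoss will assert on the GPU
    out_of_bounds = (all_targets < 0) | (all_targets >= n_classes)
    print(out_of_bounds.any().item())
    print(all_targets[out_of_bounds])          # the offending label values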

After printing the targets I see I do have out-of-bounds values.
I’ll go figure out why.



Sorry, I have the same problem as you. Can I ask why it crashes only after a few steps, and whether this is “hinting” at the problem? And finally, how did you solve your problem? I am not far from breaking down, please give me some advice, thanks very much.

I too face the same issue sporadically. I am using a GPU. During the first few runs there are no issues, but all of a sudden I receive the error even though there is no change in the code.

Can someone highlight how this can be fixed?

I found that the total number of output classes of my multi-class classifier was less than the number of labels. That’s why I got this error.

As for me:
number of labels: 0-31, i.e. 32 classes.
number of outputs: 0-30, i.e. only 31 classes.
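In other words (a minimal sketch; the 512 input features are made up), the last layer has to produce one output unit per label value:

    import torch.nn as nn

    n_classes = 32                       # labels take the values 0..31

    # Wrong: only 31 outputs, so label 31 trips the `t < n_classes` assert
    # classifier = nn.Linear(512, 31)

    # Correct: one output unit per class (512 in-features is just an example)
    classifier = nn.Linear(512, n_classes)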

I met the same issue as you.
Did you solve it?
Can you tell me how you solved it?
Best wishes!
