CUDA runtime error (59) : device-side assert triggered while using torch.topk()

Hello,
I am getting "cuda runtime error (59) : device-side assert triggered" when using torch.topk(). Can you help me?

Problem:

The code related to the error is:

_feature = _feature.view(_b, _c, -1)  # [N, C, K]
assert _feature is not None, "Error-1"
_feature_sum = torch.sum(_feature.pow(2), dim=1)  # [N, K]
assert _feature is not None, "Error-2"
_idx = torch.topk(_feature_sum, top, dim=-1, sorted=False)[1]  # [N, top]
if _idx.max() > 10000:
    print("_idx error")
    print(_idx.max())
_jdx = torch.arange(_b).unsqueeze(1).repeat(1, top)
_feature = _feature[_jdx, :, _idx]  # [N, top, C]

Given a tensor _feature of size [N, C, K], this code picks, for each sample, the top positions along the last dimension (ranked by squared feature magnitude) and gathers the corresponding [C]-dimensional features, producing a tensor of size [N, top, C].
The key line is _idx = torch.topk(_feature_sum, top, dim=-1, sorted=False)[1]  # [N, top], where top=50 and _feature_sum.shape == (N, 64*64). Sometimes the computed _idx.max() is wrong, like:


NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [197,0,0], thread: [32,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [197,0,0], thread: [33,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
.......
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [37,0,0], thread: [95,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/generated/../THCReduceAll.cuh line=317 error=59 : device-side assert triggered
_idx error
tensor(9223372034707292159, device='cuda:0')
Traceback (most recent call last):
  File "tools/train.py", line 153, in <module>
    main()
  File "tools/train.py", line 109, in main
    eval_loss=eval_loss
  File "tools/functions.py", line 33, in train_epoch
    out_dict = model_runner.train_one_batch(batch)
  File "models/basemodel_runner.py", line 176, in train_one_batch
    return self.forward(inp_dict)
  File "models/basemodel_runner.py", line 131, in forward
    loss_feature = self._feature_loss(_features, _mask)
  File "models/basemodel_runner.py", line 197, in _feature_loss
    feature_fg = self._sample_features(_feature, _mask, sample_fg, top=self.mask_sample_topk * 2)  # [N, top, C]
  File "models/basemodel_runner.py", line 250, in _sample_features
    if _idx.max() > 10000:
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/generated/../THCReduceAll.cuh:317

It seems the error is caused by the topk() function. I have searched for similar errors, but the ones I found were all caused by class labels < 0 in classification tasks.

Can anyone help me?

Try running the above code on the CPU. You will get a trace of the actual error. (CUDA device-side asserts are reported asynchronously, so the line shown in the GPU traceback is usually just the first synchronization point after the failure, not the operation that actually failed.)
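
For example, here is a minimal sketch of the CPU rerun (assuming _feature, top, and _b hold the same values as in your snippet): on the CPU, an out-of-bounds index raises an ordinary IndexError with a usable Python traceback instead of a device-side assert.

import torch

_feature_cpu = _feature.detach().cpu()                         # same data, but on the CPU
_feature_sum = torch.sum(_feature_cpu.pow(2), dim=1)           # [N, K]
_idx = torch.topk(_feature_sum, top, dim=-1, sorted=False)[1]  # [N, top]
_jdx = torch.arange(_b).unsqueeze(1).repeat(1, top)            # [N, top]
out = _feature_cpu[_jdx, :, _idx]                              # raises IndexError here if _idx is out of range

Alternatively, run the original script with the environment variable CUDA_LAUNCH_BLOCKING=1: kernel launches become synchronous, so the CUDA error is reported at the line that actually triggered it.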

Which version of PyTorch are you using? I can run this part of the code on its own successfully:

In [1]: import torch

In [2]: _feature = torch.rand(4,6,50)

In [3]: top = 10

In [4]: _b, _c = 4, 6

In [5]: _feature = _feature.view(_b, _c, -1)  # [N, C, K]
   ...: assert _feature is not None, "Error-1"
   ...: _feature_sum = torch.sum(_feature.pow(2), dim=1)  # [N, K]
   ...: assert _feature is not None, "Error-2"
   ...: _idx = torch.topk(_feature_sum, top, dim=-1, sorted=False)[1]  # [N, top]
   ...: if _idx.max() > 10000:
   ...:     print("_idx error")
   ...:     print(_idx.max())
   ...: _jdx = torch.arange(_b).unsqueeze(1).repeat(1, top)
   ...: _feature = _feature[_jdx, :, _idx]  # [N, top, C]

In [6]: _feature.shape
Out[6]: torch.Size([4, 10, 6])

The problem above appears randomly.

The stack trace points to an invalid index operation, so make sure _idx stays within the bounds of the dimension it indexes:

block: [197,0,0], thread: [32,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed
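
For instance, a defensive check like the following (just a sketch; K is the size of the indexed dimension, 64*64 in your case) placed right before the fancy indexing turns the device-side assert into a readable Python error:

K = _feature.shape[-1]  # size of the dimension being indexed
bad = (_idx < 0) | (_idx >= K)
if bad.any():
    raise RuntimeError(
        "topk returned out-of-range indices: max=%d, K=%d" % (int(_idx.max()), K)
    )

Note that if the CUDA context has already been corrupted by an earlier assert, even this check can fail, which is another reason to reproduce the problem on the CPU first.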

Yes, the error appears when _idx.max() is larger than the size of the indexed dimension. However, _idx is produced by torch.topk(), so the out-of-range indices actually come from topk() itself. Note that the log also prints "NaN or Inf found in input tensor." right before the failure, so _feature_sum may already contain NaN/Inf values when this happens.
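
Given those "NaN or Inf found in input tensor." lines, one possible guard (a sketch, assuming the NaN/Inf values originate upstream of this block, e.g. from a diverging loss) is to validate the topk() input before calling it:

if not torch.isfinite(_feature_sum).all():
    # Fail fast with a clear message instead of letting topk() run on
    # NaN/Inf input and return garbage indices.
    raise ValueError("_feature_sum contains NaN/Inf before topk()")

In newer PyTorch versions you could instead sanitize the tensor with torch.nan_to_num(_feature_sum) before the topk() call, but tracking down where the NaNs are produced (for example with torch.autograd.set_detect_anomaly(True)) is usually the better fix.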