Hi all,
I get a cuda runtime error (59) : device-side assert triggered error when using torch.topk(). Can you help me?
Problem:
The code that triggers the error is:
_feature = _feature.view(_b, _c, -1) # [N, C, K]
assert _feature is not None, "Error-1"
_feature_sum = torch.sum(_feature.pow(2), dim=1) # [N, K]
assert _feature is not None, "Error-2"
_idx = torch.topk(_feature_sum, top, dim=-1, sorted=False)[1] # [N, top]
if _idx.max() > 10000:
    print("_idx error")
    print(_idx.max())
_jdx = torch.arange(_b).unsqueeze(1).repeat(1, top)
_feature = _feature[_jdx, :, _idx] # [N, top, C]
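Given a tensor _feature of size [N, C, K], this code selects, for each of the N batch elements, the top positions along K with the largest squared L2 norm over the channels. Each selected sample is a feature vector of size [C], so the result has size [N, top, C].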
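For reference, the selection described above can be written as a small self-contained sketch that runs on the CPU. The function name sample_top_features and the random input are placeholders for illustration only; they are not taken from the real training code.

```python
import torch


def sample_top_features(feature: torch.Tensor, top: int) -> torch.Tensor:
    """Gather the `top` positions with the largest squared channel norm.

    Hypothetical helper mirroring the snippet above; returns [N, top, C].
    """
    b, c = feature.shape[:2]
    feature = feature.view(b, c, -1)                        # [N, C, K]
    energy = feature.pow(2).sum(dim=1)                      # [N, K]
    idx = torch.topk(energy, top, dim=-1, sorted=False)[1]  # [N, top]
    # Batch index with the same shape as idx, on the same device.
    batch_idx = torch.arange(b, device=feature.device).unsqueeze(1).expand(-1, top)
    # The two index tensors are separated by a slice, so the indexed
    # dimensions come first in the result: [N, top, C].
    return feature[batch_idx, :, idx]


x = torch.randn(2, 3, 4, 4)          # N=2, C=3, K=4*4
out = sample_top_features(x, top=5)
print(out.shape)
```

With finite inputs like these, every index returned by topk lies in [0, K-1], so the final indexing step stays in bounds.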
The key line is _idx = torch.topk(_feature_sum, top, dim=-1, sorted=False)[1] # [N, top], where top=50 and _feature_sum.shape=(N, 64*64). Since K = 64*64 = 4096, every index should lie in [0, 4095], but sometimes the computed _idx.max() is far out of range, as in this log:
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [197,0,0], thread: [32,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [197,0,0], thread: [33,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
.......
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [37,0,0], thread: [95,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/generated/../THCReduceAll.cuh line=317 error=59 : device-side assert triggered
_idx error
tensor(9223372034707292159, device='cuda:0')
Traceback (most recent call last):
File "tools/train.py", line 153, in <module>
main()
File "tools/train.py", line 109, in main
eval_loss=eval_loss
File "tools/functions.py", line 33, in train_epoch
out_dict = model_runner.train_one_batch(batch)
File "models/basemodel_runner.py", line 176, in train_one_batch
return self.forward(inp_dict)
File "models/basemodel_runner.py", line 131, in forward
loss_feature = self._feature_loss(_features, _mask)
File "models/basemodel_runner.py", line 197, in _feature_loss
feature_fg = self._sample_features(_feature, _mask, sample_fg, top=self.mask_sample_topk * 2) # [N, top, C]
File "models/basemodel_runner.py", line 250, in _sample_features
if _idx.max() > 10000:
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/generated/../THCReduceAll.cuh:317
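Two notes on reading this log. CUDA kernels run asynchronously, so the Python line in the traceback is not necessarily the line that triggered the assert; running with the environment variable CUDA_LAUNCH_BLOCKING=1 makes errors surface at the failing call. Also, the "NaN or Inf found in input tensor" lines appear right before the assertion, so one possibility (an assumption, not confirmed by the log) is that _feature_sum contains non-finite values by the time it reaches topk. A minimal sketch of a guard that would test this is below; checked_topk_indices is a hypothetical helper name.

```python
import torch


def checked_topk_indices(scores: torch.Tensor, top: int) -> torch.Tensor:
    """Return top-k indices along the last dim, failing early on bad input.

    Hypothetical debugging helper; `scores` plays the role of _feature_sum.
    """
    finite = torch.isfinite(scores)
    if not finite.all():
        bad = int((~finite).sum().item())
        raise ValueError(f"{bad} non-finite value(s) in topk input")
    idx = torch.topk(scores, top, dim=-1, sorted=False)[1]
    k = scores.shape[-1]
    # Valid indices must lie in [0, k - 1].
    if idx.min().item() < 0 or idx.max().item() >= k:
        raise IndexError(f"topk returned an index outside [0, {k - 1}]")
    return idx


good = torch.rand(2, 16)
idx = checked_topk_indices(good, top=4)

bad = good.clone()
bad[0, 3] = float("nan")
try:
    checked_topk_indices(bad, top=4)
    raised = False
except ValueError:
    raised = True
```

If the guard fires during training, the problem is upstream of topk (for example, in the features themselves) rather than in the index computation.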
The error seems to be caused by the topk() function. I have searched for similar errors, but the ones I found were caused by class labels < 0 in classification tasks, which does not apply here.
Can anyone help me?