torch.topk overflows and returns a very large index

When I use torch.topk to get indices, training stops after a few iterations with a PyTorch CUDA error: device-side assert triggered, THCTensorScatterGather, Assertion indexValue failed.
I printed the indices and found that torch.topk sometimes returns a very large number such as tensor(9223372034707292159, device='cuda:0'). As a result, the program cannot fetch the corresponding data with this index. How can I solve this overflow problem?

Could you post a code snippet to reproduce this error, please?
I assume you are using float32 tensors?
If you cannot post the code, could you post the shape of the input tensor and its range?
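
One way to collect that information is a small helper called right before torch.topk. This is only a sketch: the name describe_tensor is made up for illustration, and the random tensor merely stands in for match_matrix.

```python
import torch


def describe_tensor(name, t):
    """Return a one-line summary of shape, dtype, value range and non-finite counts."""
    t = t.detach()
    finite = torch.isfinite(t)
    num_nan = int(torch.isnan(t).sum())
    num_inf = int(torch.isinf(t).sum())
    if finite.any():
        # Range over finite entries only, so one NaN does not hide the rest.
        lo = float(t[finite].min())
        hi = float(t[finite].max())
    else:
        lo = hi = float("nan")
    return (f"{name}: shape={tuple(t.shape)} dtype={t.dtype} "
            f"finite_min={lo:.6g} finite_max={hi:.6g} nan={num_nan} inf={num_inf}")


# Stand-in for match_matrix with shape [num_gts, num_anchors]; values are made up.
example = torch.rand(3, 100)
example[0, 5] = float("nan")
print(describe_tensor("match_matrix", example))
```

Logging this line every iteration (or only when a NaN or Inf count is non-zero) should show what the input looked like just before the failure.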

@ptrblck The code is adapted from https://github.com/open-mmlab/mmdetection/blob/master/mmdet/models/anchor_heads/free_anchor_retina_head.py.

match_matrix = bboxoverlaps(gt_bboxes, bbox_preds)
# match_matrix: torch.float32, shapes: [num_gts, num_anchors]
matched_iou, matched = torch.topk(match_matrix, 50, dim=1, sorted=False)

# cls_prob:  torch.float32, shapes: [num_anchors, 80]
temp = cls_prob[matched]

I have difficulty reproducing this error because it occurs intermittently. Sometimes it appears right after training starts, sometimes after a few epochs, and sometimes the whole training run completes without it. It is very frustrating.

Just to clarify: does match_matrix sometimes contain these invalid values (9223372034707292159), which then makes topk crash?
If that’s the case, it seems that bboxoverlaps sometimes produces wrong results, and I would suggest checking the indexing in that method.

@ptrblck Does that mean I need to check the bboxoverlaps results? What counts as a wrong result, for example NaN? Are there other cases?
Besides, when I set a breakpoint, I found that once the above error occurs, all variables show "Unable to get repr for <class 'torch.Tensor'>" and cannot be inspected. What should I do?

Yes. I’m not sure what tensors are passed to this method, but the calculation might be wrong for “unexpected” tensors or you might run into an overflow.

NaN values can be created in various ways. Based on the posted tensor value, it seems that you are running into invalid values rather than NaNs.
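
A quick arithmetic check on the posted number shows how far it is from anything usable: it sits just below the int64 maximum and has the bit pattern 0x7FFFFFFF7FFFFFFF. The snippet only inspects the value from the first post; num_anchors is a made-up example size.

```python
INT64_MAX = 2**63 - 1
bad_index = 9223372034707292159  # value printed in the first post

print(hex(bad_index))            # 0x7fffffff7fffffff
print(INT64_MAX - bad_index)     # 2147483648, i.e. 2**31

num_anchors = 100_000            # hypothetical size of dim 1 of match_matrix
print(0 <= bad_index < num_anchors)  # False: not a usable index
```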

I’m not sure which debugger you are using, but you could add assert statements (or conditions) and print the desired values.
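
As a sketch of such a check, the indices returned by torch.topk can be validated on the host before they are used for indexing, so that a bad value raises a readable Python error instead of a device-side assert. The helper name checked_topk and the toy shapes are assumptions for illustration and are not part of the linked code.

```python
import torch


def checked_topk(match_matrix, k):
    """Run torch.topk along dim=1 and verify inputs and indices before returning."""
    # Non-finite inputs are reported first, since they are easier to trace back.
    if not torch.isfinite(match_matrix).all():
        bad = (~torch.isfinite(match_matrix)).nonzero()
        raise ValueError(f"match_matrix has {bad.shape[0]} non-finite entries, "
                         f"first at {bad[0].tolist()}")
    values, indices = torch.topk(match_matrix, k, dim=1, sorted=False)
    num_anchors = match_matrix.shape[1]
    # .item() copies the result to the host, which also synchronizes with the GPU.
    lo, hi = indices.min().item(), indices.max().item()
    if lo < 0 or hi >= num_anchors:
        raise IndexError(f"topk index range [{lo}, {hi}] outside [0, {num_anchors})")
    return values, indices


# Toy data standing in for the tensors in the thread.
match_matrix = torch.rand(4, 200)      # [num_gts, num_anchors]
cls_prob = torch.rand(200, 80)         # [num_anchors, 80]
matched_iou, matched = checked_topk(match_matrix, 50)
temp = cls_prob[matched]               # every index was checked to be in range
print(temp.shape)                      # torch.Size([4, 50, 80])
```

The extra host copies slow training down a little, so a check like this is probably best enabled only while hunting the bug.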