Overflow and get a very large number when using torch.topk

zhoulukuan · April 24, 2020, 4:21pm

When I use torch.topk to get indices, the program stops after a few training iterations with a PyTorch CUDA error: device-side assert triggered, THCTensorScatterGather, Assertion indexValue failed.
I printed the indices and found that torch.topk sometimes returns a very large number such as tensor(9223372034707292159, device='cuda:0'). As a result, the program cannot fetch the corresponding data with this index. How can I solve this overflow problem?

ptrblck · April 25, 2020, 9:23am

Could you post a code snippet to reproduce this error, please?
I assume you are using float32 tensors?
If you cannot post the code, could you post the shape of the input tensor and its range?
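For gathering that information, a small helper along these lines might be enough. This is only a sketch; `describe_tensor` is a hypothetical name, not a PyTorch function, and it assumes a floating-point input.

```python
import torch

def describe_tensor(t: torch.Tensor) -> dict:
    """Collect shape, dtype, device, finite value range and NaN/Inf counts.

    Hypothetical helper for illustration; assumes a floating-point tensor.
    """
    finite = t[torch.isfinite(t)]  # drop NaN/Inf so min/max are meaningful
    return {
        "shape": tuple(t.shape),
        "dtype": t.dtype,
        "device": str(t.device),
        "min": finite.min().item() if finite.numel() > 0 else None,
        "max": finite.max().item() if finite.numel() > 0 else None,
        "num_nan": int(torch.isnan(t).sum().item()),
        "num_inf": int(torch.isinf(t).sum().item()),
    }
```

Printing `describe_tensor(match_matrix)` right before the failing call should show whether the input itself already looks suspicious.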

zhoulukuan · April 25, 2020, 2:59pm

@ptrblck The code is adapted from https://github.com/open-mmlab/mmdetection/blob/master/mmdet/models/anchor_heads/free_anchor_retina_head.py.

match_matrix = bboxoverlaps(gt_bboxes, bbox_preds)
# match_matrix: torch.float32, shapes: [num_gts, num_anchors]
matched_iou, matched = torch.topk(match_matrix, 50, dim=1, sorted=False)

# cls_prob:  torch.float32, shapes: [num_anchors, 80]
temp = cls_prob[matched]

I have some difficulty reproducing this error because it occurs only intermittently. Sometimes it appears right after training starts, sometimes after a few epochs, and sometimes training completes without it. I am very distressed.

ptrblck · April 25, 2020, 10:49pm

Just to clarify: match_matrix sometimes contains these invalid values (9223372034707292159) and causes topk to crash?
If that’s the case, it seems that bboxoverlaps sometimes creates wrong results, and I would suggest checking the indexing in that method.

zhoulukuan · April 26, 2020, 12:15am

@ptrblck Does that mean I need to check the bboxoverlaps results? What counts as a wrong result, for example NaN? Are there other cases?
Besides, when I set a breakpoint, I found that once the above error occurs, all variables show "Unable to get repr for <class 'torch.Tensor'>" and cannot be inspected. What should I do?

ptrblck · April 26, 2020, 2:02am

Yes. I’m not sure what tensors are passed to this method, but the calculation might be wrong for “unexpected” tensors or you might run into an overflow.

NaN values can be created in various ways. Based on the posted tensor value, it doesn’t seem that you are running into NaNs but into invalid values.

I’m not sure which debugger you are using, but you could add assert statements (or conditions) and print the desired values.
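A minimal sketch of such checks, wrapped around the topk call from the snippet above. `checked_topk` is a hypothetical wrapper name, and the checks assume a float tensor like match_matrix. Note that each assert on a CUDA tensor forces a synchronization, so this is meant for debugging rather than regular training.

```python
import torch

def checked_topk(match_matrix: torch.Tensor, k: int, dim: int = 1):
    """Run torch.topk with sanity checks before and after the call.

    Hypothetical debugging wrapper, not part of PyTorch.
    """
    size = match_matrix.size(dim)
    # Check the input before topk sees it.
    assert k <= size, f"k={k} exceeds size {size} along dim {dim}"
    assert torch.isfinite(match_matrix).all(), "input contains NaN or Inf"

    values, indices = torch.topk(match_matrix, k, dim=dim, sorted=False)

    # Check that every returned index can be used to index along `dim`.
    in_range = (indices >= 0) & (indices < size)
    assert in_range.all(), (
        f"topk returned out-of-range indices: "
        f"min={indices.min().item()}, max={indices.max().item()}, size={size}"
    )
    return values, indices
```

Replacing the original call with `matched_iou, matched = checked_topk(match_matrix, 50, dim=1)` should make the failure surface as a readable Python assertion before cls_prob[matched] triggers the device-side assert.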

Boltzmachine · June 4, 2024, 2:15am

I also encounter this issue. The tensor does not contain NaNs.
The solution is to run the same code twice… and I do not know why it works

idx = tensor.topk(...)
idx = tensor.topk(...)

And if I run it on CPU, it works
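To narrow this down, one could compare the result on the tensor's own device against a CPU reference. The following is only a sketch; `topk_matches_cpu` is a hypothetical helper name, and it compares the selected values rather than the indices because ties may be broken differently on each device.

```python
import torch

def topk_matches_cpu(tensor: torch.Tensor, k: int, dim: int = -1) -> bool:
    """Return True if topk on the tensor's device agrees with topk on CPU.

    Hypothetical helper for illustration, not part of PyTorch.
    """
    dev_values, dev_indices = tensor.topk(k, dim=dim)
    cpu_values, _ = tensor.cpu().topk(k, dim=dim)

    size = tensor.size(dim)
    # Indices must be usable for indexing along `dim`.
    indices_ok = bool(((dev_indices >= 0) & (dev_indices < size)).all())
    # topk sorts by default, so the selected values should match exactly.
    values_ok = torch.equal(dev_values.cpu(), cpu_values)
    return indices_ok and values_ok

# Falls back to CPU when no GPU is present, so the sketch runs anywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.rand(8, 1000, device=device)
print(topk_matches_cpu(x, 50, dim=1))
```

Calling it on the actual tensor at the point where the first topk call returns the large values would show whether the GPU result differs from the CPU one for the same data.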

beny-maleki · January 19, 2025, 3:24pm

I have also run into the same issue! When I run the same line twice, the second result has valid values that are in the expected index range. The first result consists of strangely large values. Is this some sort of bug?

Xiuchen519 · April 29, 2025, 7:41am

I also encountered this problem. I reduced the batch size and it worked. It may be caused by OOM, but no OOM error is reported.