Weird bug with the `max()` operation

(Pdb) out.data.max(1)[1].cpu().numpy().max()
18295877782208576
(Pdb) out.data.max(dim=1)[1].cpu().numpy().max()
22518346029203522
(Pdb) out.data.max(dim=0)[1].cpu().numpy().max()
22518346030252096
(Pdb) out.data.max(dim=2)[1].cpu().numpy().max()
799
(Pdb) out.data.max(dim=3)[1].cpu().numpy().max()
799
(Pdb) out.data.max(dim=1)[1].cpu().numpy().max()
18014746402881610
(Pdb) out.data.max(dim=0)[1].cpu().numpy().max()
22518346030252096
(Pdb) out.size()
torch.Size([1, 35, 800, 800])

Since out.size() is [1, 35, 800, 800], I would expect out.data.max(dim=1)[1].cpu().numpy().max() to be less than 35. Can anyone explain this?
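For reference, a quick sanity check of what the channel argmax can return (a minimal sketch with a random tensor standing in for the real network output):

import torch

out = torch.randn(1, 35, 800, 800)         # stand-in for the real network output
idx = out.max(dim=1)[1]                    # argmax over the 35 channels
print(idx.min().item(), idx.max().item())  # a correct argmax always lies in [0, 34]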

Hi,

It looks suspicious indeed.

  • Does out.data.max(1)[1].max() return the same value? (without sending it to numpy)
  • What version of pytorch are you using?
  • How is out obtained? What do out.is_contiguous() and out.sparse return?
  • Could you provide us with a small code sample to reproduce this?

  • out.data.max(1)[1].max() returns tensor(1.8296e+16, device='cuda:0')
  • The PyTorch version is 0.4.0
  • out is the output of DeepLabv3+ on the GTA5 dataset
  • out.is_contiguous() is True
  • out.sparse raises AttributeError: 'Tensor' object has no attribute 'sparse'

Another problem: even when out.data.max(1)[1].max() is less than 35, an error occurs with

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THC/generic/THCTensorCopy.c line=70 error=59 : device-side assert triggered
*** RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THC/generic/THCTensorCopy.c:70

and the error occurs right after loss = CrossEntropyLoss2d(out, targets).

OK,
I meant out.is_sparse, but I guess it's not really important here.
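For completeness, that check would look like this (a minimal sketch, assuming the same tensor shape as above):

import torch

out = torch.randn(1, 35, 800, 800)  # same stand-in tensor as above
print(out.is_sparse)                # False for a regular dense tensor
print(out.is_contiguous())          # True, as reported above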

If your code is crashing, then you should:

  • Run it on the CPU. Make sure it runs without errors and see if it shows the same behaviour (a sketch of this check follows this list).
  • If the CPU version does not crash and returns proper values, then run the CUDA version with CUDA_LAUNCH_BLOCKING=1 python your_script.py. This makes CUDA do a bit more error checking, and in particular it stops execution before returning garbage values. With this enabled, you should not see any garbage values anymore and you should get a proper error message.
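A minimal sketch of the CPU-side check, with dummy tensors standing in for the real DeepLabv3+ output and GTA5 labels, and nn.CrossEntropyLoss standing in for the custom CrossEntropyLoss2d:

import torch
import torch.nn as nn

# Dummy stand-ins for the real network output and labels (hypothetical shapes/values).
out = torch.randn(1, 35, 800, 800, requires_grad=True)            # network output on CPU
targets = torch.randint(0, 35, (1, 800, 800), dtype=torch.long)   # labels must lie in [0, 35)

pred = out.max(dim=1)[1]           # argmax over the 35 channels
print(pred.max().item())           # always < 35 on CPU

criterion = nn.CrossEntropyLoss()  # stand-in for CrossEntropyLoss2d
loss = criterion(out, targets)     # a label outside [0, 35) raises a readable error on CPU
loss.backward()

A common cause of this particular device-side assert is a target index outside [0, num_classes); the CPU run reports that directly instead of the opaque cuda runtime error (59).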