Unable to print a tensor

Hi! I have been trying to draw some samples from a weight tensor and then do some follow-up computation. This is part of a very large project, and it sometimes breaks down after running for millions of iterations.

Therefore, I tried to print out the sample_op and weight tensors to debug. However, the line “print(sample_op)” raises another error, so it looks like I cannot even print the sample_op tensor. I would appreciate some help. Thanks for your attention!

The code is like this:

import torch
import torch.nn.functional as F

# sample_weight and op_weight are both tensors; i comes from an enclosing loop
sample_op = torch.multinomial(sample_weight, 2, replacement=False)
try:
    # gather the weights of the sampled ops and normalize them
    probs_slice = F.softmax(torch.stack([
        op_weight.data[i, idx] for idx in sample_op]),
        dim=0)
except RuntimeError:
    print(sample_op)
    print(op_weight)
    exit(0)

The error log is like this:

File "/cache/user-job-dir/nas-branch/models/model_search.py", line 109, in binarize
    print(sample_op)
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/tensor.py", line 114, in __repr__
    return torch._tensor_str._str(self)
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/_tensor_str.py", line 311, in _str
    tensor_str = _tensor_str(self, indent)
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/_tensor_str.py", line 209, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/_tensor_str.py", line 83, in __init__
    value_str = '{}'.format(value)
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/tensor.py", line 361, in __format__
    return self.item().__format__(format_spec)
RuntimeError: CUDA error: device-side assert triggered

The device assert is the “hard error” option for CUDA kernels: it invalidates the CUDA context, which for practical purposes means you need to restart PyTorch. The problem is that, due to the asynchronous nature of CUDA, there isn’t any good way to report exactly where the error occurred. To find out which kernel is the one erroring, either insert torch.cuda.synchronize() between lines or run with the environment variable CUDA_LAUNCH_BLOCKING=1 set.
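For example, to get a synchronous stack trace you can set the variable when launching the script (train.py is a placeholder; substitute your own entry point):

```shell
# Force every CUDA kernel launch to block until it completes,
# so the Python traceback points at the kernel that actually failed.
# This slows training down, so only use it for debugging.
CUDA_LAUNCH_BLOCKING=1 python train.py
```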
My guess would be torch.multinomial. For example, it is allergic to all-zero probability tensors (and because it returns integers, it cannot use NaN to signal a problem).
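A minimal sketch of that failure mode on CPU, where the same degenerate input surfaces as an ordinary, catchable RuntimeError rather than a device-side assert:

```python
import torch

# Degenerate input: every sampling weight is zero, so no valid
# distribution exists for multinomial to draw from.
weights = torch.zeros(4)

try:
    torch.multinomial(weights, 2, replacement=False)
except RuntimeError as e:
    # On CPU, multinomial rejects the all-zero weights up front.
    # On CUDA, the equivalent check fires asynchronously as a
    # device-side assert, which is what poisons the context.
    print("multinomial rejected all-zero weights:", e)
```

So one way to debug this is to run the failing batch on CPU, where the error message and traceback are precise.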

Best regards

Thomas

Thanks, I will take a look!