Cannot access a tensor

Hello! Is there any way to inspect the values of a tensor that can’t be printed? My code fails with the error "CUDA error: device-side assert triggered", and the problem seems to be tied to one tensor. When debugging, I can read its shape, but trying to print its values raises another CUDA error (the output is the same after adding export CUDA_LAUNCH_BLOCKING=1):

  File "/home/chenz0f/X-Decoder/xdecoder/backbone/modules/swin3d_layers.py", line 50, in query_knn_feature
    print("src_feat:", src_feat)
  File "/home/chenz0f/anaconda3/envs/v100/lib/python3.10/site-packages/torch/_tensor.py", line 426, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/home/chenz0f/anaconda3/envs/v100/lib/python3.10/site-packages/torch/_tensor_str.py", line 636, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/home/chenz0f/anaconda3/envs/v100/lib/python3.10/site-packages/torch/_tensor_str.py", line 567, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/home/chenz0f/anaconda3/envs/v100/lib/python3.10/site-packages/torch/_tensor_str.py", line 327, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/home/chenz0f/anaconda3/envs/v100/lib/python3.10/site-packages/torch/_tensor_str.py", line 361, in get_summarized_data
    return torch.stack([get_summarized_data(x) for x in (start + end)])
  File "/home/chenz0f/anaconda3/envs/v100/lib/python3.10/site-packages/torch/_tensor_str.py", line 361, in <listcomp>
    return torch.stack([get_summarized_data(x) for x in (start + end)])
  File "/home/chenz0f/anaconda3/envs/v100/lib/python3.10/site-packages/torch/_tensor_str.py", line 353, in get_summarized_data
    return torch.cat(
RuntimeError: CUDA error: device-side assert triggered

I’m running my code on a V100 GPU. I’ve tried print(), detach(), cpu() and even torch.isnan() on this tensor, but they all raise the same CUDA assert error. Below is my code:

def query_knn_feature(
    K, src_xyz, query_xyz, src_feat, src_offset, query_offset, return_idx=False
):
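    """
    Gather features in the KNN neighborhood.
    """
    # Resolve the default first: calling is_contiguous() on None would fail.
    if query_xyz is None:
        query_xyz = src_xyz
        query_offset = src_offset
    assert (
        src_xyz.is_contiguous()
        and query_xyz.is_contiguous()
        and src_feat.is_contiguous()
    )

    # idx presumably holds, per query point, the indices of its K nearest source points.
    idx, _ = KNN.apply(K, src_xyz, query_xyz, src_offset, query_offset)

    m, c = query_xyz.shape[0], src_feat.shape[1]
    # Gather neighbor features; every value in idx must index a row of src_feat.
    grouped_feat = src_feat[idx.view(-1).long(), :].view(m, K, c)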

    if return_idx:
        return grouped_feat, idx
    else:
        return grouped_feat

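You are running into a sticky error: once the device-side assert fires, the CUDA context is corrupted, and accessing data or launching any other CUDA operation will re-raise the same error or a new one. Because CUDA kernels are launched asynchronously, the line shown in your stack trace is not necessarily the one that failed. Rerun your code via CUDA_LAUNCH_BLOCKING=1 python script.py args, check which line of code actually fails, and fix it.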
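
One candidate worth ruling out in the posted function is the gather src_feat[idx.view(-1).long(), :], since an out-of-range index in a CUDA indexing kernel is a common source of this assert. This is only a guess about the cause, not something the trace confirms. Below is a minimal sketch of a host-side check that fails with a readable Python error before the gather runs. check_index_bounds is a hypothetical helper, and the tensors are made-up CPU stand-ins:

```python
import torch


def check_index_bounds(idx: torch.Tensor, size: int, name: str = "idx") -> None:
    """Raise a readable IndexError if any index falls outside [0, size).

    Hypothetical debugging helper, not part of the original code.
    Negative values are rejected too, on the assumption that a neighbor
    index should never be negative.
    """
    if idx.numel() == 0:
        return
    # int(...) copies the scalar to the host, which forces a synchronization.
    lo, hi = int(idx.min()), int(idx.max())
    if lo < 0 or hi >= size:
        raise IndexError(
            f"{name} has values in [{lo}, {hi}], expected [0, {size - 1}]"
        )


# Intended placement inside query_knn_feature, just before the gather:
#     check_index_bounds(idx, src_feat.shape[0])
#     grouped_feat = src_feat[idx.view(-1).long(), :].view(m, K, c)

src_feat = torch.randn(4, 3)               # 4 source points, 3 channels
good_idx = torch.tensor([[0, 1], [2, 3]])
bad_idx = torch.tensor([[0, 1], [2, 4]])   # 4 is out of range for 4 rows

check_index_bounds(good_idx, src_feat.shape[0])  # passes silently
try:
    check_index_bounds(bad_idx, src_feat.shape[0])
except IndexError as err:
    message = str(err)
    print(message)
```

The check synchronizes with the device on every call, so it is meant as a temporary debugging aid rather than something to leave in the training loop.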

Thanks for the reply. I added export CUDA_LAUNCH_BLOCKING=1 and the output is still the same; it seems that any attempt to access this tensor triggers the device-side assert. The code trains smoothly for around 20 epochs before hitting this error, so I really have no idea what might cause it.
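
If the output does not change at all with the variable set, one thing worth ruling out is that the export never reaches the Python process, for example when the script is started from an IDE or a job scheduler. That is an assumption, not something the post confirms. A small sketch that sets the variable from inside the script, before anything initializes CUDA:

```python
import os

# The variable is read when CUDA is initialized, so it has to be set
# before the first CUDA call. The top of the entry script, ahead of
# `import torch`, is the safest place.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch and everything else that touches CUDA goes below this line.

blocking = os.environ.get("CUDA_LAUNCH_BLOCKING")
print("CUDA_LAUNCH_BLOCKING =", blocking)
```

With synchronous launches in effect, the Python line at the bottom of the traceback should be the operation that actually failed, rather than a later call such as print().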