With this code:
for _, (imgs, captions) in tqdm(enumerate(coco_dataloader), total=len(coco_dataloader), leave=False):
    imgs = imgs.to(Constants.device)
    captions = captions.to(Constants.device)
    outputs = model(imgs, captions[:-1])
    outputs1 = outputs.reshape(-1, outputs.shape[2])
    captions1 = captions.reshape(-1)
    loss = criterion(outputs1, captions1)
    print(f"\nTraining loss {loss.item()}, step {step}\n")
I find that when I move the images and captions to the CUDA device and then inspect their values, I see this exception message: "Unable to get repr for <class 'torch.Tensor'>".
One workaround is not to move the model, images and captions to the device, but then I get this exception instead:
"Input, output and indices must be on the current device".
So I can't do that either. I appear to be stuck between two errors; what should I do?
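For reference, the second error just means the model's weights and the input tensors are on different devices; the usual pattern is to pick one device and move the model and every input to it before the forward pass. A minimal sketch (using an `nn.Embedding` as a stand-in, since the real captioning model's code is not shown):

```python
import torch
import torch.nn as nn

# Pick the device once and reuse it everywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the captioning model from the question.
embed = nn.Embedding(num_embeddings=2000, embedding_dim=8).to(device)

# A few token ids like those in the question's caption tensor.
captions = torch.tensor([[1], [4], [1420], [5], [2]])
captions = captions.to(device)  # inputs and weights must share a device

out = embed(captions)           # no "must be on the current device" error
print(out.shape)                # torch.Size([5, 1, 8])
```

If only some tensors are moved (e.g. the model stays on CPU while `captions` goes to CUDA), the lookup fails with exactly the "Input, output and indices must be on the current device" message above.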
UPDATE
Example of the data on which the code fails:
captions = tensor([[ 1],[ 4], [1420], [ 406], [ 139], [ 35], [ 727], [1125], [ 765], [ 5], [ 2]])
imgs = tensor([[[[0.5490, 0.5373, 0.5333, ..., 0.5647, 0.5451, 0.5490],
[0.5412, 0.5490, 0.5490, ..., 0.5569, 0.5725, 0.5412],
[0.5176, 0.5490, 0.5137, ..., 0.5961, 0.5765, 0.5608],
...,
[0.8196, 0.7882, 0.8000, ..., 0.9294, 0.9333, 0.9373],
[0.8000, 0.8235, 0.8039, ..., 0.9373, 0.9333, 0.9373],
[0.8118, 0.7961, 0.7882, ..., 0.9333, 0.9412, 0.9412]],
[[0.1412, 0.1412, 0.1451, ..., 0.5333, 0.5333, 0.5569],
[0.1333, 0.1529, 0.1608, ..., 0.5176, 0.5451, 0.5333],
[0.1098, 0.1529, 0.1255, ..., 0.5412, 0.5333, 0.5255],
...,
[0.8667, 0.8353, 0.8431, ..., 0.9843, 0.9882, 0.9961],
[0.8471, 0.8706, 0.8471, ..., 0.9843, 0.9882, 0.9961],
[0.8588, 0.8431, 0.8314, ..., 0.9804, 0.9882, 0.9882]],
[[0.0275, 0.0235, 0.0235, ..., 0.5216, 0.5059, 0.5059],
[0.0196, 0.0353, 0.0392, ..., 0.5098, 0.5137, 0.4863],
[0.0000, 0.0353, 0.0039, ..., 0.5373, 0.5098, 0.4902],
...,
[0.8667, 0.8353, 0.8510, ..., 0.9843, 0.9882, 0.9843],
[0.8471, 0.8706, 0.8549, ..., 0.9843, 0.9882, 0.9843],
[0.8588, 0.8431, 0.8392, ..., 0.9804, 0.9882, 0.9804]]]])
Sometimes a batch like this goes through without error, and there is no obvious reason why this one should be any different.
I also get many error messages like these:
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: block: [0,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: block: [1,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\torch\_tensor_str.py", line 372, in _str
return _str_intern(self)
File "C:\Anaconda3\lib\site-packages\torch\_tensor_str.py", line 352, in _str_intern
tensor_str = _tensor_str(self, indent)
File "C:\Anaconda3\lib\site-packages\torch\_tensor_str.py", line 241, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "C:\Anaconda3\lib\site-packages\torch\_tensor_str.py", line 85, in __init__
value_str = '{}'.format(value)
File "C:\Anaconda3\lib\site-packages\torch\tensor.py", line 534, in __format__
return self.item().__format__(format_spec)
RuntimeError: CUDA error: device-side assert triggered
python-BaseException
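For context on what I have gathered so far: the `srcIndex < srcSelectDimSize` assertion appears to fire when an index fed to an embedding (or other index-select op) is out of range, e.g. a caption token id that is >= the embedding's `num_embeddings`. On CUDA the assert is reported asynchronously, which would explain why even printing the tensor afterwards fails with "device-side assert triggered". Running the same lookup on CPU raises an immediate, readable error instead. A minimal sketch of that kind of check (the vocabulary size of 1000 here is an assumption for illustration, not taken from my real model):

```python
import torch
import torch.nn as nn

vocab_size = 1000                    # hypothetical; substitute the real vocabulary size
embed = nn.Embedding(vocab_size, 8)  # kept on CPU so errors are raised eagerly

# Token ids from the failing batch above; note 1420 >= vocab_size.
captions = torch.tensor([[1], [4], [1420], [5], [2]])

# Cheap sanity check before moving anything to the GPU:
print(bool(captions.max().item() < vocab_size))  # False -> this batch would fail

try:
    embed(captions)   # on CPU this raises immediately with a clear message
    err = None
except IndexError as e:
    err = e
print("IndexError:", err)
```

So one thing I could verify is whether the batches that crash contain token ids outside the embedding's range, while the batches that "sometimes go through" do not.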