With this code:
for _, (imgs, captions) in tqdm(enumerate(coco_dataloader), total=len(coco_dataloader), leave=False):
    imgs = imgs.to(Constants.device)
    captions = captions.to(Constants.device)
    outputs = model(imgs, captions[:-1])
    outputs1 = outputs.reshape(-1, outputs.shape[2])
    captions1 = captions.reshape(-1)
    loss = criterion(outputs1, captions1)
    print(f"\nTraining loss {loss.item()}, step {step}\n")
I find that when I move the images and captions to the CUDA device and then inspect their values, I see this exception message: "Unable to get repr for <class 'torch.Tensor'>".
One workaround is not to move the model, images and captions to the device, but then I get this exception instead:
"Input, output and indices must be on the current device".
So I can't do that either. I appear to be stuck between two errors; what should I do?
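For reference, the second error just means the model's weights and the input tensors are on different devices; the usual pattern is to pick one device and move the model and every input to it before the forward pass. A minimal sketch (using an `nn.Embedding` as a stand-in, since the real captioning model's code is not shown):

```python
import torch
import torch.nn as nn

# Pick the device once and reuse it everywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the captioning model from the question.
embed = nn.Embedding(num_embeddings=2000, embedding_dim=8).to(device)

# A few token ids like those in the question's caption tensor.
captions = torch.tensor([[1], [4], [1420], [5], [2]])
captions = captions.to(device)  # inputs and weights must share a device

out = embed(captions)           # no "must be on the current device" error
print(out.shape)                # torch.Size([5, 1, 8])
```

If only some tensors are moved (e.g. the model stays on CPU while `captions` goes to CUDA), the lookup fails with exactly the "Input, output and indices must be on the current device" message above.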
UPDATE
Example of the data on which the code fails:
captions = tensor([[ 1],[ 4], [1420], [ 406], [ 139], [ 35], [ 727], [1125], [ 765], [ 5], [ 2]])
imgs = tensor([[[[0.5490, 0.5373, 0.5333, ..., 0.5647, 0.5451, 0.5490],
[0.5412, 0.5490, 0.5490, ..., 0.5569, 0.5725, 0.5412],
[0.5176, 0.5490, 0.5137, ..., 0.5961, 0.5765, 0.5608],
...,
[0.8196, 0.7882, 0.8000, ..., 0.9294, 0.9333, 0.9373],
[0.8000, 0.8235, 0.8039, ..., 0.9373, 0.9333, 0.9373],
[0.8118, 0.7961, 0.7882, ..., 0.9333, 0.9412, 0.9412]],
[[0.1412, 0.1412, 0.1451, ..., 0.5333, 0.5333, 0.5569],
[0.1333, 0.1529, 0.1608, ..., 0.5176, 0.5451, 0.5333],
[0.1098, 0.1529, 0.1255, ..., 0.5412, 0.5333, 0.5255],
...,
[0.8667, 0.8353, 0.8431, ..., 0.9843, 0.9882, 0.9961],
[0.8471, 0.8706, 0.8471, ..., 0.9843, 0.9882, 0.9961],
[0.8588, 0.8431, 0.8314, ..., 0.9804, 0.9882, 0.9882]],
[[0.0275, 0.0235, 0.0235, ..., 0.5216, 0.5059, 0.5059],
[0.0196, 0.0353, 0.0392, ..., 0.5098, 0.5137, 0.4863],
[0.0000, 0.0353, 0.0039, ..., 0.5373, 0.5098, 0.4902],
...,
[0.8667, 0.8353, 0.8510, ..., 0.9843, 0.9882, 0.9843],
[0.8471, 0.8706, 0.8549, ..., 0.9843, 0.9882, 0.9843],
[0.8588, 0.8431, 0.8392, ..., 0.9804, 0.9882, 0.9804]]]])
Sometimes a batch like this goes through without error, and there is no obvious reason why this one should be any different.
I also get many error messages like these:
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: block: [0,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Indexing.cu:605: block: [1,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\torch\_tensor_str.py", line 372, in _str
return _str_intern(self)
File "C:\Anaconda3\lib\site-packages\torch\_tensor_str.py", line 352, in _str_intern
tensor_str = _tensor_str(self, indent)
File "C:\Anaconda3\lib\site-packages\torch\_tensor_str.py", line 241, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "C:\Anaconda3\lib\site-packages\torch\_tensor_str.py", line 85, in __init__
value_str = '{}'.format(value)
File "C:\Anaconda3\lib\site-packages\torch\tensor.py", line 534, in __format__
return self.item().__format__(format_spec)
RuntimeError: CUDA error: device-side assert triggered
python-BaseException
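For context on what I have gathered so far: the `srcIndex < srcSelectDimSize` assertion appears to fire when an index fed to an embedding (or other index-select op) is out of range, e.g. a caption token id that is >= the embedding's `num_embeddings`. On CUDA the assert is reported asynchronously, which would explain why even printing the tensor afterwards fails with "device-side assert triggered". Running the same lookup on CPU raises an immediate, readable error instead. A minimal sketch of that kind of check (the vocabulary size of 1000 here is an assumption for illustration, not taken from my real model):

```python
import torch
import torch.nn as nn

vocab_size = 1000                    # hypothetical; substitute the real vocabulary size
embed = nn.Embedding(vocab_size, 8)  # kept on CPU so errors are raised eagerly

# Token ids from the failing batch above; note 1420 >= vocab_size.
captions = torch.tensor([[1], [4], [1420], [5], [2]])

# Cheap sanity check before moving anything to the GPU:
print(bool(captions.max().item() < vocab_size))  # False -> this batch would fail

try:
    embed(captions)   # on CPU this raises immediately with a clear message
    err = None
except IndexError as e:
    err = e
print("IndexError:", err)
```

So one thing I could verify is whether the batches that crash contain token ids outside the embedding's range, while the batches that "sometimes go through" do not.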