Inference works on dataset but not on standalone image

I saved a model’s state dict and am trying to load it back to perform inference on random images that weren’t in the training/validation dataset. Here’s my code:

import torch
import torchvision.transforms.functional as F
from PIL import Image

VERSION = 1
ROOT = "some/path/"
STATE_DICT_PATH = ROOT + 'state_dict.pth'

num_classes = 2
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# get_instance_segmentation_model is my helper that builds the Mask R-CNN model
model = get_instance_segmentation_model(VERSION, num_classes)
model.load_state_dict(torch.load(STATE_DICT_PATH, map_location=device))
model.to(device)

img = Image.open(ROOT + "images/some_image.jpg")
img = F.pil_to_tensor(img)
img = F.convert_image_dtype(img)  # uint8 -> float32 in [0, 1]

model.eval()
with torch.no_grad():
    out = model([img.to(device)])

The error I get:
RuntimeError: The expanded size of the tensor (14204) must match the existing size (0) at non-singleton dimension 1. Target sizes: [10652, 14204]. Tensor sizes: [0, 0]

If, however, I rebuild the training dataset as in the tutorial:

dataset = MyDataset(ROOT, get_transform(train=False))
img, _ = dataset[60]

it runs fine. I don't understand why, since in both cases I'm passing a PIL image converted to a tensor to the model… One note: my original images are rather large (14000x10000 pixels), so I had them resized for training, but I would like to predict on the full-size sources.

Based on the error message it seems your input is empty.
Could you check img.shape before passing it to the model?

I just did:

img = Image.open(ROOT +"images/P007CX14f052_1.jpg")
img = F.pil_to_tensor(img)
img = F.convert_image_dtype(img)

print(img.shape)

Output: torch.Size([3, 10652, 14204])

Here’s the full trace:

File "individualizador/predict.py", line 42, in <module>
    out = model([img.to(device)])
  File "/home/ims/.local/share/virtualenvs/individualizacao-BUoPwUHr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ims/.local/share/virtualenvs/individualizacao-BUoPwUHr/lib/python3.8/site-packages/torchvision/models/detection/generalized_rcnn.py", line 106, in forward
    detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)  # type: ignore[operator]
  File "/home/ims/.local/share/virtualenvs/individualizacao-BUoPwUHr/lib/python3.8/site-packages/torchvision/models/detection/transform.py", line 261, in postprocess
    masks = paste_masks_in_image(masks, boxes, o_im_s)
  File "/home/ims/.local/share/virtualenvs/individualizacao-BUoPwUHr/lib/python3.8/site-packages/torchvision/models/detection/roi_heads.py", line 484, in paste_masks_in_image
    res = [paste_mask_in_image(m[0], b, im_h, im_w) for m, b in zip(masks, boxes)]
  File "/home/ims/.local/share/virtualenvs/individualizacao-BUoPwUHr/lib/python3.8/site-packages/torchvision/models/detection/roi_heads.py", line 484, in <listcomp>
    res = [paste_mask_in_image(m[0], b, im_h, im_w) for m, b in zip(masks, boxes)]
  File "/home/ims/.local/share/virtualenvs/individualizacao-BUoPwUHr/lib/python3.8/site-packages/torchvision/models/detection/roi_heads.py", line 424, in paste_mask_in_image
    im_mask[y_0:y_1, x_0:x_1] = mask[(y_0 - box[1]) : (y_1 - box[1]), (x_0 - box[0]) : (x_1 - box[0])]
RuntimeError: The expanded size of the tensor (14204) must match the existing size (0) at non-singleton dimension 1.  Target sizes: [10652, 14204].  Tensor sizes: [0, 0]

@ptrblck it definitely seems related to image size: resizing the input by the same factor used for the dataset gets rid of the error. Did I train a model that can't handle my full-size images? I was running into RAM issues when dealing with the masks (an image containing 30 object instances creates a [30, 10000, 14000] mask tensor), hence the resizing.
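
For reference, the rough arithmetic behind that, assuming one byte per mask element (uint8 masks):

num_instances = 30
h, w = 10_000, 14_000
mask_bytes = num_instances * h * w   # one uint8 byte per element
print(mask_bytes / 1024**3)          # ~3.9 GiB of masks for a single image; float32 would be 4x that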

I get the same error message if I try to assign an empty mask to im_mask as seen here:

import torch

y_0, y_1 = 0, 10652
x_0, x_1 = 0, 14204
box = [x_1, y_1]  # box coordinates that make the mask slice collapse to size 0
im_mask = torch.zeros(y_1, x_1)
mask = torch.randint(0, 2, (y_1, x_1)).bool()

# the slice below evaluates to mask[-10652:0, -14204:0], i.e. an empty [0, 0] tensor
im_mask[y_0:y_1, x_0:x_1] = mask[(y_0 - box[1]) : (y_1 - box[1]), (x_0 - box[0]) : (x_1 - box[0])]
# RuntimeError: The expanded size of the tensor (14204) must match the existing size (0) at non-singleton dimension 1.  Target sizes: [10652, 14204].  Tensor sizes: [0, 0]

so check what x_0/1, y_0/1 and box contain and make sure the mask indexing doesn't return an empty tensor.

Right, but why is the model returning an empty mask for the full-size image while it works with the resized one?

Is your suggestion to include debugging in the source code? paste_mask_in_image gets called internally when I try to predict with the model.
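
One way to inspect those values without editing the installed package would be to wrap the function at runtime; a rough sketch (untested, names taken from the trace above):

import torchvision.models.detection.roi_heads as roi_heads

_orig_paste = roi_heads.paste_mask_in_image

def debug_paste_mask_in_image(mask, box, im_h, im_w):
    # print the raw box coordinates before torchvision pastes the mask
    print("box:", box, "im_h:", im_h, "im_w:", im_w, "mask shape:", mask.shape)
    return _orig_paste(mask, box, im_h, im_w)

roi_heads.paste_mask_in_image = debug_paste_mask_in_image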

I'm not sure if your model is able to detect objects at a different input resolution or if it needs more training on these images. Generally, I would expect to see differences in CNNs, as the kernels were trained to extract features from the inputs used during training, while these features might look different now. In a very simplified way, think of the first kernels as edge detectors: they could fire a strong signal on a low-resolution image (assuming the conv layers were trained on such images), while the "edge" might spread over a much larger pixel area in a high-resolution image, so these edge detectors would see an almost constant input.
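
A toy illustration of that effect (purely illustrative, unrelated to Mask R-CNN itself): the same 3x3 edge filter responds strongly to a sharp 1-pixel edge, but much more weakly once the image is upsampled and the transition is spread over many pixels.

import torch
import torch.nn.functional as nnF

# vertical edge detector (Sobel-like kernel)
kernel = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]]).view(1, 1, 3, 3)

# low-res image: sharp black-to-white step edge
low = torch.zeros(1, 1, 32, 32)
low[..., 16:] = 1.0

# "high-res" version of the same content: the edge is smeared over several pixels
high = nnF.interpolate(low, scale_factor=8, mode='bilinear', align_corners=False)

print(nnF.conv2d(low, kernel).abs().max())   # strong response at the sharp edge
print(nnF.conv2d(high, kernel).abs().max())  # much weaker response per kernel window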


Ah ok, that’s probably what’s going on. I thought Mask R-CNN was insensitive to scale differences since it internally rescales the inputs anyway. I’ll try to find a sweet spot between image size and RAM consumption to see if I can get this to work, thanks!

edit: would tiling the images so they fit in memory help in this case?
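
For what it's worth, if I understand the torchvision source correctly, the internal rescaling comes from the model's GeneralizedRCNNTransform, so its bounds can be checked and overridden when building the model. A sketch, assuming a standard torchvision Mask R-CNN as in the tutorial (not my exact get_instance_segmentation_model helper):

from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(num_classes=2)
print(model.transform.min_size, model.transform.max_size)  # typically (800,) and 1333 by default

# the same kwargs can be overridden when constructing the model,
# e.g. to let it work on larger inputs (at the cost of memory)
model = maskrcnn_resnet50_fpn(num_classes=2, min_size=2048, max_size=4096)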

However, IMNO (in my noob opinion), a more informative error could help in this case. Maybe check whether the masks are empty before passing them to paste_masks_in_image and raise earlier, so the user knows nothing was detected (see the sketch below)? I'm not sure how general this solution would be, just my two cents. The way it is now, it looks like there's a problem with my input.
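
Purely illustrative, not the actual torchvision code, and the wrapper name is made up:

from torchvision.models.detection.roi_heads import paste_masks_in_image

def paste_masks_in_image_checked(masks, boxes, img_shape, padding=1):
    # hypothetical wrapper: fail early with a readable message instead of the
    # shape-mismatch error raised deep inside paste_mask_in_image
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    if masks.numel() == 0 or (widths <= 0).any() or (heights <= 0).any():
        raise ValueError(
            "Got empty masks or degenerate boxes; the model likely did not "
            f"detect anything at this input resolution (boxes: {boxes.shape})."
        )
    return paste_masks_in_image(masks, boxes, img_shape, padding)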

That’s a good point and I might be wrong if these images are indeed resized internally.

Yes, I also think the error message could be improved. However, I'm unsure if a value check is strictly needed, as it could add a device synchronization (checking tensor values on the GPU forces a host sync), which you generally want to avoid.
CC @pmeier for visibility

Providing a better error message is usually a good thing. Given that there might be some perf regressions, we need to be careful about where we would put such a check. @martimpassos can you provide an MRE that either doesn't rely on your data, e.g. replaces the image with torch.rand() or similar, or includes your data, so we can reproduce?
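
Something like this skeleton would be a good starting point (the model constructor and the input size are placeholders, adjust to match your setup):

import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(num_classes=2).eval()

# random image at the problematic resolution instead of the real data
img = torch.rand(3, 10652, 14204)

with torch.no_grad():
    out = model([img])
print(out[0]["boxes"].shape, out[0]["masks"].shape)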

Just getting back to this, I’m currently unable to access my work computer (moving offices) but will send an MRE asap. Thanks for your support!