Hello, everyone. I am using the pretrained torchvision MaskRCNN model on a dataset containing several videos, passing each video to the model frame by frame. However, the forward pass of the MaskRCNN model fails with the following error:
RuntimeError Traceback (most recent call last)
----> 1 output = model(images,targets)
4 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
548 result = self._slow_forward(*input, **kwargs)
549 else:
--> 550 result = self.forward(*input, **kwargs)
551 for hook in self._forward_hooks.values():
552 hook_result = hook(self, input, result)
/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/generalized_rcnn.py in forward(self, images, targets)
64 original_image_sizes.append((val[0], val[1]))
65
---> 66 images, targets = self.transform(images, targets)
67 features = self.backbone(images.tensors)
68 if isinstance(features, torch.Tensor):
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
548 result = self._slow_forward(*input, **kwargs)
549 else:
--> 550 result = self.forward(*input, **kwargs)
551 for hook in self._forward_hooks.values():
552 hook_result = hook(self, input, result)
/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/transform.py in forward(self, images, targets)
42 raise ValueError("images is expected to be a list of 3d tensors "
43 "of shape [C, H, W], got {}".format(image.shape))
---> 44 image = self.normalize(image)
45 image, target_index = self.resize(image, target_index)
46 images[i] = image
/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/transform.py in normalize(self, image)
62 mean = torch.as_tensor(self.image_mean, dtype=dtype, device=device)
63 std = torch.as_tensor(self.image_std, dtype=dtype, device=device)
---> 64 return (image - mean[:, None, None]) / std[:, None, None]
65
66 def torch_choice(self, l):
RuntimeError: ZeroDivisionError
My images in this code is a list of length 431 (the number of frames in the video), with each item being a torch tensor of shape 3 x 256 x 320 (channels x height x width). targets here contains the labels, masks, and bounding-box coordinates; it is also a list of length 431, with each item being a dictionary holding the annotations for the corresponding frame.
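In case it helps, here is a minimal sketch of how I construct the inputs and call the model (with dummy data standing in for my real frames and annotations, and 4 frames instead of 431):

```python
import torch
import torchvision

# Pretrained Mask R-CNN, loaded the same way as in the torchvision tutorial.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.train()  # training mode: the model expects targets and returns losses

# Dummy inputs mirroring my shapes, with 4 frames instead of the full 431.
num_frames, C, H, W = 4, 3, 256, 320
images = [torch.rand(C, H, W) for _ in range(num_frames)]

# One target dict per frame; a single dummy object per frame here.
targets = []
for _ in range(num_frames):
    masks = torch.zeros((1, H, W), dtype=torch.uint8)  # [N, H, W]
    masks[0, 10:100, 10:100] = 1
    targets.append({
        "boxes": torch.tensor([[10.0, 10.0, 100.0, 100.0]]),  # [N, 4], (x1, y1, x2, y2)
        "labels": torch.ones((1,), dtype=torch.int64),        # [N]
        "masks": masks,
    })

output = model(images, targets)  # this is the call that raises the error
```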
The code I am using is similar to the one found in this tutorial, except that the tutorial deals with static images and I am working with videos, so I have made only minor changes to it.
It seems that the error arises because the division by std in normalize ends up dividing by zero for one of the images. Since I have not overridden image_std, it should hold torchvision's fixed default values, so a zero there seems highly unlikely if not impossible. Is there any other reason for this error to occur?
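As a sanity check, the failing normalize step can be replicated outside the model on my frames. The mean/std values below are torchvision's defaults (which I have not overridden); note that, per the traceback, the transform casts them to each image's dtype before dividing:

```python
import torch

image_mean = [0.485, 0.456, 0.406]  # torchvision's default values
image_std = [0.229, 0.224, 0.225]

for i, image in enumerate(images):  # `images` is my list of 431 frame tensors
    dtype, device = image.dtype, image.device
    # Same casts as torchvision's GeneralizedRCNNTransform.normalize
    mean = torch.as_tensor(image_mean, dtype=dtype, device=device)
    std = torch.as_tensor(image_std, dtype=dtype, device=device)
    if (std == 0).any():
        print(f"frame {i}: dtype={dtype} turns std into {std.tolist()}")
        break
    _ = (image - mean[:, None, None]) / std[:, None, None]
```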