ZeroDivisionError when using torchvision's MaskRCNN

Hello, everyone. I am using the pretrained torchvision MaskRCNN model on a dataset containing several videos, passing each video to the model frame by frame. However, in the forward pass of the MaskRCNN model I get the following error:

RuntimeError                              Traceback (most recent call last)

----> 1 output = model(images,targets)

4 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/generalized_rcnn.py in forward(self, images, targets)
     64             original_image_sizes.append((val[0], val[1]))
     65 
---> 66         images, targets = self.transform(images, targets)
     67         features = self.backbone(images.tensors)
     68         if isinstance(features, torch.Tensor):

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/transform.py in forward(self, images, targets)
     42                 raise ValueError("images is expected to be a list of 3d tensors "
     43                                  "of shape [C, H, W], got {}".format(image.shape))
---> 44             image = self.normalize(image)
     45             image, target_index = self.resize(image, target_index)
     46             images[i] = image

/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/transform.py in normalize(self, image)
     62         mean = torch.as_tensor(self.image_mean, dtype=dtype, device=device)
     63         std = torch.as_tensor(self.image_std, dtype=dtype, device=device)
---> 64         return (image - mean[:, None, None]) / std[:, None, None]
     65 
     66     def torch_choice(self, l):

RuntimeError: ZeroDivisionError

In this code, images is a list of length 431 (the number of frames in the video), and each item is a torch tensor of shape 3 x 256 x 320 (channels x height x width). targets is a list of dictionaries holding the labels, masks and bounding box coordinates. It also has length 431, with one dictionary per frame.
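The input structure described above can be sketched as follows. This is only an illustration: it builds 4 random frames instead of 431, the helper name make_dummy_frame_and_target and all box, label and mask values are hypothetical, and the key names and dtypes follow what torchvision's detection models document for training targets.

```python
import torch


def make_dummy_frame_and_target(height=256, width=320):
    """Hypothetical helper: one random float frame plus a single-object target."""
    image = torch.rand(3, height, width)  # float32 values in [0, 1)

    # One binary instance mask of shape [N, H, W] with N = 1 object.
    mask = torch.zeros(1, height, width, dtype=torch.uint8)
    mask[0, 50:150, 60:200] = 1

    target = {
        "boxes": torch.tensor([[60.0, 50.0, 200.0, 150.0]]),  # [x1, y1, x2, y2]
        "labels": torch.tensor([1], dtype=torch.int64),
        "masks": mask,
    }
    return image, target


# A list of [C, H, W] tensors and a list of dicts, one entry per frame.
images, targets = [], []
for _ in range(4):  # 4 frames here instead of 431, to keep the sketch small
    image, target = make_dummy_frame_and_target()
    images.append(image)
    targets.append(target)

print(len(images), images[0].shape, images[0].dtype)
print(len(targets), sorted(targets[0].keys()))
```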

The code I am using is similar to the one found in this tutorial. The tutorial deals with static images, whereas I am working with videos, so I made only minor changes to that code.

It seems the error arises because the standard deviation of the pixel values in one of the images comes out to be zero, which is highly unlikely, if not impossible. Is there any other reason this error could occur?

Are you passing image_mean and image_std manually to the transformation?
If not, could you post a small code snippet showing how you are calling the model, including all input shapes, so that we can have a look?
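Something along these lines might help collect that information. It is a minimal sketch: describe_inputs is a hypothetical helper, and the uint8 frame and the single box below are stand-in data, not values from the actual dataset.

```python
import torch


def describe_inputs(images, targets):
    """Return one summary dict per frame: shape, dtype, value range, target keys."""
    rows = []
    for image, target in zip(images, targets):
        rows.append({
            "shape": tuple(image.shape),
            "dtype": image.dtype,
            "min": image.min().item(),
            "max": image.max().item(),
            "target_keys": sorted(target.keys()),
        })
    return rows


# Stand-in data: one frame with 8-bit integer pixels and one box.
images = [torch.randint(0, 256, (3, 256, 320), dtype=torch.uint8)]
targets = [{
    "boxes": torch.tensor([[10.0, 20.0, 100.0, 120.0]]),
    "labels": torch.tensor([1]),
}]

for row in describe_inputs(images, targets):
    print(row)
```

The dtype and the min/max range are the interesting parts here, since the failing line divides by a std tensor that is created from the image.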

I am not passing image_mean and image_std manually.

Sure, this is how I am calling the model:

images, targets = next(iter(trainloader))
images = list(torch.squeeze(image) for image in images)
print(len(images))       # Will output 431
print(images[0].shape)   # Will output torch.Size([3, 256, 320])

targets = [{k: v for k, v in t.items()} for t in targets]
print(len(targets))      # Will output 431

output = model(images, targets)

Here, 431 is the number of frames in my first video. targets is a list of dictionaries containing the bounding box coordinates, masks and image ids for each individual frame. I hope this is fine?

Could you take a look at this post, which had a similar issue and solved it by making sure to use a normalized FloatTensor?
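One plausible explanation, going by the traceback rather than a confirmed diagnosis: normalize() builds the mean and std tensors with a dtype variable that appears to come from the image. If the frames are integer tensors (for example uint8), the default std values would be truncated to 0 and the division would fail. The sketch below illustrates that truncation and one possible conversion. The statistics are the defaults used by torchvision's detection models, and frame_uint8 is a hypothetical stand-in for a raw video frame.

```python
import torch

# Default normalization statistics of torchvision's detection models
# (assumed unchanged here, since they are not passed manually).
image_mean = [0.485, 0.456, 0.406]
image_std = [0.229, 0.224, 0.225]

# Hypothetical stand-in for one raw video frame with 8-bit integer pixels.
frame_uint8 = torch.randint(0, 256, (3, 256, 320), dtype=torch.uint8)

# Casting the std values to an integer dtype truncates every entry to 0,
# which is what the division in normalize() would then divide by.
std_as_uint8 = torch.as_tensor(image_std, dtype=frame_uint8.dtype)
print(std_as_uint8)

# Possible fix: convert each frame to float32 in [0, 1] before calling the model.
frame_float = frame_uint8.float() / 255.0

# With a float image, mean and std keep their fractional values.
mean = torch.as_tensor(image_mean, dtype=frame_float.dtype)
std = torch.as_tensor(image_std, dtype=frame_float.dtype)
normalized = (frame_float - mean[:, None, None]) / std[:, None, None]
print(normalized.dtype, torch.isfinite(normalized).all().item())
```

Applied to the snippet above, this would amount to something like images = [img.float() / 255.0 for img in images] before calling model(images, targets), assuming the frames are stored as uint8 values in [0, 255].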