Apparently the mask head uses BCEWithLogitsLoss, which combines a sigmoid with cross-entropy, to avoid competition between masks. I get a negative loss value. Is this a bug, or is something wrong with my encoding? I encoded each target mask as a uint8 array where all pixels are 0 except the object's mask, whose pixels are set to the mask's class (i.e. they match the target class).
Could you post a link to the repository you are using?
Also, could you post the inputs and targets for a negative loss value?
Negative values for nn.BCE(WithLogits)Loss can be created if your target is out of bounds.
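As a quick sanity check, here is a minimal sketch of that failure mode: with targets in [0, 1] the loss is non-negative, but a target outside that range (e.g. a class index like 3 instead of a binary value) can push it negative. The tensor values are just illustrative.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

logits = torch.tensor([2.0])
valid_target = torch.tensor([1.0])  # binary target in [0, 1]
bad_target = torch.tensor([3.0])    # out of bounds, e.g. a class index by mistake

print(criterion(logits, valid_target))  # positive, as expected
print(criterion(logits, bad_target))    # negative
```

So if the masks are encoded with the class number instead of 0/1, any class index above 1 can produce exactly this symptom.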
I’m using this one:
Thanks for the link. Have you checked the target values?
How do I correctly label the masks? If I have K objects across N classes in one image, the mask label array must be HxWxK, but within each map, should the mask values be 1 or the class number?
If you are using nn.BCE(WithLogits)Loss, the target should have the same shape as your output activation, where the channel index corresponds to the class index.
A one indicates the presence of the class at that spatial position, while a zero indicates its absence.
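A minimal sketch of that layout (the shapes and the class-1 patch are just illustrative, not tied to any particular model):

```python
import torch
import torch.nn as nn

batch, num_classes, H, W = 1, 3, 4, 4
logits = torch.randn(batch, num_classes, H, W)  # model output, one channel per class

# target has the same shape as the output; channel c holds 1.0 wherever class c is present
target = torch.zeros(batch, num_classes, H, W)
target[0, 1, 1:3, 1:3] = 1.0  # class 1 occupies a 2x2 patch

loss = nn.BCEWithLogitsLoss()(logits, target)  # non-negative for targets in [0, 1]
```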
This is from mask_rcnn.py:
During training, the model expects both the input tensors, as well as a targets dictionary,
containing:
- boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
between 0 and H and 0 and W
- labels (Tensor[N]): the class label for each ground-truth box
- masks (Tensor[N, H, W]): the segmentation binary masks for each instance
Here N is the number of ground-truth boxes, not the class index.
So if I have (for example) 3 objects of class 1, 2 objects of class 2, and 1 object of class 3 in a 224x224 image, the mask array will have shape 6x224x224, and each map will be 0 for background and 1 for the object. During training, BCE is applied pixelwise, and the correct class is taken from the labels vector. Since each object's class/label, bbox, and mask are appended to a list, the order stays the same.
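That setup could be sketched as follows. The blob positions are placeholders, and the boxes are derived from the masks here just to show how the per-instance order of labels, boxes, and masks stays aligned:

```python
import torch

H, W = 224, 224
num_objs = 6  # 3 of class 1, 2 of class 2, 1 of class 3

# one binary mask per instance: 0 = background, 1 = object
masks = torch.zeros(num_objs, H, W, dtype=torch.uint8)
for i in range(num_objs):
    masks[i, 10 * i:10 * i + 30, 10 * i:10 * i + 30] = 1  # placeholder blobs

# labels share the same index order as the masks
labels = torch.tensor([1, 1, 1, 2, 2, 3], dtype=torch.int64)

# derive [x0, y0, x1, y1] boxes from the masks, keeping the same order
boxes = []
for m in masks:
    ys, xs = torch.where(m)
    boxes.append([xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item()])
boxes = torch.tensor(boxes, dtype=torch.float32)

target = {"boxes": boxes, "labels": labels, "masks": masks}
```

Because index i refers to the same instance in all three tensors, the model can pair each mask with its label and box.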