Detectron2 mask head resolution question

I am studying the Detectron2 implementation of Mask R-CNN. The ground-truth mask target is a 28x28 binary tensor created by:


from typing import List
import numpy as np
import pycocotools.mask as mask_utils

def polygons_to_bitmask(polygons: List[np.ndarray], height: int, width: int) -> np.ndarray:
    """
    Args:
        polygons (list[ndarray]): each array has shape (Nx2,)
        height, width (int)

    Returns:
        ndarray: a bool mask of shape (height, width)
    """
    assert len(polygons) > 0, "COCOAPI does not support empty polygons"
    rles = mask_utils.frPyObjects(polygons, height, width)  # one RLE per polygon
    rle = mask_utils.merge(rles)  # union of all polygons into a single RLE
    return mask_utils.decode(rle).astype(bool)  # np.bool was removed in NumPy 1.24

The default mask height and width are 28. A 28x28 binary mask is very low resolution, yet the model is able to predict much higher-fidelity masks. I am struggling to understand how training on such a severely down-sampled ground-truth mask can produce higher-resolution masks. Can someone please enlighten me?

I’m not completely sure, but I would assume that the predicted mask is interpolated back up to the input shape, as seen here.
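
To make the interpolation idea concrete, here is a minimal NumPy sketch of what I have in mind. It is not Detectron2's actual code: `bilinear_resize` and `paste_mask` are hypothetical helper names, and the integer box, the synthetic soft disk and the 0.5 threshold are assumptions for illustration. The point is that the 28x28 output is a grid of probabilities for a single box rather than a binary mask of the whole image, so interpolating it to the box size and thresholding afterwards can give a boundary finer than one 28x28 cell.

```python
import numpy as np


def bilinear_resize(mask: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Resize a 2D float array with bilinear interpolation (pixel centers aligned)."""
    in_h, in_w = mask.shape
    # Map each output pixel center back to input coordinates, clamped to the grid.
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]  # vertical blend weights, shape (out_h, 1)
    wx = (xs - x0)[None, :]  # horizontal blend weights, shape (1, out_w)
    top = mask[y0][:, x0] * (1 - wx) + mask[y0][:, x1] * wx
    bottom = mask[y1][:, x0] * (1 - wx) + mask[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def paste_mask(prob, box, img_h, img_w, threshold=0.5):
    """Upsample a low-res probability mask to its box and paste it into the image."""
    x0, y0, x1, y1 = box  # assumed: integer pixel box lying fully inside the image
    resized = bilinear_resize(prob, y1 - y0, x1 - x0)
    canvas = np.zeros((img_h, img_w), dtype=bool)
    canvas[y0:y1, x0:x1] = resized >= threshold  # binarize only after upsampling
    return canvas


# Synthetic 28x28 "prediction": a soft disk with a radius of about 10 cells.
yy, xx = np.mgrid[0:28, 0:28]
dist = np.hypot(yy - 13.5, xx - 13.5)
prob = 1.0 / (1.0 + np.exp(dist - 10.0))

# Paste it into a 280x280 box of a 480x640 image.
full = paste_mask(prob, box=(100, 50, 380, 330), img_h=480, img_w=640)
print(full.shape, int(full.sum()))
```

Thresholding the 28x28 grid first and then enlarging it would give a blocky edge made of 10x10 pixel squares, while interpolating the probabilities first lets the 0.5 contour fall between grid cells.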