I am studying the Detectron2 implementation of Mask R-CNN. The ground-truth mask target is a 28x28 binary tensor, rasterized from the annotation polygons by:

File `masks.py`:

```python
from typing import List

import numpy as np
from pycocotools import mask as mask_utils


def polygons_to_bitmask(polygons: List[np.ndarray], height: int, width: int) -> np.ndarray:
    """
    Args:
        polygons (list[ndarray]): each array has shape (Nx2,)
        height, width (int)

    Returns:
        ndarray: a bool mask of shape (height, width)
    """
    assert len(polygons) > 0, "COCOAPI does not support empty polygons"
    rles = mask_utils.frPyObjects(polygons, height, width)
    rle = mask_utils.merge(rles)
    # Note: the original code used np.bool, which was removed in NumPy 1.24;
    # the plain Python bool works the same here.
    return mask_utils.decode(rle).astype(bool)
```

The default mask height and width is 28. A 28x28 binarized mask is very low resolution, yet the model is able to predict much higher-fidelity masks. I am struggling to understand how a severely down-sampled ground-truth mask used for training can produce higher-resolution masks at inference. Can someone please enlighten me?
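For concreteness, here is my understanding of the inference-time paste-back step, as a minimal numpy-only sketch (the disk-shaped "predicted" soft mask and the hand-rolled `bilinear_resize` are my own stand-ins, not Detectron2 code): the head outputs a 28x28 map of *probabilities*, which is bilinearly upsampled to the box size and only then thresholded, so the boundary lands at sub-cell positions rather than on the 28x28 grid.

```python
import numpy as np

def bilinear_resize(mask: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Upsample a 2D float array with bilinear interpolation (align_corners=False style)."""
    h, w = mask.shape
    ys = np.clip((np.arange(out_h) + 0.5) * h / out_h - 0.5, 0, h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * w / out_w - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = mask[y0][:, x0] * (1 - wx) + mask[y0][:, x1] * wx
    bot = mask[y1][:, x0] * (1 - wx) + mask[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

# Hypothetical 28x28 head output: a disk with a soft (sigmoid-like) edge,
# i.e. per-cell foreground probabilities, not a hard 0/1 mask.
yy, xx = np.mgrid[0:28, 0:28]
soft = 1.0 / (1.0 + np.exp(np.hypot(yy - 13.5, xx - 13.5) - 10.0))

# Paste back into a (hypothetical) 224x224 detection box:
# upsample the probabilities first, threshold at 0.5 last.
up = bilinear_resize(soft, 224, 224)
final = up >= 0.5  # boundary is placed between 28x28 grid cells
```

If the thresholding happened *before* upsampling, the result would show 8-pixel staircase blocks; interpolating the soft scores first is (as far as I can tell) what makes the pasted masks look higher resolution than 28x28.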