A question about finetuning a pre-trained Mask R-CNN model

What happens if we don't feed boxes into the network? Can we still get masks from it?

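During training, MaskRCNN expects both an input and a target containing boxes, labels, and masks. In evaluation mode only the input is required, and the model returns the predicted boxes, labels, and masks, as well as the scores. So you can still get masks without feeding boxes, as long as the model is in evaluation mode.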