No mask or box regression loss when training Mask R-CNN (and loading 32-bit float grayscale TIFF with CocoDetection)

I’m trying to use Mask R-CNN with torchvision.
I have a dataset in COCO format whose images are 32-bit float TIFFs. To load the data, the CocoDetection class accepts a transform. According to the docstring, the transform parameter should be: “A function/transform that takes in an PIL image and returns a transformed version. E.g, transforms.PILToTensor”.
I’ve tried the following two functions as the transform:

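import numpy as np
import torch
import torchvision.transforms.functional as F
from PIL import Image

# Variant 1: convert with torchvision, then scale by the image maximum
def my_transform(img: Image.Image):
    img_tensor = F.to_tensor(img)
    return img_tensor / torch.max(img_tensor)

# Variant 2: convert via NumPy, then scale by the image maximum
def my_transform(img: Image.Image):
    img_tensor = torch.Tensor(np.array(img))
    return img_tensor / torch.max(img_tensor)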

When I initialize the dataset with
ds = CocoDetection(root=data_dir, annFile=data_dir / 'annotations.json', transform=load_image_transform)
the target data is fine, but the image data is a tensor whose pixel values are all either 0. or 1. (dtype=torch.float32). When I apply my_transform directly to a PIL Image, it works like a charm.

I’ve read the code of the VisionDataset class, because CocoDetection calls super().__init__(*args), which sets self.transform and thus self.transforms. I also read the code of the CocoDetection class and found that the _load_image method converts the image to RGB:

    def _load_image(self, id: int) -> Image.Image:
        path = self.coco.loadImgs(id)[0]["file_name"]
        return Image.open(os.path.join(self.root, path)).convert("RGB")
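
To illustrate why this conversion is a problem for float data, here is a minimal sketch using a tiny synthetic array as a stand-in for a 32-bit float TIFF (the values are made up, not taken from the dataset). Pillow's RGB mode stores 8-bit integers per channel, so fractional values and anything above 255 cannot survive the conversion.

```python
import numpy as np
from PIL import Image

# Synthetic stand-in for a 32-bit float grayscale image (assumed values).
data = np.array([[0.25, 0.75], [1.0, 300.5]], dtype=np.float32)
img = Image.fromarray(data)  # 2-D float32 array -> Pillow mode "F"

# Reading the image back without conversion keeps the float values.
as_float = np.array(img)

# Converting to RGB yields three uint8 channels, so precision is lost.
as_rgb = np.array(img.convert("RGB"))

print(img.mode, as_float.dtype, as_rgb.dtype, as_rgb.shape)
```

If the original values lie mostly below 1, almost everything collapses to the same few integers, which would be consistent with the 0./1. tensor described above.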

Therefore I have overridden the _load_image method to return the image without converting it to “RGB”. Now the image transform works, and so does the DataLoader. (I’ve written this part down because it took me quite some time to figure out - in the end I only had to read the code, but I googled a lot beforehand and found nothing.)

Loading my Mask R-CNN model looks like this:

def load_mask_rcnn_overwrite_heads_and_send_to_gpu(grayscale_stats):
    num_classes = 10 
    model = models.detection.maskrcnn_resnet50_fpn_v2(weights=models.detection.MaskRCNN_ResNet50_FPN_V2_Weights.DEFAULT, weights_backbone=models.ResNet50_Weights.IMAGENET1K_V2)
    gt = GeneralizedRCNNTransform(min_size=3784, max_size=4000, image_mean=torch.Tensor([grayscale_stats[0]]).repeat(3), image_std=torch.Tensor([grayscale_stats[1]]).repeat(3))
    model.transform = gt
    # get number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # now get the number of input features for the mask classifier
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    # and replace the mask predictor with a new one
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, hidden_layer, num_classes)
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    model = nn.DataParallel(model)
    return model, device

prepare_targets is used to send the tensors to the GPUs:

def prepare_targets(target):
    targets = {}
    targets['id'] = [ i.to(device) for i in target['id'] ]
    targets['image_id'] = target['image_id'].to(device)
    targets['category_id'] = [ i.to(device) for i in target['category_id'] ]
    targets['segmentation'] = [ [ j.to(device) for j in i[0]] for i in target['segmentation'] ]
    targets['area'] = [ i.to(device) for i in target["area"] ]
    targets['bbox'] = [ [ j.to(device) for j in i ] for i in target["bbox"] ]
    targets['iscrowd'] = [ i.to(device) for i in target["iscrowd"] ]
    targets['boxes'] = target['boxes'][0].to(device)
    targets['masks'] = target['masks'].to(device)
    targets['labels'] = target['labels'][0].to(device)
    return targets
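
For reference, torchvision documents the training targets of its detection models as one dict per image with boxes (FloatTensor[N, 4] in [x1, y1, x2, y2] format), labels (Int64Tensor[N]) and masks (UInt8Tensor[N, H, W]), whereas COCO annotations store bbox as [x, y, width, height]. The following is a rough sanity-check sketch for one such dict. The function name check_detection_target is hypothetical, and this is only a format check, not a diagnosis of any particular loss value.

```python
import torch


def check_detection_target(target):
    """Return a list of format problems found in one target dict (hypothetical helper)."""
    problems = []
    boxes, labels, masks = target["boxes"], target["labels"], target["masks"]

    if boxes.ndim != 2 or boxes.shape[1] != 4:
        problems.append(f"boxes should have shape [N, 4], got {tuple(boxes.shape)}")
        return problems  # the remaining checks need a valid N
    n = boxes.shape[0]

    if not boxes.is_floating_point():
        problems.append(f"boxes should be floating point, got {boxes.dtype}")

    # In [x1, y1, x2, y2] format, both differences must be positive.
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    if bool(((widths <= 0) | (heights <= 0)).any()):
        problems.append("some boxes have x2 <= x1 or y2 <= y1 (maybe [x, y, w, h]?)")

    if labels.dtype != torch.int64 or tuple(labels.shape) != (n,):
        problems.append(f"labels should be int64 with shape [{n}], got {labels.dtype} {tuple(labels.shape)}")
    elif bool((labels <= 0).any()):
        # torchvision treats class 0 as background.
        problems.append("labels <= 0 found; 0 is reserved for background")

    if masks.ndim != 3 or masks.shape[0] != n:
        problems.append(f"masks should have shape [{n}, H, W], got {tuple(masks.shape)}")

    return problems
```

Calling check_detection_target(targets) right before the forward pass and printing the returned list gives a quick view of whether the boxes, labels and masks entries match the documented layout.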

The following is a part of my train loop:

for image, target in train_loader:  
    # Move the image and targets to the GPU if available
    images = [image[0].to(device)] # The index 0 is necessary for the correct shape, for whatever reason it is not in the correct shape when loading the data
    # images = [image[0].to(device) for image in images]
    targets = prepare_targets(target) 
    # Building the targets
    print(f"Passing image with id {target['image_id']} for training")
    # Forward pass
    outputs = model(images, [targets])

When running a single iteration with a batch_size of 1 and 89 targets, the outputs show that both the box regression loss and the mask loss are 0:

>>> outputs
{'loss_classifier': tensor([2.4278], device='cuda:0', grad_fn=<GatherBackward>), 'loss_box_reg': tensor([0.], device='cuda:0', grad_fn=<GatherBackward>), 'loss_mask': tensor([0.], device='cuda:0', grad_fn=<GatherBackward>), 'loss_objectness': tensor([0.1328], device='cuda:0', grad_fn=<GatherBackward>), 'loss_rpn_box_reg': tensor([0.1339], device='cuda:0', grad_fn=<GatherBackward>)}

I can’t figure out why the box regression loss and the mask loss are 0. Is there a problem with how I replace the heads? Or is it something completely different?