Loss function for object detection

Hello!

I’m training a small network to detect and locate a certain type of snail in an image. The network consists of a few Conv2d layers and a couple of Linear layers, and the final output is a vector (x, y, width, height, has_snail). All five elements go through a sigmoid activation.

Training the network with ~80 augmented images resized to 512×384 pixels and MSE loss gives decent performance.

However, I’d like to use BCE loss for has_snail and MSE for everything else, but the model gets much worse when I do that. Training is unstable, it doesn’t predict the locations as well as before, and has_snail is always very close to 1 (the dataset is imbalanced).

My loss function is as follows:

import torch.nn.functional as F
import torch.nn as nn

class MSE_BCE_loss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, predictions, target):
        # MSE on the box coordinates (x, y, width, height)
        mse = F.mse_loss(predictions[:, :4], target[:, :4])
        # BCE on the has_snail output (sigmoid has already been applied)
        bce = F.binary_cross_entropy(predictions[:, 4:], target[:, 4:])

        return 0.5 * mse + 0.5 * bce
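For context, here is a minimal, self-contained sketch of how the loss is called. The shapes and tensor names are illustrative, not my exact training code: a batch of 8 sigmoid-squashed predictions against targets whose last column is the 0/1 has_snail label.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSE_BCE_loss(nn.Module):
    def forward(self, predictions, target):
        # MSE on the box coordinates, BCE on the has_snail probability
        mse = F.mse_loss(predictions[:, :4], target[:, :4])
        bce = F.binary_cross_entropy(predictions[:, 4:], target[:, 4:])
        return 0.5 * mse + 0.5 * bce

# Illustrative batch: 8 predictions after sigmoid, all values in [0, 1]
preds = torch.sigmoid(torch.randn(8, 5))
boxes = torch.rand(8, 4)                       # normalized x, y, w, h targets
labels = torch.randint(0, 2, (8, 1)).float()   # 0/1 has_snail labels
target = torch.cat([boxes, labels], dim=1)

loss = MSE_BCE_loss()(preds, target)           # scalar tensor, non-negative
```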

What am I doing wrong? Is it something in the code, or is it simply wrong to add two different types of losses together like this? I’ve tried adjusting the coefficients of the two losses without any apparent effect.