Lowest validation loss produces more false positives

I’m training a Faster R-CNN object detector and have used the following to compute the validation loss:

def evaluate_loss(model, data_loader, device):
    val_loss = 0.0
    with torch.no_grad():
        for images, targets in data_loader:
            images = [image.to(device) for image in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

            # eval_forward is a custom helper that returns both the loss
            # dict and the detections while the model is in eval mode
            losses_dict, detections = eval_forward(model, images, targets)
            losses = sum(loss for loss in losses_dict.values())
            val_loss += losses.item()  # convert to a float before accumulating
    return val_loss / len(data_loader)

Essentially, it sums all of the loss components the model outputs and divides by the number of batches to give a single figure for the epoch.
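For context, `eval_forward` isn’t shown above. One common workaround (a sketch, not necessarily what `eval_forward` does) exploits the fact that torchvision’s detection models only return the loss dict in training mode, so you can temporarily switch to `train()` under `no_grad` — with the caveat that BatchNorm statistics behave differently in train mode:

```python
import torch

def evaluate_loss_simple(model, data_loader, device):
    # Hypothetical alternative to eval_forward: switch to train mode so the
    # torchvision detection model returns its loss dict, but disable grads.
    model.train()
    val_loss = 0.0
    with torch.no_grad():
        for images, targets in data_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            loss_dict = model(images, targets)  # dict of loss tensors
            val_loss += sum(loss.item() for loss in loss_dict.values())
    model.eval()
    return val_loss / len(data_loader)
```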

However, when I compared the model with the lowest validation loss against a model saved at a later epoch, I found that although the later model is less “sensitive” (it misses more objects), it produces vastly fewer false positives.

As an FYI, I am using cross-entropy loss and finding a few similar-sized targets in a 512 x 512 pixel image (binary classification: object + background).

Is there any reason why this might be happening?

What’s the balance of positive/negative examples in your dataset? For example, the lowest-loss model could be more “accurate” overall because it has fewer false negatives, despite producing more false positives.


Well it’s object detection, looking for objects of 32 x 32 pixels in an image size of 512 x 512 pixels. So there is typically a lot of background class.

In regard to my validation outputs, for the model with the lowest validation loss:

true positive: 610
false negative: 25
false positive: 740

And for the model from a later epoch:

true positive: 595
false negative: 40
false positive: 383

So yes, there is a sizeable discrepancy in true positives and false negatives between the two models, and the lowest-loss model is more accurate in terms of “sensitivity” measures. It is, however, not very precise…
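The trade-off is easy to quantify from those counts. A quick sketch (the helper name is my own):

```python
def precision_recall(tp, fn, fp):
    """Compute precision and recall from raw detection counts."""
    precision = tp / (tp + fp)  # fraction of detections that are correct
    recall = tp / (tp + fn)     # fraction of ground-truth objects found
    return precision, recall

# Model with lowest validation loss:
# precision_recall(610, 25, 740)  -> precision ~0.452, recall ~0.961
# Later-epoch model:
# precision_recall(595, 40, 383)  -> precision ~0.608, recall ~0.937
```

So the later model trades roughly 2.4 points of recall for about 15.6 points of precision.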

Also, as an FYI, I do not apply a threshold to the output probabilities. These numbers count every detection the model produces.
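Since no score threshold is applied, every low-confidence detection is counted, which inflates the false-positive tally. A minimal sketch of post-hoc thresholding, assuming torchvision-style detection dicts with `boxes`, `labels`, and `scores` keys (`filter_detections` is a hypothetical helper):

```python
import torch

def filter_detections(detections, score_threshold=0.5):
    # detections: dict of tensors as returned by torchvision detection
    # models in eval mode (assumption): 'boxes' (N, 4), 'labels' (N,),
    # 'scores' (N,). Keep only detections at or above the threshold.
    keep = detections["scores"] >= score_threshold
    return {k: v[keep] for k, v in detections.items()}
```

Sweeping the threshold and re-counting TP/FP/FN at each value would show whether the two checkpoints still differ once low-confidence detections are discarded.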