Should the target contain one object at a time, or all objects in the image, during object detection?

As far as I know, we should label every object of a given category that appears in an image, since otherwise the loss function will penalize detections that are correct but simply not present in the target.

But that is not what I observe in practice.
I’m using torchvision.models.detection.fasterrcnn_resnet50_fpn for custom object detection.
Each image contains from 1 to ~20 objects.

I tried two different approaches for creating the target variable:

  1. the target contains only one object at a time:
[{'boxes': tensor([[313,  34, 369,  62]], device='cuda:0'), 
'labels': tensor([13], device='cuda:0')}]
  2. the target contains all objects that are present in the image (see the sketch after this list):
[{'boxes': tensor([[313,  34, 369,  62], [332, 244, 389, 274]], device='cuda:0'),
 'labels': tensor([13, 13], device='cuda:0')}]

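For context, here is a minimal sketch of how I wire up the second case during training. The image tensor, box coordinates, and `num_classes` value below are illustrative placeholders, not my real data:

```python
import torch
import torchvision

# num_classes is a placeholder; it must cover the background class plus all object classes.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=21)
model.train()

# One image (C x H x W) and one target dict per image.
images = [torch.rand(3, 480, 640)]
targets = [{
    # Boxes as float32 in (x1, y1, x2, y2) format, one row per object in the image.
    'boxes': torch.tensor([[313., 34., 369., 62.],
                           [332., 244., 389., 274.]], dtype=torch.float32),
    # One int64 label per box.
    'labels': torch.tensor([13, 13], dtype=torch.int64),
}]

# In training mode the model returns a dict of losses
# (classifier, box regression, RPN objectness, RPN box regression).
loss_dict = model(images, targets)
loss = sum(loss_dict.values())
loss.backward()
```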
In this specific task, the smallest loss I could get was when I created an individual target for each individual object (the first case).

This does not fit the theory, so can someone tell me what I am doing wrong?