Fine-tuning detection network problem


I am trying to fine-tune an object detection network.

I am using a pre-trained Faster R-CNN with a ResNet50 backbone.

Usually, an object detection network is trained on large scene images.
But when I also feed cropped person images into the network backbone (ResNet50) while fine-tuning
the Faster R-CNN network on scene images, I found that performance gradually drops.

Even when I freeze all the layers, the performance still drops.

I think this is because the Faster R-CNN layers (conv, fc, BN) were trained on scene images,
and the BN layers' running statistics (running_mean, running_var) were also computed from scene images, so those values fit scene images.
But if I feed in cropped person images during training, even with all layers frozen,
the BN running statistics (running_mean, running_var), which are the values used at inference, are updated from the mini-batches (scene + cropped person in this case).
So I think the BN running statistics (running_mean, running_var) get corrupted, and as a result inference performance drops.
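
To make the suspicion concrete, here is a minimal pure-Python sketch of the update I believe a BN layer applies to its running statistics in training mode. It assumes the exponential-moving-average rule with momentum 0.1 (the PyTorch default, as far as I know). The function names, the starting statistics, and the activation values are all made up for illustration; they do not come from my actual network.

```python
# Sketch of the running-statistics update of a BN layer in training mode.
# Assumption: exponential moving average with momentum 0.1.
# All names and numbers below are hypothetical, for illustration only.

def update_running_stats(running_mean, running_var, batch, momentum=0.1):
    """Return the running mean/var after seeing one mini-batch of activations."""
    n = len(batch)
    batch_mean = sum(batch) / n
    # The running estimate uses the unbiased (n - 1) batch variance.
    batch_var = sum((x - batch_mean) ** 2 for x in batch) / (n - 1)
    new_mean = (1 - momentum) * running_mean + momentum * batch_mean
    new_var = (1 - momentum) * running_var + momentum * batch_var
    return new_mean, new_var


def simulate(steps=50):
    """Start from 'scene-only' statistics and feed mixed mini-batches."""
    # Hypothetical statistics learned from scene images only.
    running_mean, running_var = 0.0, 1.0
    # Hypothetical activations for a scene + cropped-person mini-batch
    # whose mean (2.0) differs from the scene-only mean (0.0).
    mixed_batch = [1.0, 3.0, 1.0, 3.0]
    for _ in range(steps):
        running_mean, running_var = update_running_stats(
            running_mean, running_var, mixed_batch
        )
    return running_mean, running_var


if __name__ == "__main__":
    # The running mean drifts away from 0.0 toward the mixed-batch mean.
    print(simulate())
```

If this is right, no gradient step is involved in the update, so freezing the weights alone would not stop it: the running statistics are buffers rather than learnable parameters, and they change whenever the BN layer runs in training mode.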

Is this correct?