Difference between torchvision and detectron2 for Faster R-CNN

Hello, I’m trying to reproduce the results of some papers on a public object detection dataset with COCO annotations. They report training a Faster R-CNN with:

  • SGD optimizer
  • Initial learning rate of 1e-3, reduced to 1e-4 after several iterations
  • Momentum of 0.9 and weight decay of 5e-4
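
For reference, this is roughly how I read that setup in PyTorch. It is only a minimal sketch: the papers do not say at which iteration the learning rate is reduced, so `drop_iter` is a hypothetical value, and the `Linear` module is a stand-in for the detection model so the snippet runs on its own.

```python
import torch


def build_sgd_with_step_drop(params, drop_iter):
    """SGD with the reported hyperparameters and a single 10x LR drop.

    drop_iter is hypothetical: the papers only say "after several iterations".
    """
    optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9, weight_decay=5e-4)
    # MultiStepLR multiplies the LR by gamma at each milestone: 1e-3 -> 1e-4.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[drop_iter], gamma=0.1
    )
    return optimizer, scheduler


# Stand-in module so the sketch runs without a detection model.
model = torch.nn.Linear(4, 2)
optimizer, scheduler = build_sgd_with_step_drop(model.parameters(), drop_iter=3)

lrs = []
for _ in range(5):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # no gradients were computed, so parameters are unchanged
    scheduler.step()   # stepped once per iteration, not once per epoch
```

Here the scheduler is stepped per iteration because the papers describe the drop in iterations; with a per-epoch schedule the milestone would be an epoch index instead.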

However, when I fine-tune a torchvision Faster R-CNN pretrained on COCO on the same dataset (following the finetuning tutorial), the loss (in particular the rpn_box_reg loss) diverges to NaN after a few iterations, and it only trains when I reduce the initial learning rate to 1e-5. In contrast, a Faster R-CNN model from detectron2 trains correctly on the same dataset with a learning rate of 1e-3. In both cases, I am using a ResNet50 backbone with FPN.
So I am wondering whether there is any important difference between the two implementations that could lead to this different behavior, and whether one is recommended over the other.
The implementations by the paper authors are based on maskrcnn-benchmark or faster-rcnn.pytorch, which are both deprecated in favor of detectron2.
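
In case it helps to reproduce the problem, this is the kind of check I use to see which loss term becomes non-finite first. It is a sketch that assumes the per-term losses have already been converted to Python floats (for example with `.item()`); the helper name `non_finite_terms` is my own, and the keys in the example are the ones the torchvision model returns in training mode.

```python
import math


def non_finite_terms(loss_dict):
    """Return the names of loss terms whose value is NaN or +/-inf."""
    return [name for name, value in loss_dict.items() if not math.isfinite(value)]


# Example with made-up values: only the RPN box regression term has diverged.
example_losses = {
    "loss_classifier": 0.41,
    "loss_box_reg": 0.27,
    "loss_objectness": 0.08,
    "loss_rpn_box_reg": float("nan"),
}
bad_terms = non_finite_terms(example_losses)
print(bad_terms)
```

Calling this once per iteration and stopping at the first non-empty result gives the iteration and the term where the divergence starts.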

Thank you in advance for your help.