Trying to use ConvNeXt as Faster-RCNN backbone

I’m having trouble training a Faster-RCNN model on COCO with an ImageNet-pretrained torchvision ConvNeXt as the backbone. The model is built as shown below:

import torch
import torchvision  # needed for torchvision.ops below
import torchvision.models.detection as torchdet
from torchvision.models import convnext_tiny, ConvNeXt_Tiny_Weights

backbone = convnext_tiny(weights=ConvNeXt_Tiny_Weights.DEFAULT).features
# 768 determined using torchinfo.summary(backbone, (3,300,300))
backbone.out_channels = 768

# 5 sizes x 3 aspect ratios = 15 anchors per location
anchor_generator = torchdet.rpn.AnchorGenerator(
    sizes=((32, 64, 128, 256, 512),), aspect_ratios=((0.5, 1.0, 2.0),))

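# NOTE: the original call was cut off after featmap_names; output_size=7 and
# sampling_ratio=2 are assumed here (the values FasterRCNN uses for its default box_roi_pool)
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0', '1', '2', '3'],
                                                output_size=7, sampling_ratio=2)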

# 91 classes in MS COCO
model = torchdet.FasterRCNN(backbone=backbone, num_classes=91,
                            rpn_anchor_generator=anchor_generator, box_roi_pool=roi_pooler)

I’m trying to emulate the training recipes used by the Torchvision team, so my setup looks like this:

params = [p for p in model.parameters() if p.requires_grad]
# LR of 0.0025 because I'm only using 1 GPU; Facebook used 0.02 for 8 GPUs
optimizer = torch.optim.SGD(params, lr=0.0025, momentum=0.9, weight_decay=1e-4)
lr_scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[16, 22], gamma=0.1)

# Using batch size 2
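
The 0.0025 value is consistent with scaling the learning rate linearly with the total batch size. A minimal sketch of that arithmetic follows. It assumes the 8-GPU reference run used 2 images per GPU, for a total batch of 16 (that per-GPU figure is an assumption, not something stated in this post), and scaled_lr is a hypothetical helper, not a torchvision function.

```python
def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear scaling rule: learning rate proportional to total batch size."""
    return base_lr * batch / base_batch


reference_batch = 8 * 2  # 8 GPUs x 2 images per GPU (assumed)
my_batch = 1 * 2         # 1 GPU x batch size 2, as in this setup

lr = scaled_lr(0.02, reference_batch, my_batch)
print(f"{lr:.4f}")
```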

However, within a single epoch the loss only drops to about 0.5 before jumping back up to about 1.8. Is there something else I should change about the training setup to better match how the Torchvision team trained its other FasterRCNN models?

I’m also curious what featmap_names in torchvision.ops.MultiScaleRoIAlign(featmap_names=['0', '1', '2', '3'],...) is supposed to do.