In the Mask R-CNN paper, the optimizer is described as follows for training on the MS COCO 2014/2015 dataset for instance segmentation (I believe this is the dataset; correct me if I'm wrong):
"We train on 8 GPUs (so effective minibatch size is 16) for 160k iterations, with a learning rate of 0.02 which is decreased by 10 at the 120k iteration. We use a weight decay of 0.0001 and momentum of 0.9. With ResNeXt, we train with 1 image per GPU and the same number of iterations, with a starting learning rate of 0.01."
I’m trying to write an optimizer and learning rate scheduler in PyTorch for a similar application, to match this description.
For the optimizer I have:
    import torch

    def get_Mask_RCNN_Optimizer(model, learning_rate=0.02):
        # SGD with the momentum and weight decay quoted from the paper
        optimizer = torch.optim.SGD(
            model.parameters(),
            lr=learning_rate,
            momentum=0.9,
            weight_decay=0.0001,
        )
        return optimizer
For the learning rate scheduler I have:
    def get_MASK_RCNN_LR_Scheduler(optimizer, step_size):
        # gamma=0.1 multiplies the learning rate by 0.1 every step_size scheduler steps
        scheduler = torch.optim.lr_scheduler.StepLR(
            optimizer,
            step_size=step_size,
            gamma=0.1,
            verbose=True,
        )
        return scheduler
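Since the paper counts iterations rather than epochs, I assume the scheduler would have to be stepped once per iteration, with step_size set to the iteration at which the drop happens. Here is a minimal sketch of what I mean, scaled down so it runs quickly; `lr_trace`, the stand-in `Linear` model, and the tiny step_size are all hypothetical and only there to show when the learning rate changes:

```python
import torch


def lr_trace(num_iters, step_size, base_lr=0.02):
    """Record the learning rate used at each iteration (illustration only)."""
    model = torch.nn.Linear(1, 1)  # stand-in model, not Mask R-CNN
    optimizer = torch.optim.SGD(
        model.parameters(), lr=base_lr, momentum=0.9, weight_decay=0.0001
    )
    scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=step_size, gamma=0.1
    )
    lrs = []
    for _ in range(num_iters):
        lrs.append(optimizer.param_groups[0]["lr"])
        # forward/backward pass omitted in this sketch
        optimizer.step()
        scheduler.step()  # called once per iteration, not once per epoch
    return lrs


# Scaled-down stand-in for "160k iterations, drop at 120k"
print(lr_trace(num_iters=5, step_size=3))
```

With step_size=120000 and 160k total iterations, StepLR would only trigger once, because its second drop would not come until iteration 240k.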
When the authors say “decreased by 10”, do they mean divide by 10? Or do they literally mean subtract 10, in which case the learning rate would be negative, which seems odd/wrong? Any insights appreciated.
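To make the two readings concrete, here is the arithmetic as a small sketch. The function name `lr_at_iteration` is made up for illustration, and it assumes the "divide by 10" reading and that the drop applies from iteration 120000 onward (the paper does not spell out the exact boundary):

```python
def lr_at_iteration(iteration, base_lr=0.02, drop_at=120_000, factor=10.0):
    """Learning rate under the assumed 'divide by 10 at 120k' reading."""
    if iteration >= drop_at:
        return base_lr / factor
    return base_lr


print(lr_at_iteration(0))        # before the drop
print(lr_at_iteration(120_000))  # after the drop, under the divide reading
print(0.02 - 10)                 # the literal 'subtract 10' reading is negative
```

The divide reading corresponds to gamma=0.1 in the scheduler, while the subtract reading gives a negative value that SGD could not sensibly use.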