Adam optimizer performance difference

I am training a model with two Adam optimizers (in separate code), but one of them performs noticeably better than the other. The model I am using can be found here: Model. The first optimizer, the one used by the repository of the above model, is

optim = torch.optim.Adam(
    [
        {'params': get_parameters(model, bias=False)},
        {'params': get_parameters(model, bias=True), 'lr': cfg['lr'] * 2, 'weight_decay': 0},
    ],
    lr=cfg['lr'],
    weight_decay=cfg['weight_decay'])

The get_parameters function can be found here: get_parameters.
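
For context, a get_parameters helper of this kind usually yields either the bias parameters or the non-bias weights, so that the two groups can be given different hyperparameters. A minimal sketch of what it might look like (my assumption; the actual implementation is behind the link above):

import torch.nn as nn

def get_parameters(model, bias=False):
    # Sketch only: yield bias parameters when bias=True, weights otherwise.
    # The real helper is defined in the linked repository.
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
            if bias:
                if module.bias is not None:
                    yield module.bias
            else:
                yield module.weight
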
The configurations dictionary is

configurations = {
    1: dict(
        max_iteration=1000000,
        lr=1e-4,
        momentum=0.9,
        weight_decay=0.0,
        gamma=0.25,
        step_size=32300, # "lr_policy: step"
        interval_validate=1000,
    ),
}
cfg = configurations[1]
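
The gamma and step_size entries ("lr_policy: step") suggest the repository also decays the learning rate in steps. I am not certain how the repo wires this up, but with PyTorch's built-in scheduler it would look roughly like this:

from torch.optim.lr_scheduler import StepLR

# Step decay implied by cfg['step_size'] and cfg['gamma']; how the repo
# actually applies it is my assumption, not copied from its code.
scheduler = StepLR(optim, step_size=cfg['step_size'], gamma=cfg['gamma'])

for iteration in range(cfg['max_iteration']):
    # ... forward pass, loss.backward(), optim.step(), optim.zero_grad() ...
    scheduler.step()  # stepped per iteration, matching the iteration-based config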

The above optimizer achieves a performance of 28.6 dB in just 50k iterations. I then used the more common Adam optimizer definition

optimizer = torch.optim.Adam(model.parameters(),
                             lr=args.lr,
                             weight_decay=args.weight_decay)

The learning rate I used is 1e-4 and the weight decay is 1e-4. This optimizer reaches only about 27.5 dB even after 100k iterations. The rest of the code is identical for both optimizers. So why is this optimizer worse than the previous one?
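
For reference, the hyperparameters that each setup actually applies can be printed per parameter group (assuming optim and optimizer are constructed exactly as in the snippets above); this makes the configuration differences between the two runs explicit:

for name, opt in [('per-group Adam', optim), ('plain Adam', optimizer)]:
    # Inspect lr and weight_decay of every parameter group each optimizer manages.
    for i, group in enumerate(opt.param_groups):
        print(f"{name} group {i}: lr={group['lr']}, "
              f"weight_decay={group['weight_decay']}, "
              f"num_params={len(group['params'])}")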