I am training a model with two Adam optimizer configurations (in separate runs), but one of them performs noticeably better than the other. The model I am using can be found here Model, and the first optimizer, which is the one used by the repository of the above model, is

```
optim = torch.optim.Adam(
    [
        {'params': get_parameters(model, bias=False)},
        {'params': get_parameters(model, bias=True), 'lr': cfg['lr'] * 2, 'weight_decay': 0},
    ],
    lr=cfg['lr'],
    weight_decay=cfg['weight_decay'])
```

The get_parameters function can be found here get_parameters.
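For readers who don't want to follow the link: a helper like this typically splits a model's parameters into weights and biases by name, so the two groups can receive different `lr` / `weight_decay` values. The sketch below is a hypothetical, torch-free stand-in for illustration only (the real `get_parameters` in the repository may differ); the names and structure here are assumptions.

```python
def get_parameters(named_params, bias=False):
    """Yield parameters whose name ends with 'bias' when bias=True,
    and all remaining parameters when bias=False."""
    for name, p in named_params:
        if name.endswith('bias') == bias:
            yield p

# Stand-in for model.named_parameters(): (name, parameter) pairs.
params = [('conv1.weight', 'W1'), ('conv1.bias', 'b1'),
          ('fc.weight', 'W2'), ('fc.bias', 'b2')]

print(list(get_parameters(params, bias=True)))   # ['b1', 'b2']
print(list(get_parameters(params, bias=False)))  # ['W1', 'W2']
```

With a real model you would pass `model.named_parameters()` instead of the toy list.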

The configurations dictionary is

```
configurations = {
    1: dict(
        max_iteration=1000000,
        lr=1e-4,
        momentum=0.9,
        weight_decay=0.0,
        gamma=0.25,
        step_size=32300,  # "lr_policy: step"
        interval_validate=1000,
    ),
}
cfg = configurations[1]
```

The above optimizer reaches 28.6 dB in just 50k iterations. I then used the more common Adam optimizer definition

```
optimizer = torch.optim.Adam(model.parameters(),
                             lr=args.lr,
                             weight_decay=args.weight_decay)
```

The learning rate I used is `1e-4` and the weight decay is `1e-4`. This optimizer reaches only about 27.5 dB even after 100k iterations. The rest of the code is identical for both optimizers. Why does this optimizer perform worse than the previous one?
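To make the comparison concrete, here are the effective per-group hyperparameters of the two runs, taken directly from the snippets above (first run: `lr=1e-4`, `weight_decay=0.0`, with biases at 2x the learning rate and no decay; second run: `lr=1e-4`, `weight_decay=1e-4` applied uniformly). This is just a side-by-side restatement of the configurations, not new information:

```python
lr = 1e-4

# First optimizer: two parameter groups, as in the repository's setup.
first = [
    {'group': 'weights', 'lr': lr,     'weight_decay': 0.0},
    {'group': 'biases',  'lr': lr * 2, 'weight_decay': 0.0},
]

# Second optimizer: a single group covering all parameters.
second = [
    {'group': 'all params', 'lr': lr, 'weight_decay': 1e-4},
]

for g in first + second:
    print(g)
```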