Hello everyone,
I am learning how the SGD optimizer trains LeNet and AlexNet. I found that during optimizer.step(), the parameters are updated by -1 * lr * grad, rather than -1 * lr * grad / batch_size (in functional.py: param.add_(d_p, alpha=-lr)). As a result, when I increase the batch_size from e.g. 100 to 1000, I have to decrease the learning_rate by the same factor.
To make the learning_rate less sensitive to changes in batch_size, I manually add the following code before optimizer.step():
for p in optimizer.param_groups[0]['params']:
    if p.grad is not None:  # skip params that received no gradient
        p.grad = p.grad / batch_size
optimizer.step()
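To illustrate the scaling I mean, here is a minimal pure-Python sketch (no PyTorch, a toy one-weight model): whether the gradient already carries the 1/batch_size factor depends on how the loss is reduced over the batch (mean vs. sum), not on optimizer.step() itself.

```python
# Toy model: single weight w, per-sample loss (w*x - y)^2.
# Shows how the SGD update scale depends on the loss reduction.

def grad_sum(w, xs, ys):
    # Gradient of the SUM of per-sample losses: grows with batch size.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))

def grad_mean(w, xs, ys):
    # Gradient of the MEAN of per-sample losses: batch-size independent.
    return grad_sum(w, xs, ys) / len(xs)

w, lr = 0.0, 0.1
small = [(1.0, 2.0)] * 10    # batch of 10 identical samples
large = [(1.0, 2.0)] * 100   # batch of 100 identical samples

# Sum reduction: the update magnitude scales 10x with the batch.
print(lr * grad_sum(w, *zip(*small)))   # -4.0
print(lr * grad_sum(w, *zip(*large)))   # -40.0

# Mean reduction: the update magnitude is the same for both batches.
print(lr * grad_mean(w, *zip(*small)))  # -0.4
print(lr * grad_mean(w, *zip(*large)))  # -0.4
```

So if the loss were averaged over the batch (as reduction='mean' does in PyTorch loss functions), the gradients would already include the 1/batch_size factor.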
Has anyone run into the same confusion? Is dividing the gradients like this the correct way to train the network with this optimizer?
Thanks,
Fan