Grad values update in optimizer.step()

Hello everyone,

I am learning the SGD optimizer by training LeNet and AlexNet. I found out that during optimizer.step(), the params are updated by -1 * lr * grads, instead of -1 * lr * grads / batch_size (in param.add_(d_p, alpha=-lr)). This way, when I increase the batch_size from e.g. 100 to 1000, I need to decrease the learning_rate by the same factor.

To keep the learning_rate less sensitive to changes in batch_size, I manually add the following code before optimizer.step():

    for p in optimizer.param_groups[0]['params']:
        p.grad = p.grad / batch_size

Has anyone experienced the same confusion? Or is this the correct way to train the network with this optimizer?
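For reference, here is a minimal sketch of what optimizer.step() does for plain SGD (no momentum, no weight decay) — the update really is p ← p − lr · p.grad, with no division by batch size inside step(); the values below are made up for illustration:

```python
import torch

# Toy parameter with a manually set gradient
p = torch.tensor([1.0, 2.0], requires_grad=True)
p.grad = torch.tensor([0.5, -0.5])

lr = 0.1
opt = torch.optim.SGD([p], lr=lr)

# What step() should compute: p <- p - lr * p.grad
expected = p.detach() - lr * p.grad

opt.step()
assert torch.allclose(p.detach(), expected)  # no batch_size anywhere
```

Any averaging over the batch therefore has to happen in the loss itself, before backward().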


Normally, the loss magnitude doesn't increase with batch size because of reduction='mean', so this adjustment is neither needed nor correct.
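This is easy to check on a toy model: with reduction='mean' (the default) the gradient stays the same as the batch grows, while with reduction='sum' it scales with batch size. The linear model and constant data below are purely illustrative:

```python
import torch
import torch.nn as nn

def grad_norm(batch_size, reduction):
    # Same weights every time so batch size is the only variable
    model = nn.Linear(10, 1)
    nn.init.constant_(model.weight, 0.1)
    nn.init.zeros_(model.bias)
    x = torch.ones(batch_size, 10)
    y = torch.zeros(batch_size, 1)
    loss = nn.MSELoss(reduction=reduction)(model(x), y)
    loss.backward()
    return model.weight.grad.norm().item()

g_mean_100 = grad_norm(100, 'mean')
g_mean_1000 = grad_norm(1000, 'mean')
g_sum_100 = grad_norm(100, 'sum')
g_sum_1000 = grad_norm(1000, 'sum')

# 'mean': gradient is independent of batch size
assert abs(g_mean_100 - g_mean_1000) < 1e-3
# 'sum': gradient grows ~10x for a 10x larger batch
assert abs(g_sum_1000 / g_sum_100 - 10.0) < 1e-2
```

So with the default reduction, the gradient is already an average over the batch, and dividing by batch_size again would shrink it a second time.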

@ffan1980 As mentioned by @googlebot, think of it this way: you are dividing the grads by a constant. The only thing you are doing is effectively using a lower learning rate.

If you want this operation to add value, you should scale by some dynamic quantity derived from the data/grads rather than a constant.
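The equivalence is straightforward to verify: dividing every gradient by a constant k before step() produces exactly the same update as using learning rate lr/k. The constants below are arbitrary:

```python
import torch

k, lr = 10.0, 0.5
grad = torch.tensor([2.0, -4.0])

# Variant A: scale the grads by 1/k, keep lr
p_a = torch.tensor([1.0, 1.0], requires_grad=True)
p_a.grad = grad.clone() / k
torch.optim.SGD([p_a], lr=lr).step()

# Variant B: keep the grads, use lr/k instead
p_b = torch.tensor([1.0, 1.0], requires_grad=True)
p_b.grad = grad.clone()
torch.optim.SGD([p_b], lr=lr / k).step()

# Both variants land on the same parameters
assert torch.allclose(p_a.detach(), p_b.detach())
```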

@googlebot @anantguptadbl Thank you for the replies.