Hello everyone,
I am learning how the SGD optimizer trains LeNet and AlexNet. I found that during optimizer.step(), the parameters are updated by -1 * lr * grad, rather than -1 * lr * grad / batch_size (in functional.py: param.add_(d_p, alpha=-lr)). As a result, when I increase the batch_size from e.g. 100 to 1000, I have to decrease the learning_rate by the same factor.
To make the learning_rate less sensitive to changes in batch_size, I manually add the following code before optimizer.step():
for p in optimizer.param_groups[0]['params']:
    if p.grad is not None:  # skip params that received no gradient
        p.grad = p.grad / batch_size
optimizer.step()
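To illustrate the scaling I mean, here is a minimal pure-Python sketch (no PyTorch, a toy one-weight model): whether the gradient already carries the 1/batch_size factor depends on how the loss is reduced over the batch (mean vs. sum), not on optimizer.step() itself.

```python
# Toy model: single weight w, per-sample loss (w*x - y)^2.
# Shows how the SGD update scale depends on the loss reduction.

def grad_sum(w, xs, ys):
    # Gradient of the SUM of per-sample losses: grows with batch size.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))

def grad_mean(w, xs, ys):
    # Gradient of the MEAN of per-sample losses: batch-size independent.
    return grad_sum(w, xs, ys) / len(xs)

w, lr = 0.0, 0.1
small = [(1.0, 2.0)] * 10    # batch of 10 identical samples
large = [(1.0, 2.0)] * 100   # batch of 100 identical samples

# Sum reduction: the update magnitude scales 10x with the batch.
print(lr * grad_sum(w, *zip(*small)))   # -4.0
print(lr * grad_sum(w, *zip(*large)))   # -40.0

# Mean reduction: the update magnitude is the same for both batches.
print(lr * grad_mean(w, *zip(*small)))  # -0.4
print(lr * grad_mean(w, *zip(*large)))  # -0.4
```

So if the loss were averaged over the batch (as reduction='mean' does in PyTorch loss functions), the gradients would already include the 1/batch_size factor.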
Has anyone run into the same confusion? Is dividing the gradients like this the correct way to train the network with this optimizer?
Thanks,
Fan