Updating learning rate after accumulating gradient

Hello, I have an idea of how to update the optimizer's learning rate, but I am not 100% sure of the correct way to do it when accumulating gradients to achieve a larger batch size.

For example, suppose I use the following code to accumulate gradients over mini-batches of size 8 for 4 iterations in order to achieve an effective batch size of 32.

opt.zero_grad()
for i, (input, target) in enumerate(dataset):
    pred = net(input)
    loss = crit(pred, target)
    loss.backward()
    # graph is cleared here
    if (i + 1) % 4 == 0:
        # every 4 iterations of batches of size 8
        opt.step()
        opt.zero_grad()
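
One detail the snippet leaves implicit (and an assumption on my part about crit): if crit averages the loss over its mini-batch, the four accumulated gradients add up to roughly four times the gradient of a true batch of 32, so the loss is often divided by the number of accumulation steps. A minimal sketch of that variant:

accum_steps = 4  # mini-batches of size 8 per virtual batch of 32

opt.zero_grad()
for i, (input, target) in enumerate(dataset):
    pred = net(input)
    # scale the loss so the summed gradients match the mean over the virtual batch
    loss = crit(pred, target) / accum_steps
    loss.backward()
    if (i + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()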

Then, assuming I want to update my learning rate every 200 iterations (i.e. every 200 gradient updates of virtual batch size 32), the code should follow the same principle:

if (i + 1) % (4 * 200) == 0:
    # every 200 optimizer steps, i.e. every 800 mini-batches of size 8
    for param_group in opt.param_groups:
        param_group['lr'] = param_group['lr'] * 0.9

Or is there no need to schedule it that way, and I should instead do the following:

if i % 200 == 0:
    for param_group in opt.param_groups:
        param_group['lr'] = param_group['lr'] * 0.9

Logically it seems like the first one is correct. If the first one is correct, what error would the second one cause? A gradient update with different learning rates within the same virtual batch?

Kind of, yes. The optimizer is agnostic to when you change the learning rate: the batch would be updated with a single learning rate anyway, because the value that counts is the one in effect when you call opt.step().
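
To illustrate, here is a minimal sketch pulling the pieces together, reusing net, crit, dataset and opt from the snippets above and introducing accum_steps, decay_every and step_count as my own illustrative names. It counts optimizer steps explicitly, so the decay happens once every 200 virtual batches, and whatever learning rate is set at the moment opt.step() runs is the one applied:

accum_steps = 4      # mini-batches of 8 per virtual batch of 32
decay_every = 200    # decay once per 200 optimizer steps
step_count = 0       # number of opt.step() calls so far

opt.zero_grad()
for i, (input, target) in enumerate(dataset):
    pred = net(input)
    loss = crit(pred, target)
    loss.backward()
    if (i + 1) % accum_steps == 0:
        opt.step()           # uses the lr currently stored in opt.param_groups
        opt.zero_grad()
        step_count += 1
        if step_count % decay_every == 0:
            for param_group in opt.param_groups:
                param_group['lr'] *= 0.9

If you prefer the built-in schedulers, torch.optim.lr_scheduler.StepLR(opt, step_size=200, gamma=0.9) with scheduler.step() called right after each opt.step() should give the same decay cadence.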