Dynamic batch size learning rate

I have implemented a custom DataLoader batch_sampler to have dynamic batch sizes during training. The first batch has a fixed size, but the rest do not, e.g.:

original_batch_size = 5

  • iteration 1: original_batch_size samples
  • iteration 2: 8 samples
  • iteration 3: 13 samples

Does that mean that I should scale the learning rate every iteration before calling optimizer.step()?

# pseudocode:
prev_lrs = [group["lr"] for group in optimizer.param_groups]
# scale the learning rate depending on the current batch size
# for example, one batch may have 2 samples but another may have 100,
# and we don't want to use the same learning rate in both cases
for group, prev_lr in zip(optimizer.param_groups, prev_lrs):
    group["lr"] = prev_lr / original_batch_size * current_batch_size
# update parameters
optimizer.step()
# restore the original value
for group, prev_lr in zip(optimizer.param_groups, prev_lrs):
    group["lr"] = prev_lr

This would also apply when using a DataLoader with drop_last=False, where the last batch might have fewer samples than the rest of the batches, correct?
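
As a quick toy illustration of that drop_last=False case (made-up dataset and batch size):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10).float())           # 10 samples
loader = DataLoader(dataset, batch_size=4, drop_last=False)
print([batch[0].size(0) for batch in loader])                # [4, 4, 2] -> last batch is smaller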

Usually, the loss is divided by the batch size (i.e. reduction="mean") to have the same effect.

I do not understand how that would have the same effect. It would if all of the batches had the same batch_size (the normal case).

But when every batch can have a different length, you would be using the same learning rate for a loss averaged over 1 sample as for one averaged over 1000 samples. But we want a larger learning rate in the second case, don't we?
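
To make the disagreement concrete, here is a toy sketch (a hypothetical linear model with random data) comparing how the gradient magnitude behaves under the two reductions as the batch size grows:

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
x_small, y_small = torch.randn(2, 3), torch.randn(2, 1)
x_large, y_large = torch.randn(100, 3), torch.randn(100, 1)

def grad_norm(x, y, reduction):
    model.zero_grad()
    nn.MSELoss(reduction=reduction)(model(x), y).backward()
    return model.weight.grad.norm().item()

# With reduction="mean" the gradient is an average, so its scale is roughly
# independent of the batch size; with reduction="sum" it grows with the batch.
print(grad_norm(x_small, y_small, "mean"), grad_norm(x_large, y_large, "mean"))
print(grad_norm(x_small, y_small, "sum"),  grad_norm(x_large, y_large, "sum"))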

For both batches you'll get an unbiased estimate of the gradient direction over the whole dataset. Estimates from small batches will have higher variance, but I don't think this is a problem with realistic batch sizes, since, if I'm not mistaken, the sampling error shrinks as 1/sqrt(batch_size).

There is also shuffling, which smooths out the steps taken over the long term.
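
If it helps, the 1/sqrt(batch_size) behaviour is easy to check numerically with a toy sketch (synthetic data, not the setup from the question):

import torch

torch.manual_seed(0)
data = torch.randn(10_000)      # toy "dataset": the per-sample gradient is just the sample itself
true_mean = data.mean()

for batch_size in (4, 16, 64, 256):
    # draw many mini-batch estimates of the full-data mean and measure their spread
    estimates = torch.stack([
        data[torch.randint(0, len(data), (batch_size,))].mean()
        for _ in range(2_000)
    ])
    std = (estimates - true_mean).square().mean().sqrt()
    print(f"batch_size={batch_size:4d}  std of estimate={std:.4f}")   # roughly halves when batch_size quadruples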
