SGD learning rate decay by batch or epoch

I was wondering whether it is better to decay the learning rate based on the number of batches rather than the number of epochs, especially when working with datasets of different sizes. Any insights or good practices on this?

As is almost always the case, I think it depends on your problem.
You can find some intuition in the CS231n notes on this topic, but they are also fairly vague.
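One practical point worth noting: if you decay per batch with a fixed decay factor, the effective schedule depends on the dataset size, since a larger dataset means more batches per epoch and therefore more decay steps. A minimal sketch (plain Python, with hypothetical function names and example hyperparameters) of how normalizing the per-batch decay factor by the number of batches per epoch makes the two schedules agree at epoch boundaries:

```python
def per_epoch_lr(epoch, base_lr=0.1, decay=0.9):
    # Learning rate after `epoch` per-epoch decay steps.
    return base_lr * decay ** epoch

def per_batch_lr(global_batch, batches_per_epoch, base_lr=0.1, decay=0.9):
    # Spread one epoch's worth of decay evenly over its batches,
    # so the schedule is independent of dataset size at epoch boundaries.
    per_batch_decay = decay ** (1.0 / batches_per_epoch)
    return base_lr * per_batch_decay ** global_batch

# After one full epoch, both schedules give the same learning rate,
# whether an epoch is 100 batches or 1000 batches:
small = per_batch_lr(100, batches_per_epoch=100)    # → 0.09
large = per_batch_lr(1000, batches_per_epoch=1000)  # → 0.09
target = per_epoch_lr(1)                            # → 0.09
```

The per-batch variant additionally gives you a smooth decay within each epoch rather than a step change at epoch boundaries, which some people prefer for stability.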
