I was playing around with the MNIST example and noticed that using batches that are TOO big (10,000 images per batch) seems to hurt accuracy scores.
This got me thinking: what general rules of thumb have you guys found when it comes to batch sizes?
Also, what is your favorite optimizer and activation function?
As a rule of thumb, the batch size is usually set somewhere between 64 and 256.
The batch size does have an effect on the final test accuracy. One way to think about it is that smaller batches mean more parameter updates per epoch. Each of these updates is inherently noisier, because the loss is computed over a smaller subset of the data. However, this noise seems to help the model generalize.
Refer to https://arxiv.org/abs/1609.04836 and https://arxiv.org/abs/1703.04933 for two recent studies on this subject.
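To make the two effects above concrete (more updates per epoch, and noisier updates, with smaller batches), here is a rough sketch. It is not taken from the MNIST example or from either paper. The function names are made up for illustration, and the per-example "gradients" are just synthetic Gaussian numbers standing in for real ones. The only real figure is the 60,000 images in the MNIST training set.

```python
import math

import numpy as np

NUM_TRAIN = 60_000  # size of the MNIST training set


def updates_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of parameter updates in one pass over the data."""
    return math.ceil(num_examples / batch_size)


def minibatch_gradient_std(per_example_grads: np.ndarray, batch_size: int,
                           num_batches: int = 200, seed: int = 0) -> float:
    """Spread of the mini-batch mean gradient across randomly drawn batches.

    Hypothetical helper: batches are sampled with replacement, and the
    standard deviation of their means is used as a simple noise measure.
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(per_example_grads), size=(num_batches, batch_size))
    batch_means = per_example_grads[idx].mean(axis=1)
    return float(batch_means.std())


# ASSUMPTION: synthetic scalar "gradients" (mean 0.5, std 1.0), not real ones.
fake_grads = np.random.default_rng(42).normal(loc=0.5, scale=1.0, size=NUM_TRAIN)

for batch_size in (64, 256, 10_000):
    n_updates = updates_per_epoch(NUM_TRAIN, batch_size)
    noise = minibatch_gradient_std(fake_grads, batch_size)
    print(f"batch={batch_size:>6}  updates/epoch={n_updates:>4}  grad noise={noise:.4f}")
```

With 10,000 images per batch you only get 6 updates per epoch, versus 938 at a batch size of 64, and the noise in each update shrinks roughly as 1/sqrt(batch size). Whether that noise actually helps generalization is what the two papers look into; this toy only shows that the noise is there.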
Thank you for the papers! I was trying to minimize noise through massive batches. The papers you provided are a great explanation of why massive batches may not work.